CUDA CUBLAS Library
PG-05326-032_V02
August, 2010
NVIDIA Corporation
CUBLAS Library PG-05326-032_V02
Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS,
LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING
PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED
WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no
responsibility for the consequences of use of such information or for any infringement of patents or other
rights of third parties that may result from its use. No license is granted by implication or otherwise under
any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are
subject to change without notice. This publication supersedes and replaces all information previously
supplied. NVIDIA Corporation products are not authorized for use as critical components in life support
devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation
in the United States and other countries. Other company and product names may be trademarks of the
respective companies with which they are associated.
Copyright
2005-2010 by NVIDIA Corporation. All rights reserved.
Portions of the SGEMM, DGEMM, CGEMM, and ZGEMM library routines were written by Vasily Volkov
and are subject to the Modified Berkeley Software Distribution License as follows:
Copyright (c) 2007-2009, Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the
following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the
following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the University of California, Berkeley nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Portions of the SGEMM, DGEMM and ZGEMM library routines were written by Davide Barbieri and are
subject to the Modified Berkeley Software Distribution License as follows:
Copyright (c) 2008-2009 Davide Barbieri @ University of Rome Tor Vergata.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the
following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the
following disclaimer in the documentation and/or other materials provided with the distribution.
The name of the author may not be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Portions of the DGEMM and SGEMM library routines optimized for the Fermi architecture were
developed by the University of Tennessee. Subsequently, several other routines that are optimized for the
Fermi architecture have been derived from these initial DGEMM and SGEMM implementations. Those
portions of the source code are thus subject to the Modified Berkeley Software Distribution License as
follows:
Copyright (c) 2010 The University of Tennessee. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided
that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the
following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the
following disclaimer listed in this license in the documentation and/or other materials provided with
the distribution.
Neither the name of the copyright holders nor the names of its contributors may be used to endorse or
promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Table of Contents
1. The CUBLAS Library
  CUBLAS Types
    Type cublasStatus
  CUBLAS Helper Functions
    Function cublasInit()
    Function cublasShutdown()
    Function cublasGetError()
    Function cublasAlloc()
    Function cublasFree()
    Function cublasSetVector()
    Function cublasGetVector()
    Function cublasSetMatrix()
    Function cublasGetMatrix()
    Function cublasSetKernelStream()
    Function cublasSetVectorAsync()
    Function cublasGetVectorAsync()
    Function cublasSetMatrixAsync()
    Function cublasGetMatrixAsync()
2. BLAS1 Functions
  Single-Precision BLAS1 Functions
    Function cublasIsamax()
    Function cublasIsamin()
    Function cublasSasum()
    Function cublasSaxpy()
    Function cublasScopy()
    Function cublasSdot()
    Function cublasSnrm2()
    Function cublasSrot()
    Function cublasSrotg()
    Function cublasSrotm()
    Function cublasSrotmg()
    Function cublasSscal()
    Function cublasSswap()
  Single-Precision Complex BLAS1 Functions
    Function cublasCaxpy()
    Function cublasCcopy()
    Function cublasCdotc()
    Function cublasCdotu()
    Function cublasCrot()
    Function cublasCrotg()
    Function cublasCscal()
    Function cublasCsrot()
    Function cublasCsscal()
    Function cublasCswap()
    Function cublasIcamax()
    Function cublasIcamin()
    Function cublasScasum()
    Function cublasScnrm2()
  Double-Precision BLAS1 Functions
    Function cublasIdamax()
    Function cublasIdamin()
    Function cublasDasum()
    Function cublasDaxpy()
    Function cublasDcopy()
    Function cublasDdot()
    Function cublasDnrm2()
    Function cublasDrot()
    Function cublasDrotg()
    Function cublasDrotm()
    Function cublasDrotmg()
    Function cublasDscal()
    Function cublasDswap()
  Double-Precision Complex BLAS1 functions
    Function cublasDzasum()
    Function cublasDznrm2()
    Function cublasIzamax()
    Function cublasIzamin()
    Function cublasZaxpy()
    Function cublasZcopy()
    Function cublasZdotc()
    Function cublasZdotu()
    Function cublasZdrot()
    Function cublasZdscal()
    Function cublasZrot()
    Function cublasZrotg()
    Function cublasZscal()
    Function cublasZswap()
3. Single-Precision BLAS2 Functions
  Single-Precision BLAS2 Functions
    Function cublasSgbmv()
    Function cublasSgemv()
    Function cublasSger()
    Function cublasSsbmv()
    Function cublasSspmv()
    Function cublasSspr()
    Function cublasSspr2()
    Function cublasSsymv()
    Function cublasSsyr()
    Function cublasSsyr2()
    Function cublasStbmv()
    Function cublasStbsv()
    Function cublasStpmv()
    Function cublasStpsv()
    Function cublasStrmv()
    Function cublasStrsv()
  Single-Precision Complex BLAS2 Functions
    Function cublasCgbmv()
    Function cublasCgemv()
    Function cublasCgerc()
    Function cublasCgeru()
    Function cublasChbmv()
    Function cublasChemv()
    Function cublasCher()
    Function cublasCher2()
    Function cublasChpmv()
    Function cublasChpr()
    Function cublasChpr2()
    Function cublasCtbmv()
    Function cublasCtbsv()
    Function cublasCtpmv()
    Function cublasCtpsv()
    Function cublasCtrmv()
    Function cublasCtrsv()
4. Double-Precision BLAS2 Functions
  Double-Precision BLAS2 Functions
    Function cublasDgbmv()
    Function cublasDgemv()
    Function cublasDger()
    Function cublasDsbmv()
    Function cublasDspmv()
    Function cublasDspr()
    Function cublasDspr2()
    Function cublasDsymv()
    Function cublasDsyr()
    Function cublasDsyr2()
    Function cublasDtbmv()
    Function cublasDtbsv()
    Function cublasDtpmv()
    Function cublasDtpsv()
    Function cublasDtrmv()
    Function cublasDtrsv()
  Double-Precision Complex BLAS2 functions
    Function cublasZgbmv()
    Function cublasZgemv()
    Function cublasZgerc()
    Function cublasZgeru()
    Function cublasZhbmv()
    Function cublasZhemv()
    Function cublasZher()
    Function cublasZher2()
    Function cublasZhpmv()
    Function cublasZhpr()
    Function cublasZhpr2()
    Function cublasZtbmv()
    Function cublasZtbsv()
    Function cublasZtpmv()
    Function cublasZtpsv()
    Function cublasZtrmv()
    Function cublasZtrsv()
5. BLAS3 Functions
  Single-Precision BLAS3 Functions
    Function cublasSgemm()
    Function cublasSsymm()
    Function cublasSsyrk()
    Function cublasSsyr2k()
    Function cublasStrmm()
    Function cublasStrsm()
  Single-Precision Complex BLAS3 Functions
    Function cublasCgemm()
    Function cublasChemm()
    Function cublasCherk()
    Function cublasCher2k()
    Function cublasCsymm()
    Function cublasCsyrk()
    Function cublasCsyr2k()
    Function cublasCtrmm()
    Function cublasCtrsm()
  Double-Precision BLAS3 Functions
    Function cublasDgemm()
    Function cublasDsymm()
    Function cublasDsyrk()
    Function cublasDsyr2k()
    Function cublasDtrmm()
    Function cublasDtrsm()
  Double-Precision Complex BLAS3 Functions
    Function cublasZgemm()
    Function cublasZhemm()
    Function cublasZherk()
    Function cublasZher2k()
    Function cublasZsymm()
    Function cublasZsyrk()
    Function cublasZsyr2k()
    Function cublasZtrmm()
    Function cublasZtrsm()
A. CUBLAS Fortran Bindings
CHAPTER 1
The CUBLAS Library
CUBLAS is an implementation of BLAS (Basic Linear Algebra
Subprograms) on top of the NVIDIA CUDA runtime. It allows
access to the computational resources of NVIDIA GPUs. The library is
self-contained at the API level, that is, no direct interaction with the
CUDA driver is necessary. CUBLAS attaches to a single GPU and does
not auto-parallelize across multiple GPUs.
The basic model by which applications use the CUBLAS library is to
create matrix and vector objects in GPU memory space, fill them with
data, call a sequence of CUBLAS functions, and, finally, copy the
results from GPU memory space back to the host. To accomplish this,
CUBLAS provides helper functions for creating and destroying objects
in GPU space, and for writing data to and retrieving data from these
objects.
FormaximumcompatibilitywithexistingFortranenvironments,
CUBLASusescolumnmajorstorageand1basedindexing.SinceC
andC++userowmajorstorage,applicationscannotusethenative
arraysemanticsfortwodimensionalarrays.Instead,macrosorinline
functionsshouldbedefinedtoimplementmatricesontopofone
dimensionalarrays.ForFortrancodeportedtoCinmechanical
fashion,onemaychosetoretain1basedindexingtoavoidtheneedto
12 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
transformloops.Inthiscase,thearrayindexofamatrixelementin
rowiandcolumnjcanbecomputedviathefollowingmacro:
Here,ldreferstotheleadingdimensionofthematrixasallocated,
whichinthecaseofcolumnmajorstorageisthenumberofrows.For
nativelywrittenCandC++code,onewouldmostlikelychose0based
indexing,inwhichcasetheindexingmacrobecomes
Pleaserefertothecodeexamplesattheendofthissection,which
showatinyapplicationimplementedinFortranonthehost
(Example 1.Fortran77ApplicationExecutingontheHost)and
showversionsoftheapplicationwritteninCusingCUBLASforthe
indexingstylesdescribedabove(Example 2.ApplicationUsingCand
CUBLAS:1basedIndexingandExample 3.ApplicationUsingC
andCUBLAS:0basedIndexing).
BecausetheCUBLAScorefunctions(asopposedtothehelper
functions)donotreturnerrorstatusdirectly(forreasonsof
compatibilitywithexistingBLASlibraries),CUBLASprovidesa
separatefunctiontoaidindebuggingthatretrievesthelastrecorded
error.
TheinterfacetotheCUBLASlibraryistheheaderfilecublas.h.
ApplicationsusingCUBLASneedtolinkagainsttheDSOcublas.so
(Linux),theDLLcublas.dll(Windows),orthedynamiclibrary
cublas.dylib(MacOSX)whenbuildingforthedevice,andagainst
theDSOcublasemu.so(Linux),theDLLcublasemu.dll(Windows),
orthedynamiclibrarycublasemu.dylib(MacOSX)whenbuilding
fordeviceemulation.
Followingthesethreeexamples,theremainderofthischapter
discussesCUBLASTypesonpage 18andCUBLASHelper
Functionsonpage 19.
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
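As a quick host-only illustration of the two indexing macros, the sketch below fills a small column-major matrix through IDX2F and lets the caller read it back through IDX2C. The helper name fill_column_major and the fill pattern 10*i + j are assumptions made for this illustration; only the two macros come from the text.

```c
/* Index macros as given in the text. */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   /* 1-based (Fortran style) */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))           /* 0-based (C style)       */

/* Hypothetical helper: fill a column-major rows x cols matrix so that
 * element (i,j), counted from 1, holds the value 10*i + j. The leading
 * dimension equals the number of rows because the matrix is packed. */
void fill_column_major(float *a, int rows, int cols)
{
    for (int j = 1; j <= cols; j++) {
        for (int i = 1; i <= rows; i++) {
            a[IDX2F(i, j, rows)] = (float)(10 * i + j);
        }
    }
}
```

For any valid position, IDX2F(i, j, ld) and IDX2C(i-1, j-1, ld) address the same array slot, so the two styles can be mixed as long as each call site is consistent.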
Example 1. Fortran 77 Application Executing on the Host

      subroutine modify (m, ldm, n, p, q, alpha, beta)
      implicit none
      integer ldm, n, p, q
      real*4 m(ldm,*), alpha, beta
      external sscal
c     scale row p, columns q..n, by alpha (stride ldm walks along a row)
      call sscal (n-q+1, alpha, m(p,q), ldm)
c     scale column q, rows p..ldm, by beta (stride 1 walks down a column)
      call sscal (ldm-p+1, beta, m(p,q), 1)
      return
      end

      program matrixmod
      implicit none
      integer M, N
      parameter (M=6, N=5)
      real*4 a(M,N)
      integer i, j
      do j = 1, N
        do i = 1, M
          a(i,j) = (i-1) * M + j
        enddo
      enddo
      call modify (a, M, N, 2, 3, 16.0, 12.0)
      do j = 1, N
        do i = 1, M
          write(*,"(F7.0$)") a(i,j)
        enddo
        write (*,*) ""
      enddo
      stop
      end
Example 2. Application Using C and CUBLAS: 1-based Indexing

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Scale row p (columns q..n) by alpha, then column q (rows p..ldm) by beta. */
void modify (float *m, int ldm, int n, int p, int q, float alpha,
             float beta)
{
    cublasSscal (n-q+1, alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (ldm-p+1, beta, &m[IDX2F(p,q,ldm)], 1);
}

#define M 6
#define N 5

int main (void)
{
    int i, j;
    cublasStatus stat;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (i-1) * M + j;
        }
    }
    cublasInit();
    stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("device memory allocation failed");
        cublasShutdown();
        return EXIT_FAILURE;
    }
    /* copy the host matrix to the device */
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cublasFree (devPtrA);
        cublasShutdown();
        return EXIT_FAILURE;
    }
    modify (devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    /* copy the modified matrix back to the host */
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cublasFree (devPtrA);
        cublasShutdown();
        return EXIT_FAILURE;
    }
    cublasFree (devPtrA);
    cublasShutdown();
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf ("\n");
    }
    return EXIT_SUCCESS;
}
Example 3. Application Using C and CUBLAS: 0-based Indexing

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"

#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* Scale row p (columns q..n-1) by alpha, then column q (rows p..ldm-1) by beta. */
void modify (float *m, int ldm, int n, int p, int q, float alpha,
             float beta)
{
    cublasSscal (n-q, alpha, &m[IDX2C(p,q,ldm)], ldm);
    cublasSscal (ldm-p, beta, &m[IDX2C(p,q,ldm)], 1);
}

#define M 6
#define N 5

int main (void)
{
    int i, j;
    cublasStatus stat;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            a[IDX2C(i,j,M)] = i * M + j + 1;
        }
    }
    cublasInit();
    stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("device memory allocation failed");
        cublasShutdown();
        return EXIT_FAILURE;
    }
    /* copy the host matrix to the device */
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        cublasFree (devPtrA);
        cublasShutdown();
        return EXIT_FAILURE;
    }
    modify (devPtrA, M, N, 1, 2, 16.0f, 12.0f);
    /* copy the modified matrix back to the host */
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        cublasFree (devPtrA);
        cublasShutdown();
        return EXIT_FAILURE;
    }
    cublasFree (devPtrA);
    cublasShutdown();
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            printf ("%7.0f", a[IDX2C(i,j,M)]);
        }
        printf ("\n");
    }
    return EXIT_SUCCESS;
}
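To sanity-check results without a GPU, the intent of the modify() routine in the examples can be reproduced on the host. The following is a minimal sketch, assuming 0-based indexing and the reading that modify() scales the part of row p from column q to the last column by alpha, and then the part of column q from row p to the last row by beta. The names ref_sscal and ref_modify are hypothetical and not part of CUBLAS.

```c
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

/* Host stand-in for a strided scale: x[k*incx] *= alpha for k = 0..n-1.
 * Does nothing when n <= 0 or incx <= 0. */
void ref_sscal(int n, float alpha, float *x, int incx)
{
    if (n <= 0 || incx <= 0) return;
    for (int k = 0; k < n; k++) {
        x[k * incx] *= alpha;
    }
}

/* Host version of the two scalings, on a column-major ldm x n matrix. */
void ref_modify(float *m, int ldm, int n, int p, int q,
                float alpha, float beta)
{
    /* row p, columns q..n-1: consecutive row elements are ldm apart */
    ref_sscal(n - q, alpha, &m[IDX2C(p, q, ldm)], ldm);
    /* column q, rows p..ldm-1: consecutive column elements are adjacent */
    ref_sscal(ldm - p, beta, &m[IDX2C(p, q, ldm)], 1);
}
```

Note that element (p,q) is touched by both calls and therefore ends up scaled by alpha * beta.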
CUBLAS Types

The only CUBLAS type is cublasStatus.

Type cublasStatus

The type cublasStatus is used for function status returns. CUBLAS helper functions return status directly, while the status of CUBLAS core functions can be retrieved via cublasGetError(). Currently, the following values are defined:

cublasStatus Values
CUBLAS_STATUS_SUCCESS            operation completed successfully
CUBLAS_STATUS_NOT_INITIALIZED    CUBLAS library not initialized
CUBLAS_STATUS_ALLOC_FAILED       resource allocation failed
CUBLAS_STATUS_INVALID_VALUE      unsupported numerical value was passed to function
CUBLAS_STATUS_ARCH_MISMATCH      function requires an architectural feature absent from the architecture of the device
CUBLAS_STATUS_MAPPING_ERROR      access to GPU memory space failed
CUBLAS_STATUS_EXECUTION_FAILED   GPU program failed to execute
CUBLAS_STATUS_INTERNAL_ERROR     an internal CUBLAS operation failed
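When logging errors it is often convenient to turn a status value into readable text. The sketch below shows one way to do that. So that it compiles without the CUBLAS headers, it uses a stand-in enumeration, demo_status, whose names and numeric values are assumptions for illustration only; real code would include cublas.h and switch on the CUBLAS_STATUS_* constants listed above.

```c
/* Stand-in status codes for illustration only; the numeric values here
 * are arbitrary and are NOT the values defined in cublas.h. */
typedef enum {
    DEMO_STATUS_SUCCESS,
    DEMO_STATUS_NOT_INITIALIZED,
    DEMO_STATUS_ALLOC_FAILED,
    DEMO_STATUS_INVALID_VALUE,
    DEMO_STATUS_ARCH_MISMATCH,
    DEMO_STATUS_MAPPING_ERROR,
    DEMO_STATUS_EXECUTION_FAILED,
    DEMO_STATUS_INTERNAL_ERROR
} demo_status;

/* Map a status code to the description given in the table of values. */
const char *demo_status_string(demo_status s)
{
    switch (s) {
    case DEMO_STATUS_SUCCESS:          return "operation completed successfully";
    case DEMO_STATUS_NOT_INITIALIZED:  return "CUBLAS library not initialized";
    case DEMO_STATUS_ALLOC_FAILED:     return "resource allocation failed";
    case DEMO_STATUS_INVALID_VALUE:    return "unsupported numerical value was passed to function";
    case DEMO_STATUS_ARCH_MISMATCH:    return "required architectural feature absent from device";
    case DEMO_STATUS_MAPPING_ERROR:    return "access to GPU memory space failed";
    case DEMO_STATUS_EXECUTION_FAILED: return "GPU program failed to execute";
    case DEMO_STATUS_INTERNAL_ERROR:   return "an internal CUBLAS operation failed";
    }
    return "unknown status";
}
```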
CUBLAS Helper Functions

The following are the CUBLAS helper functions:
Function cublasInit()
Function cublasShutdown()
Function cublasGetError()
Function cublasAlloc()
Function cublasFree()
Function cublasSetVector()
Function cublasGetVector()
Function cublasSetMatrix()
Function cublasGetMatrix()
Function cublasSetKernelStream()
Function cublasSetVectorAsync()
Function cublasGetVectorAsync()
Function cublasSetMatrixAsync()
Function cublasGetMatrixAsync()

Function cublasInit()
cublasStatus
cublasInit (void)
initializes the CUBLAS library and must be called before any other CUBLAS API function is invoked. It allocates hardware resources necessary for accessing the GPU. It attaches CUBLAS to whatever GPU is currently bound to the host thread from which it was invoked.

Return Values
CUBLAS_STATUS_ALLOC_FAILED   if resources could not be allocated
CUBLAS_STATUS_SUCCESS        if CUBLAS library initialized successfully

Function cublasShutdown()
cublasStatus
cublasShutdown (void)
releases CPU-side resources used by the CUBLAS library. The release of GPU-side resources may be deferred until the application shuts down.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_SUCCESS           CUBLAS library shut down successfully

Function cublasGetError()
cublasStatus
cublasGetError (void)
returns the last error that occurred on invocation of any of the CUBLAS core functions. While the CUBLAS helper functions return status directly, the CUBLAS core functions do not, improving compatibility with those existing environments that do not expect BLAS functions to return status. Reading the error status via cublasGetError() resets the internal error state to CUBLAS_STATUS_SUCCESS.

Function cublasAlloc()
cublasStatus
cublasAlloc (int n, int elemSize, void **devicePtr)
creates an object in GPU memory space capable of holding an array of n elements, where each element requires elemSize bytes of storage. If the function call is successful, a pointer to the object in GPU memory space is placed in devicePtr. Note that this is a device pointer that cannot be dereferenced in host code. Function cublasAlloc() is a wrapper around cudaMalloc(). Device pointers returned by cublasAlloc() can therefore be passed to any CUDA device kernels, not just CUBLAS functions.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if n <= 0 or elemSize <= 0
CUBLAS_STATUS_ALLOC_FAILED      if the object could not be allocated due to lack of resources
CUBLAS_STATUS_SUCCESS           if storage was successfully allocated

Function cublasFree()
cublasStatus
cublasFree (const void *devicePtr)
destroys the object in GPU memory space referenced by devicePtr.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INTERNAL_ERROR    if the object could not be deallocated
CUBLAS_STATUS_SUCCESS           if object was deallocated successfully

Function cublasSetVector()
cublasStatus
cublasSetVector (int n, int elemSize, const void *x,
                 int incx, void *y, int incy)
copies n elements from a vector x in CPU memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. Storage spacing between consecutive elements is incx for the source vector x and incy for the destination vector y. In general, y points to an object, or part of an object, allocated via cublasAlloc(). Column-major format for two-dimensional matrices is assumed throughout CUBLAS. If the vector is part of a matrix, a vector increment equal to 1 accesses a (partial) column of the matrix. Similarly, using an increment equal to the leading dimension of the matrix accesses a (partial) row.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if incx, incy, or elemSize <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasGetVector()
cublasStatus
cublasGetVector (int n, int elemSize, const void *x,
                 int incx, void *y, int incy)
copies n elements from a vector x in GPU memory space to a vector y in CPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. Storage spacing between consecutive elements is incx for the source vector x and incy for the destination vector y. In general, x points to an object, or part of an object, allocated via cublasAlloc(). Column-major format for two-dimensional matrices is assumed throughout CUBLAS. If the vector is part of a matrix, a vector increment equal to 1 accesses a (partial) column of the matrix. Similarly, using an increment equal to the leading dimension of the matrix accesses a (partial) row.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if incx, incy, or elemSize <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully
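The increment rules for cublasSetVector() and cublasGetVector() (an increment of 1 walks down a column, an increment equal to the leading dimension walks along a row) can be modelled entirely on the host. The following is a minimal sketch of that addressing only: the function name strided_copy and its integer return convention are assumptions for illustration, and no GPU transfer is involved.

```c
#include <string.h>

/* Host model of the element addressing: copy n elements of elemSize
 * bytes each, stepping incx elements through x and incy elements
 * through y. Returns 0 on success, -1 for non-positive increments or
 * element size (the arguments the return-value tables list as invalid). */
int strided_copy(int n, int elemSize, const void *x, int incx,
                 void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;

    if (incx <= 0 || incy <= 0 || elemSize <= 0) return -1;
    for (int k = 0; k < n; k++) {
        memcpy(dst + (size_t)k * incy * elemSize,
               src + (size_t)k * incx * elemSize,
               (size_t)elemSize);
    }
    return 0;
}
```

For a column-major matrix with leading dimension ld, passing the address of element (i, 0) with incx = ld extracts row i, while passing the address of element (0, j) with incx = 1 extracts column j.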

Function cublasSetMatrix()
cublasStatus
cublasSetMatrix (int rows, int cols, int elemSize,
                 const void *A, int lda, void *B,
                 int ldb)
copies a tile of rows×cols elements from a matrix A in CPU memory space to a matrix B in GPU memory space. Each element requires storage of elemSize bytes. Both matrices are assumed to be stored in column-major format, with the leading dimension (that is, the number of rows) of source matrix A provided in lda, and the leading dimension of destination matrix B provided in ldb. B is a device pointer that points to an object, or part of an object, that was allocated in GPU memory space via cublasAlloc().

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if rows or cols < 0; or elemSize, lda, or ldb <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasGetMatrix()
cublasStatus
cublasGetMatrix (int rows, int cols, int elemSize,
                 const void *A, int lda, void *B,
                 int ldb)
copies a tile of rows×cols elements from a matrix A in GPU memory space to a matrix B in CPU memory space. Each element requires storage of elemSize bytes. Both matrices are assumed to be stored in column-major format, with the leading dimension (that is, the number of rows) of source matrix A provided in lda, and the leading dimension of destination matrix B provided in ldb. A is a device pointer that points to an object, or part of an object, that was allocated in GPU memory space via cublasAlloc().

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if rows or cols < 0; or elemSize, lda, or ldb <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasSetKernelStream()
cublasStatus
cublasSetKernelStream (cudaStream_t stream)
sets the CUBLAS stream in which all subsequent CUBLAS kernel launches will run.
By default, if the CUBLAS stream is not set, all kernels use the NULL stream. This routine can be used to change the stream between kernel launches and can be used also to set the CUBLAS stream back to NULL.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_SUCCESS           if stream set successfully

Function cublasSetVectorAsync()
cublasStatus
cublasSetVectorAsync (int n, int elemSize, const void *x,
                      int incx, void *y, int incy,
                      cudaStream_t stream)
has the same functionality as cublasSetVector(), but the transfer is done asynchronously within the CUDA stream passed in via parameter.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if incx, incy, or elemSize <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasGetVectorAsync()
cublasStatus
cublasGetVectorAsync (int n, int elemSize, const void *x,
                      int incx, void *y, int incy,
                      cudaStream_t stream)
has the same functionality as cublasGetVector(), but the transfer is done asynchronously within the CUDA stream passed in via parameter.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if incx, incy, or elemSize <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasSetMatrixAsync()
cublasStatus
cublasSetMatrixAsync (int rows, int cols, int elemSize,
                      const void *A, int lda, void *B,
                      int ldb, cudaStream_t stream)
has the same functionality as cublasSetMatrix(), but the transfer is done asynchronously within the CUDA stream passed in via parameter.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if rows or cols < 0; or elemSize, lda, or ldb <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully

Function cublasGetMatrixAsync()
cublasStatus
cublasGetMatrixAsync (int rows, int cols, int elemSize,
                      const void *A, int lda, void *B,
                      int ldb, cudaStream_t stream)
has the same functionality as cublasGetMatrix(), but the transfer is done asynchronously within the CUDA stream passed in via parameter.

Return Values
CUBLAS_STATUS_NOT_INITIALIZED   if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE     if rows, cols, elemSize, lda, or ldb <= 0
CUBLAS_STATUS_MAPPING_ERROR     if error accessing GPU memory
CUBLAS_STATUS_SUCCESS           if operation completed successfully
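
The matrix transfer helpers (cublasSetMatrix(), cublasGetMatrix(), and their asynchronous variants) all copy a rows×cols tile between two column-major matrices that may have different leading dimensions. The following host-only sketch models that addressing. The name tile_copy and its integer return convention are assumptions for illustration, and the copy stays in host memory.

```c
#include <string.h>

/* Host model of a column-major tile copy. Column j of the tile starts
 * j*lda elements into A and j*ldb elements into B; within a column the
 * rows are contiguous, so each column is one memcpy of rows elements.
 * Returns 0 on success, -1 for invalid sizes. */
int tile_copy(int rows, int cols, int elemSize,
              const void *A, int lda, void *B, int ldb)
{
    const unsigned char *src = (const unsigned char *)A;
    unsigned char *dst = (unsigned char *)B;

    if (rows < 0 || cols < 0 || elemSize <= 0 || lda <= 0 || ldb <= 0) {
        return -1;
    }
    for (int j = 0; j < cols; j++) {
        memcpy(dst + (size_t)j * ldb * elemSize,
               src + (size_t)j * lda * elemSize,
               (size_t)rows * elemSize);
    }
    return 0;
}
```

Passing the address of an interior element of A, together with the full leading dimension of A as lda, selects a sub-matrix; this is what "an object, or part of an object" permits for the device pointer in the real functions.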

CHAPTER 2
BLAS1 Functions

Level 1 Basic Linear Algebra Subprograms (BLAS1) are functions that perform scalar, vector, and vector-vector operations. The CUBLAS BLAS1 implementation is described in these sections:
Single-Precision BLAS1 Functions
Single-Precision Complex BLAS1 Functions
Double-Precision BLAS1 Functions
Double-Precision Complex BLAS1 Functions

Single-Precision BLAS1 Functions

The single-precision BLAS1 functions are as follows:
Function cublasIsamax()
Function cublasIsamin()
Function cublasSasum()
Function cublasSaxpy()
Function cublasScopy()
Function cublasSdot()
Function cublasSnrm2()
Function cublasSrot()
Function cublasSrotg()
Function cublasSrotm()
Function cublasSrotmg()
Function cublasSscal()
Function cublasSswap()

Function cublasIsamax()
int
cublasIsamax (int n, const float *x, int incx)
finds the smallest index of the maximum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n      number of elements in input vector
x      single-precision vector with n elements
incx   storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/isamax.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasIsamin()
int
cublasIsamin (int n, const float *x, int incx)
finds the smallest index of the minimum magnitude element of single-precision vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n      number of elements in input vector
x      single-precision vector with n elements
incx   storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/scilib/blass.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasSasum()
float
cublasSasum (int n, const float *x, int incx)
computes the sum of the absolute values of the elements of single-precision vector x; that is, the result is the sum from i = 0 to n-1 of abs(x[1 + i * incx]).

Input
n      number of elements in input vector
x      single-precision vector with n elements
incx   storage spacing between elements of x

Output
returns the single-precision sum of absolute values (returns zero if n <= 0 or incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/sasum.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU
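
The result convention of cublasIsamax() (first occurrence wins, 1-based result, zero for n <= 0 or incx <= 0) is easy to get wrong when checking results in C. The following host-only sketch mirrors that convention for comparison purposes; ref_isamax is a hypothetical name, and the loop is a plain reference, not the GPU implementation.

```c
/* Host reference for the documented result convention: returns the
 * 1-based position of the first element with the largest magnitude,
 * or 0 if n <= 0 or incx <= 0. Element i lives at x[i * incx]. */
int ref_isamax(int n, const float *x, int incx)
{
    int best = 0;
    float best_abs;

    if (n <= 0 || incx <= 0) return 0;
    best_abs = x[0] < 0.0f ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        float v = x[i * incx];
        if (v < 0.0f) v = -v;
        if (v > best_abs) {      /* strict '>' keeps the first maximum */
            best_abs = v;
            best = i;
        }
    }
    return best + 1;             /* convert to 1-based index */
}
```

Subtract 1 from the result before using it as a C array position. Flipping the comparison to '<' gives the analogous reference for cublasIsamin().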

Function cublasSaxpy()
void
cublasSaxpy (int n, float alpha, const float *x,
             int incx, float *y, int incy)
multiplies single-precision vector x by single-precision scalar alpha and adds the result to single-precision vector y; that is, it overwrites single-precision y with single-precision alpha * x + y.
For i = 0 to n-1, it replaces
y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
alpha   single-precision scalar multiplier
x       single-precision vector with n elements
incx    storage spacing between elements of x
y       single-precision vector with n elements
incy    storage spacing between elements of y

Output
y       single-precision result (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/saxpy.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU
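
The start offsets lx and ly in these descriptions use Fortran-style 1-based positions. Subtracting 1 gives the equivalent 0-based C offset: 0 for a non-negative increment and (1 - n) * inc for a negative one, so that a negative increment walks the vector backwards from its last element. The sketch below applies this to a host reference of the saxpy update; start_offset and ref_saxpy are hypothetical names used only for illustration.

```c
/* 0-based start offset of a strided BLAS vector with n elements:
 * the 1-based formula 1 + (1 - n) * inc, minus 1. */
int start_offset(int n, int inc)
{
    return inc >= 0 ? 0 : (1 - n) * inc;
}

/* Host reference: y[ly + i*incy] = alpha * x[lx + i*incx] + y[ly + i*incy]
 * for i = 0..n-1. Leaves y unchanged if n <= 0. */
void ref_saxpy(int n, float alpha, const float *x, int incx,
               float *y, int incy)
{
    int lx, ly;

    if (n <= 0) return;
    lx = start_offset(n, incx);
    ly = start_offset(n, incy);
    for (int i = 0; i < n; i++) {
        y[ly + i * incy] += alpha * x[lx + i * incx];
    }
}
```

With n = 3 and incx = -1 the offset is 2, so the loop reads x[2], x[1], x[0] in that order.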

Function cublasScopy()
void
cublasScopy (int n, const float *x, int incx, float *y,
             int incy)
copies the single-precision vector x to the single-precision vector y. For i = 0 to n-1, it copies
x[lx + i * incx] to y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.

Input
n      number of elements in input vectors
x      single-precision vector with n elements
incx   storage spacing between elements of x
y      single-precision vector with n elements
incy   storage spacing between elements of y

Output
y      contains single-precision vector x

Reference: http://www.netlib.org/blas/scopy.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasSdot()
float
cublasSdot (int n, const float *x, int incx,
            const float *y, int incy)
computes the dot product of two single-precision vectors. It returns the dot product of the single-precision vectors x and y if successful, and 0.0f otherwise. It computes the sum for i = 0 to n-1 of
x[lx + i * incx] * y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.

Input
n      number of elements in input vectors
x      single-precision vector with n elements
incx   storage spacing between elements of x
y      single-precision vector with n elements
incy   storage spacing between elements of y

Output
returns single-precision dot product (returns zero if n <= 0)

Reference: http://www.netlib.org/blas/sdot.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to execute on GPU

Function cublasSnrm2()
float
cublasSnrm2 (int n, const float *x, int incx)
computes the Euclidean norm of the single-precision n-vector x (with storage increment incx). This code uses a multiphase model of accumulation to avoid intermediate underflow and overflow.

Input
n      number of elements in input vector
x      single-precision vector with n elements
incx   storage spacing between elements of x

Output
returns the Euclidean norm (returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/snrm2.f
Reference: http://www.netlib.org/slatec/lin/snrm2.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED       if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasSrot()
void
cublasSrot (int n, float *x, int incx, float *y, int incy,
            float sc, float ss)
multiplies a 2×2 matrix $\begin{bmatrix} sc & ss \\ -ss & sc \end{bmatrix}$ with the 2×n matrix $\begin{bmatrix} x^T \\ y^T \end{bmatrix}$.
The elements of x are in x[lx + i * incx], i = 0 to n-1, where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
y is treated similarly using ly and incy.

Reference: http://www.netlib.org/blas/srot.f
Error status for this function can be retrieved via cublasGetError().

Input
n      number of elements in input vectors
x      single-precision vector with n elements
incx   storage spacing between elements of x
y      single-precision vector with n elements
incy   storage spacing between elements of y
sc     element of rotation matrix
ss     element of rotation matrix

Output
x      rotated vector x (unchanged if n <= 0)
y      rotated vector y (unchanged if n <= 0)

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasSrotg()
void
cublasSrotg (float *host_sa, float *host_sb,
             float *host_sc, float *host_ss)
constructs the Givens transformation
$G = \begin{bmatrix} sc & ss \\ -ss & sc \end{bmatrix}, \quad sc^2 + ss^2 = 1$
which zeros the second entry of the 2-vector $\begin{bmatrix} sa & sb \end{bmatrix}^T$.
The quantity $r = \pm\sqrt{sa^2 + sb^2}$ overwrites sa in storage. The value of sb is overwritten by a value z which allows sc and ss to be recovered by the following algorithm:
if $z = 1$, set $sc = 0.0$ and $ss = 1.0$.
if $|z| < 1$, set $sc = \sqrt{1 - z^2}$ and $ss = z$.
if $|z| > 1$, set $sc = 1/z$ and $ss = \sqrt{1 - sc^2}$.
The function cublasSrot(n, x, incx, y, incy, sc, ss) normally is called next to apply the transformation to a 2×n matrix. Note that this function is provided for completeness and is run exclusively on the host.

Input
sa     single-precision scalar
sb     single-precision scalar

Output
sa     single-precision r
sb     single-precision z
sc     single-precision result
ss     single-precision result

Reference: http://www.netlib.org/blas/srotg.f
This function does not set any error status.

Function cublasSrotm()
void
cublasSrotm (int n, float *x, int incx, float *y,
             int incy, const float *sparam)
applies the modified Givens transformation, h, to the 2×n matrix $\begin{bmatrix} x^T \\ y^T \end{bmatrix}$.
The elements of x are in x[lx + i * incx], i = 0 to n-1, where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
y is treated similarly using ly and incy.
With sparam[0] = sflag, h has one of the following forms:
sflag = -1.0f:  $h = \begin{bmatrix} sh00 & sh01 \\ sh10 & sh11 \end{bmatrix}$
sflag = 1.0f:   $h = \begin{bmatrix} sh00 & 1.0f \\ -1.0f & sh11 \end{bmatrix}$
sflag = 0.0f:   $h = \begin{bmatrix} 1.0f & sh01 \\ sh10 & 1.0f \end{bmatrix}$
sflag = -2.0f:  $h = \begin{bmatrix} 1.0f & 0.0f \\ 0.0f & 1.0f \end{bmatrix}$

Input
n        number of elements in input vectors.
x        single-precision vector with n elements.
incx     storage spacing between elements of x.
y        single-precision vector with n elements.
incy     storage spacing between elements of y.
sparam   5-element vector. sparam[0] is sflag described above. sparam[1] through sparam[4] contain the 2×2 rotation matrix h: sparam[1] contains sh00, sparam[2] contains sh10, sparam[3] contains sh01, and sparam[4] contains sh11.

Output
x        rotated vector x (unchanged if n <= 0)
y        rotated vector y (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/srotm.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU
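
Because the four forms of h selected by sflag leave some entries implicit, it can help to see the modified Givens transformation written out. The sketch below expands the 5-element sparam array into a full 2×2 matrix and applies it to each (x, y) pair on the host. The names expand_sparam and ref_srotm are hypothetical, and the code is a plain reference for checking results, not the GPU implementation.

```c
/* Expand sparam into h[row][col] following the sflag table:
 * sparam[1] = sh00, sparam[2] = sh10, sparam[3] = sh01, sparam[4] = sh11.
 * Returns 0 on success, -1 for an unrecognized flag. */
int expand_sparam(const float *sparam, float h[2][2])
{
    float flag = sparam[0];

    if (flag == -1.0f) {          /* all four entries are stored */
        h[0][0] = sparam[1]; h[0][1] = sparam[3];
        h[1][0] = sparam[2]; h[1][1] = sparam[4];
    } else if (flag == 0.0f) {    /* unit diagonal is implied */
        h[0][0] = 1.0f;      h[0][1] = sparam[3];
        h[1][0] = sparam[2]; h[1][1] = 1.0f;
    } else if (flag == 1.0f) {    /* off-diagonal 1 and -1 are implied */
        h[0][0] = sparam[1]; h[0][1] = 1.0f;
        h[1][0] = -1.0f;     h[1][1] = sparam[4];
    } else if (flag == -2.0f) {   /* identity */
        h[0][0] = 1.0f; h[0][1] = 0.0f;
        h[1][0] = 0.0f; h[1][1] = 1.0f;
    } else {
        return -1;
    }
    return 0;
}

/* Apply h to every column (x_i, y_i) of the 2 x n matrix [x^T; y^T]. */
void ref_srotm(int n, float *x, int incx, float *y, int incy,
               const float *sparam)
{
    float h[2][2];
    int lx, ly;

    if (n <= 0 || expand_sparam(sparam, h) != 0) return;
    lx = incx >= 0 ? 0 : (1 - n) * incx;   /* 0-based start offsets */
    ly = incy >= 0 ? 0 : (1 - n) * incy;
    for (int i = 0; i < n; i++) {
        float xi = x[lx + i * incx];
        float yi = y[ly + i * incy];
        x[lx + i * incx] = h[0][0] * xi + h[0][1] * yi;
        y[ly + i * incy] = h[1][0] * xi + h[1][1] * yi;
    }
}
```

Entries of sparam that are implied by sflag are ignored, so they may hold arbitrary values.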

Function cublasSrotmg()
void
cublasSrotmg (float *host_sd1, float *host_sd2,
              float *host_sx1, const float *host_sy1,
              float *host_sparam)
constructs the modified Givens transformation matrix h which zeros the second component of the 2-vector $(\sqrt{sd1} \cdot sx1, \sqrt{sd2} \cdot sy1)^T$.
With sparam[0] = sflag, h has one of the following forms:
sflag = -1.0f:  $h = \begin{bmatrix} sh00 & sh01 \\ sh10 & sh11 \end{bmatrix}$
sflag = 1.0f:   $h = \begin{bmatrix} sh00 & 1.0f \\ -1.0f & sh11 \end{bmatrix}$
sflag = 0.0f:   $h = \begin{bmatrix} 1.0f & sh01 \\ sh10 & 1.0f \end{bmatrix}$
sflag = -2.0f:  $h = \begin{bmatrix} 1.0f & 0.0f \\ 0.0f & 1.0f \end{bmatrix}$
sparam[1] through sparam[4] contain sh00, sh10, sh01, and sh11, respectively. Values of 1.0f, -1.0f, or 0.0f implied by the value of sflag are not stored in sparam. Note that this function is provided for completeness and is run exclusively on the host.

Input
sd1      single-precision scalar.
sd2      single-precision scalar.
sx1      single-precision scalar.
sy1      single-precision scalar.

Output
sd1      changed to represent the effect of the transformation.
sd2      changed to represent the effect of the transformation.
sx1      changed to represent the effect of the transformation.
sparam   5-element vector. sparam[0] is sflag described above. sparam[1] through sparam[4] contain the 2×2 rotation matrix h: sparam[1] contains sh00, sparam[2] contains sh10, sparam[3] contains sh01, and sparam[4] contains sh11.

Reference: http://www.netlib.org/blas/srotmg.f
This function does not set any error status.

Function cublasSscal()
void
cublasSscal (int n, float alpha, float *x, int incx)
replaces single-precision vector x with single-precision alpha * x. For i = 0 to n-1, it replaces
x[lx + i * incx] with alpha * x[lx + i * incx],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
alpha   single-precision scalar multiplier
x       single-precision vector with n elements
incx    storage spacing between elements of x

Output
x       single-precision result (unchanged if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/sscal.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasSswap()
void
cublasSswap (int n, float *x, int incx, float *y,
             int incy)
interchanges single-precision vector x with single-precision vector y. For i = 0 to n-1, it interchanges
x[lx + i * incx] with y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar manner using incy.

Input
n      number of elements in input vectors
x      single-precision vector with n elements
incx   storage spacing between elements of x
y      single-precision vector with n elements
incy   storage spacing between elements of y

Output
x      single-precision vector y (unchanged from input if n <= 0)
y      single-precision vector x (unchanged from input if n <= 0)

Reference: http://www.netlib.org/blas/sswap.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Single-Precision Complex BLAS1 Functions

The single-precision complex BLAS1 functions are as follows:
Function cublasCaxpy()
Function cublasCcopy()
Function cublasCdotc()
Function cublasCdotu()
Function cublasCrot()
Function cublasCrotg()
Function cublasCscal()
Function cublasCsrot()
Function cublasCsscal()
Function cublasCswap()
Function cublasIcamax()
Function cublasIcamin()
Function cublasScasum()
Function cublasScnrm2()

Function cublasCaxpy()
void
cublasCaxpy (int n, cuComplex alpha, const cuComplex *x,
             int incx, cuComplex *y, int incy)
multiplies single-precision complex vector x by single-precision complex scalar alpha and adds the result to single-precision complex vector y; that is, it overwrites single-precision complex y with single-precision complex alpha * x + y.
For i = 0 to n-1, it replaces
y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
alpha   single-precision complex scalar multiplier
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y

Output
y       single-precision complex result (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/caxpy.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU

Function cublasCcopy()
void
cublasCcopy (int n, const cuComplex *x, int incx,
             cuComplex *y, int incy)
copies the single-precision complex vector x to the single-precision complex vector y.
For i = 0 to n-1, it copies
x[lx + i * incx] to y[ly + i * incy],
where
lx = 1 if incx >= 0, else
lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.

Input
n      number of elements in input vectors
x      single-precision complex vector with n elements
incx   storage spacing between elements of x
y      single-precision complex vector with n elements
incy   storage spacing between elements of y

Output
y      contains single-precision complex vector x

Reference: http://www.netlib.org/blas/ccopy.f
Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED    if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED   if function failed to launch on GPU
Function cublasCdotc()

cuComplex
cublasCdotc (int n, const cuComplex *x, int incx,
             const cuComplex *y, int incy)

computes the dot product of two single-precision complex vectors, the first of which is conjugated. It returns the dot product of the complex conjugate of single-precision complex vector x and the single-precision complex vector y if successful, and complex zero otherwise. For i = 0 to n-1, it sums the products

conj(x[lx + i * incx]) * y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y

Output
returns single-precision complex dot product (zero if n <= 0)

Reference: http://www.netlib.org/blas/cdotc.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
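As a rough illustration of what "the first of which is conjugated" means for cublasCdotc(), the following host-only sketch accumulates conj(x) * y element by element. It is a reference for the arithmetic only, not the GPU implementation; `cplx`, `start_index`, and `ref_cdotc` are hypothetical names, and `cplx` merely stands in for cuComplex.

```c
/* Hypothetical stand-in for cuComplex so the sketch builds without CUDA. */
typedef struct { float re, im; } cplx;

/* 1-based start index: 1 for non-negative increments, else 1 + (1 - n) * inc. */
static int start_index(int n, int inc)
{
    return (inc >= 0) ? 1 : 1 + (1 - n) * inc;
}

/* Sum over i of conj(x[lx + i*incx]) * y[ly + i*incy]; zero if n <= 0. */
static cplx ref_cdotc(int n, const cplx *x, int incx,
                      const cplx *y, int incy)
{
    cplx sum = {0.0f, 0.0f};
    if (n <= 0) return sum;
    int lx = start_index(n, incx);
    int ly = start_index(n, incy);
    for (int i = 0; i < n; ++i) {
        cplx a = x[lx + i * incx - 1];
        cplx b = y[ly + i * incy - 1];
        /* conj(a) * b = (a.re - i*a.im) * (b.re + i*b.im) */
        sum.re += a.re * b.re + a.im * b.im;
        sum.im += a.re * b.im - a.im * b.re;
    }
    return sum;
}
```

For x = (1 + 2i) and y = (3 + 4i), the conjugated product is (1 - 2i)(3 + 4i) = 11 - 2i, whereas the unconjugated product computed by cublasCdotu() would be -5 + 10i.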
Function cublasCdotu()

cuComplex
cublasCdotu (int n, const cuComplex *x, int incx,
             const cuComplex *y, int incy)

computes the dot product of two single-precision complex vectors. It returns the dot product of the single-precision complex vectors x and y if successful, and complex zero otherwise. For i = 0 to n-1, it sums the products

x[lx + i * incx] * y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y

Output
returns single-precision complex dot product (returns zero if n <= 0)

Reference: http://www.netlib.org/blas/cdotu.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Function cublasCrot()

void
cublasCrot (int n, cuComplex *x, int incx, cuComplex *y,
            int incy, float sc, cuComplex cs)

multiplies a 2×2 matrix $\begin{pmatrix} sc & cs \\ -\overline{cs} & sc \end{pmatrix}$ with the 2×n matrix $\begin{pmatrix} x^T \\ y^T \end{pmatrix}$.

The elements of x are in x[lx + i * incx], i = 0 to n-1, where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; y is treated similarly using ly and incy.

Input
n       number of elements in input vectors
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y
sc      single-precision cosine component of rotation matrix
cs      single-precision complex sine component of rotation matrix

Output
x       rotated vector x (unchanged if n <= 0)
y       rotated vector y (unchanged if n <= 0)

Reference: http://netlib.org/lapack/explorehtml/crot.f.html

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Function cublasCrotg()

void
cublasCrotg (cuComplex *host_ca, cuComplex cb,
             float *host_sc, cuComplex *host_cs)

constructs the complex Givens transformation

$G = \begin{pmatrix} sc & cs \\ -\overline{cs} & sc \end{pmatrix}, \quad sc \cdot sc + \overline{cs} \cdot cs = 1$

which zeros the second entry of the complex 2-vector $(ca \;\; cb)^T$.

The quantity $\frac{ca}{|ca|} \cdot \|ca, cb\|$ overwrites ca in storage. In this case,

$\|ca, cb\| = scale \cdot \sqrt{|ca/scale|^2 + |cb/scale|^2}$, where $scale = |ca| + |cb|$.

The function cublasCrot(n, x, incx, y, incy, sc, cs) normally is called next to apply the transformation to a 2×n matrix. Note that this function is provided for completeness and is run exclusively on the host.

Input
ca      single-precision complex scalar
cb      single-precision complex scalar

Output
ca      single-precision complex $\frac{ca}{|ca|} \cdot \|ca, cb\|$
sc      single-precision cosine component of rotation matrix
cs      single-precision complex sine component of rotation matrix

Reference: http://www.netlib.org/blas/crotg.f

This function does not set any error status.
Function cublasCscal()

void
cublasCscal (int n, cuComplex alpha, cuComplex *x,
             int incx)

replaces single-precision complex vector x with single-precision complex alpha * x. For i = 0 to n-1, it replaces

x[lx + i * incx] with alpha * x[lx + i * incx],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
alpha   single-precision complex scalar multiplier
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
x       single-precision complex result (unchanged if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/cscal.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasCsrot()

void
cublasCsrot (int n, cuComplex *x, int incx, cuComplex *y,
             int incy, float sc, float ss)

multiplies a 2×2 matrix $\begin{pmatrix} sc & ss \\ -ss & sc \end{pmatrix}$ with the 2×n matrix $\begin{pmatrix} x^T \\ y^T \end{pmatrix}$.

The elements of x are in x[lx + i * incx], i = 0 to n-1, where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; y is treated similarly using ly and incy.

Input
n       number of elements in input vectors
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y
sc      single-precision cosine component of rotation matrix
ss      single-precision sine component of rotation matrix

Output
x       rotated vector x (unchanged if n <= 0)
y       rotated vector y (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/csrot.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasCsscal()

void
cublasCsscal (int n, float alpha, cuComplex *x, int incx)

replaces single-precision complex vector x with single-precision complex alpha * x. For i = 0 to n-1, it replaces

x[lx + i * incx] with alpha * x[lx + i * incx],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
alpha   single-precision scalar multiplier
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
x       single-precision complex result (unchanged if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/csscal.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasCswap()

void
cublasCswap (int n, cuComplex *x, int incx,
             cuComplex *y, int incy)

interchanges the single-precision complex vector x with the single-precision complex vector y. For i = 0 to n-1, it interchanges

x[lx + i * incx] with y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       single-precision complex vector with n elements
incx    storage spacing between elements of x
y       single-precision complex vector with n elements
incy    storage spacing between elements of y

Output
x       single-precision complex vector y (unchanged from input if n <= 0)
y       single-precision complex vector x (unchanged from input if n <= 0)

Reference: http://www.netlib.org/blas/cswap.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasIcamax()

int
cublasIcamax (int n, const cuComplex *x, int incx)

finds the smallest index of the maximum magnitude element of single-precision complex vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n       number of elements in input vector
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/icamax.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
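The 1-based return convention of cublasIcamax() is easy to get wrong when the result is used to index a C array. The host-only sketch below shows the intended behavior under two stated assumptions: the magnitude of a complex element is taken as |Re| + |Im|, which is what the netlib reference icamax.f uses, and `cplx`, `abs_f`, and `ref_icamax` are hypothetical names that are not part of CUBLAS.

```c
/* Hypothetical stand-in for cuComplex so the sketch builds without CUDA. */
typedef struct { float re, im; } cplx;

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Returns the 1-based position of the first element of maximum magnitude,
   or 0 if n <= 0 or incx <= 0.
   Assumption: magnitude = |Re| + |Im|, as in the netlib reference. */
static int ref_icamax(int n, const cplx *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;
    int best = 1;
    float best_mag = abs_f(x[0].re) + abs_f(x[0].im);
    for (int i = 1; i < n; ++i) {
        cplx v = x[i * incx];          /* x[1 + i*incx] in 1-based terms */
        float mag = abs_f(v.re) + abs_f(v.im);
        if (mag > best_mag) {          /* strict '>' keeps the first maximum */
            best_mag = mag;
            best = i + 1;
        }
    }
    return best;
}
```

A caller that wants a C array offset subtracts 1 from a nonzero result and multiplies by incx; a result of 0 signals that nothing was searched.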
Function cublasIcamin()

int
cublasIcamin (int n, const cuComplex *x, int incx)

finds the smallest index of the minimum magnitude element of single-precision complex vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n       number of elements in input vector
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: Analogous to http://www.netlib.org/blas/icamax.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasScasum()

float
cublasScasum (int n, const cuComplex *x, int incx)

takes the sum of the absolute values of a complex vector and returns a single-precision result. Note that this is not the L1 norm of the vector. The result is the sum from 0 to n-1 of

abs(real(x[lx + i * incx])) + abs(imag(x[lx + i * incx])),

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the single-precision sum of absolute values of real and imaginary parts
(returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/scasum.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasScnrm2()

float
cublasScnrm2 (int n, const cuComplex *x, int incx)

computes the Euclidean norm of single-precision complex n-vector x. This implementation uses simple scaling to avoid intermediate underflow and overflow.

Input
n       number of elements in input vector
x       single-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the Euclidean norm
(returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/scnrm2.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Double-Precision BLAS1 Functions

Note: Double-precision functions are only supported on GPUs with double-precision hardware.

The double-precision BLAS1 functions are as follows:
Function cublasIdamax()
Function cublasIdamin()
Function cublasDasum()
Function cublasDaxpy()
Function cublasDcopy()
Function cublasDdot()
Function cublasDnrm2()
Function cublasDrot()
Function cublasDrotg()
Function cublasDrotm()
Function cublasDrotmg()
Function cublasDscal()
Function cublasDswap()
Function cublasIdamax()

int
cublasIdamax (int n, const double *x, int incx)

finds the smallest index of the maximum magnitude element of double-precision vector x; that is, the result is the first i, i = 0 to n-1, that maximizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n       number of elements in input vector
x       double-precision vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/idamax.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasIdamin()

int
cublasIdamin (int n, const double *x, int incx)

finds the smallest index of the minimum magnitude element of double-precision vector x; that is, the result is the first i, i = 0 to n-1, that minimizes abs(x[1 + i * incx]). The result reflects 1-based indexing for compatibility with Fortran.

Input
n       number of elements in input vector
x       double-precision vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: Analogous to http://www.netlib.org/blas/idamax.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDasum()

double
cublasDasum (int n, const double *x, int incx)

computes the sum of the absolute values of the elements of double-precision vector x; that is, the result is the sum from i = 0 to n-1 of abs(x[1 + i * incx]).

Input
n       number of elements in input vector
x       double-precision vector with n elements
incx    storage spacing between elements of x

Output
returns the double-precision sum of absolute values
(returns zero if n <= 0 or incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/dasum.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDaxpy()

void
cublasDaxpy (int n, double alpha, const double *x,
             int incx, double *y, int incy)

multiplies double-precision vector x by double-precision scalar alpha and adds the result to double-precision vector y; that is, it overwrites double-precision y with double-precision alpha * x + y. For i = 0 to n-1, it replaces

y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
alpha   double-precision scalar multiplier
x       double-precision vector with n elements
incx    storage spacing between elements of x
y       double-precision vector with n elements
incy    storage spacing between elements of y

Output
y       double-precision result (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/daxpy.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDcopy()

void
cublasDcopy (int n, const double *x, int incx, double *y,
             int incy)

copies the double-precision vector x to the double-precision vector y. For i = 0 to n-1, it copies

x[lx + i * incx] to y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       double-precision vector with n elements
incx    storage spacing between elements of x
y       double-precision vector with n elements
incy    storage spacing between elements of y

Output
y       contains double-precision vector x

Reference: http://www.netlib.org/blas/dcopy.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDdot()

double
cublasDdot (int n, const double *x, int incx,
            const double *y, int incy)

computes the dot product of two double-precision vectors. It returns the dot product of the double-precision vectors x and y if successful, and 0.0 otherwise. It computes the sum for i = 0 to n-1 of

x[lx + i * incx] * y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       double-precision vector with n elements
incx    storage spacing between elements of x
y       double-precision vector with n elements
incy    storage spacing between elements of y

Output
returns double-precision dot product (returns zero if n <= 0)

Reference: http://www.netlib.org/blas/ddot.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDnrm2()

double
cublasDnrm2 (int n, const double *x, int incx)

computes the Euclidean norm of the double-precision n-vector x (with storage increment incx). This code uses a multiphase model of accumulation to avoid intermediate underflow and overflow.

Input
n       number of elements in input vector
x       double-precision vector with n elements
incx    storage spacing between elements of x

Output
returns the Euclidean norm
(returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/dnrm2.f
Reference: http://www.netlib.org/slatec/lin/dnrm2.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED        if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Function cublasDrot()

void
cublasDrot (int n, double *x, int incx, double *y,
            int incy, double dc, double ds)

multiplies a 2×2 matrix $\begin{pmatrix} dc & ds \\ -ds & dc \end{pmatrix}$ with the 2×n matrix $\begin{pmatrix} x^T \\ y^T \end{pmatrix}$.

The elements of x are in x[lx + i * incx], i = 0 to n-1, where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; y is treated similarly using ly and incy.

Input
n       number of elements in input vectors
x       double-precision vector with n elements
incx    storage spacing between elements of x
y       double-precision vector with n elements
incy    storage spacing between elements of y
dc      element of rotation matrix
ds      element of rotation matrix

Output
x       rotated vector x (unchanged if n <= 0)
y       rotated vector y (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/drot.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
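To make the matrix product in cublasDrot() concrete, the host-only sketch below applies the rotation element by element, following the update order of the netlib reference drot.f (new x = dc * x + ds * y, new y = dc * y - ds * x). The names `start_index` and `ref_drot` are hypothetical and not part of CUBLAS; the sketch only illustrates the arithmetic, not how the GPU kernel is organized.

```c
/* 1-based start index: 1 for non-negative increments, else 1 + (1 - n) * inc. */
static int start_index(int n, int inc)
{
    return (inc >= 0) ? 1 : 1 + (1 - n) * inc;
}

/* Applies the plane rotation [dc ds; -ds dc] to each pair (x_i, y_i). */
static void ref_drot(int n, double *x, int incx, double *y, int incy,
                     double dc, double ds)
{
    if (n <= 0) return;
    int lx = start_index(n, incx);
    int ly = start_index(n, incy);
    for (int i = 0; i < n; ++i) {
        double *px = &x[lx + i * incx - 1];   /* -1: C arrays are 0-based */
        double *py = &y[ly + i * incy - 1];
        double xi = *px;
        double yi = *py;
        *px =  dc * xi + ds * yi;
        *py = -ds * xi + dc * yi;
    }
}
```

With dc = 0.0 and ds = 1.0 (a quarter turn), x receives the old y and y receives the negated old x, which is a quick sanity check for the sign convention.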
Function cublasDrotg()

void
cublasDrotg (double *host_da, double *host_db,
             double *host_dc, double *host_ds)

constructs the Givens transformation

$G = \begin{pmatrix} dc & ds \\ -ds & dc \end{pmatrix}, \quad dc^2 + ds^2 = 1$

which zeros the second entry of the 2-vector $(da \;\; db)^T$.

The quantity $r = \pm\sqrt{da^2 + db^2}$ overwrites da in storage. The value of db is overwritten by a value z which allows dc and ds to be recovered by the following algorithm:

if $z = 1$, set $dc = 0.0$ and $ds = 1.0$.
if $|z| < 1$, set $dc = \sqrt{1 - z^2}$ and $ds = z$.
if $|z| > 1$, set $dc = 1/z$ and $ds = \sqrt{1 - dc^2}$.

The function cublasDrot(n, x, incx, y, incy, dc, ds) normally is called next to apply the transformation to a 2×n matrix. Note that this function is provided for completeness and is run exclusively on the host.

Input
da      double-precision scalar
db      double-precision scalar

Output
da      double-precision r
db      double-precision z
dc      double-precision result
ds      double-precision result

Reference: http://www.netlib.org/blas/drotg.f

This function does not set any error status.
Function cublasDrotm()

void
cublasDrotm (int n, double *x, int incx, double *y,
             int incy, const double *dparam)

applies the modified Givens transformation, h, to the 2×n matrix $\begin{pmatrix} x^T \\ y^T \end{pmatrix}$.

The elements of x are in x[lx + i * incx], i = 0 to n-1, where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; y is treated similarly using ly and incy.

With dparam[0] = dflag, h has one of the following forms:

dflag = -1.0:  $h = \begin{pmatrix} dh00 & dh01 \\ dh10 & dh11 \end{pmatrix}$
dflag = 0.0:   $h = \begin{pmatrix} 1.0 & dh01 \\ dh10 & 1.0 \end{pmatrix}$
dflag = 1.0:   $h = \begin{pmatrix} dh00 & 1.0 \\ -1.0 & dh11 \end{pmatrix}$
dflag = -2.0:  $h = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}$

Input
n       number of elements in input vectors.
x       double-precision vector with n elements.
incx    storage spacing between elements of x.
y       double-precision vector with n elements.
incy    storage spacing between elements of y.
dparam  5-element vector. dparam[0] is dflag described above. dparam[1]
        through dparam[4] contain the 2×2 rotation matrix h: dparam[1]
        contains dh00, dparam[2] contains dh10, dparam[3] contains
        dh01, and dparam[4] contains dh11.

Output
x       rotated vector x (unchanged if n <= 0)
y       rotated vector y (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/drotm.f

Error status for this function can be retrieved via cublasGetError().
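The way dflag selects which entries of dparam are actually read by cublasDrotm() can be sketched on the host as follows. This is an illustration of the four forms of h listed above, not the library's code: `ref_drotm` is a hypothetical name, the sketch assumes unit increments for brevity, and any dflag value other than -2.0, -1.0, or 0.0 is treated as 1.0.

```c
/* Applies the modified Givens matrix h, selected by dparam[0] (dflag),
   to each pair (x[i], y[i]). Unit strides are assumed for brevity.
   dparam layout from the text: [dflag, dh00, dh10, dh01, dh11]. */
static void ref_drotm(int n, double *x, double *y, const double *dparam)
{
    double dflag = dparam[0];
    double h00, h01, h10, h11;

    if (n <= 0 || dflag == -2.0) return;   /* -2.0 selects the identity */

    if (dflag == -1.0) {                   /* full matrix is stored */
        h00 = dparam[1]; h10 = dparam[2];
        h01 = dparam[3]; h11 = dparam[4];
    } else if (dflag == 0.0) {             /* unit diagonal is implied */
        h00 = 1.0;       h10 = dparam[2];
        h01 = dparam[3]; h11 = 1.0;
    } else {                               /* treated as dflag == 1.0 */
        h00 = dparam[1]; h10 = -1.0;
        h01 = 1.0;       h11 = dparam[4];
    }

    for (int i = 0; i < n; ++i) {
        double xi = x[i];
        double yi = y[i];
        x[i] = h00 * xi + h01 * yi;
        y[i] = h10 * xi + h11 * yi;
    }
}
```

The entries implied by dflag are never read from dparam, so a caller may leave those slots uninitialized, which matches the note that such values are not stored by the routine that constructs the transformation.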
Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDrotmg()

void
cublasDrotmg (double *host_dd1, double *host_dd2,
              double *host_dx1, const double *host_dy1,
              double *host_dparam)

constructs the modified Givens transformation matrix h which zeros the second component of the 2-vector $(\sqrt{dd1} \cdot dx1, \; \sqrt{dd2} \cdot dy1)^T$.

With dparam[0] = dflag, h has one of the following forms:

dflag = -1.0:  $h = \begin{pmatrix} dh00 & dh01 \\ dh10 & dh11 \end{pmatrix}$
dflag = 0.0:   $h = \begin{pmatrix} 1.0 & dh01 \\ dh10 & 1.0 \end{pmatrix}$
dflag = 1.0:   $h = \begin{pmatrix} dh00 & 1.0 \\ -1.0 & dh11 \end{pmatrix}$
dflag = -2.0:  $h = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}$

dparam[1] through dparam[4] contain dh00, dh10, dh01, and dh11, respectively. Values of 1.0, -1.0, or 0.0 implied by the value of dflag are not stored in dparam. Note that this function is provided for completeness and is run exclusively on the host.

Input
dd1     double-precision scalar
dd2     double-precision scalar
dx1     double-precision scalar
dy1     double-precision scalar

Output
dd1     changed to represent the effect of the transformation
dd2     changed to represent the effect of the transformation
dx1     changed to represent the effect of the transformation
dparam  5-element vector. dparam[0] is dflag described above. dparam[1]
        through dparam[4] contain the 2×2 rotation matrix h: dparam[1]
        contains dh00, dparam[2] contains dh10, dparam[3] contains
        dh01, and dparam[4] contains dh11.

Reference: http://www.netlib.org/blas/drotmg.f

This function does not set any error status.

Function cublasDscal()

void
cublasDscal (int n, double alpha, double *x, int incx)

replaces double-precision vector x with double-precision alpha * x. For i = 0 to n-1, it replaces

x[lx + i * incx] with alpha * x[lx + i * incx],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
alpha   double-precision scalar multiplier
x       double-precision vector with n elements
incx    storage spacing between elements of x

Output
x       double-precision result (unchanged if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/dscal.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasDswap()

void
cublasDswap (int n, double *x, int incx, double *y,
             int incy)

interchanges double-precision vector x with double-precision vector y. For i = 0 to n-1, it interchanges

x[lx + i * incx] with y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar manner using incy.

Input
n       number of elements in input vectors
x       double-precision vector with n elements
incx    storage spacing between elements of x
y       double-precision vector with n elements
incy    storage spacing between elements of y

Output
x       double-precision vector y (unchanged from input if n <= 0)
y       double-precision vector x (unchanged from input if n <= 0)

Reference: http://www.netlib.org/blas/dswap.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Double-Precision Complex BLAS1 Functions

Note: Double-precision functions are only supported on GPUs with double-precision hardware.

The double-precision complex BLAS1 functions are listed below:
Function cublasDzasum()
Function cublasDznrm2()
Function cublasIzamax()
Function cublasIzamin()
Function cublasZaxpy()
Function cublasZcopy()
Function cublasZdotc()
Function cublasZdotu()
Function cublasZdrot()
Function cublasZdscal()
Function cublasZrot()
Function cublasZrotg()
Function cublasZscal()
Function cublasZswap()
Function cublasDzasum()

double
cublasDzasum (int n, const cuDoubleComplex *x, int incx)

takes the sum of the absolute values of a complex vector and returns a double-precision result. Note that this is not the L1 norm of the vector. The result is the sum from 0 to n-1 of

abs(real(x[lx + i * incx])) + abs(imag(x[lx + i * incx])),

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx.

Input
n       number of elements in input vector
x       double-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the double-precision sum of absolute values of real and imaginary parts
(returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/dzasum.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
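The remark that cublasDzasum() does not return the L1 norm is easiest to see on a single element: for 3 - 4i the routine contributes |3| + |-4| = 7, whereas the complex modulus is 5. The host-only sketch below reproduces the documented sum; `zcplx`, `abs_d`, and `ref_dzasum` are hypothetical names (with `zcplx` standing in for cuDoubleComplex), and the early return mirrors the documented zero result for n <= 0 or incx <= 0.

```c
/* Hypothetical stand-in for cuDoubleComplex so the sketch builds without CUDA. */
typedef struct { double re, im; } zcplx;

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Sum over i of |Re(x_i)| + |Im(x_i)|; zero if n <= 0 or incx <= 0. */
static double ref_dzasum(int n, const zcplx *x, int incx)
{
    double sum = 0.0;
    if (n <= 0 || incx <= 0) return 0.0;
    for (int i = 0; i < n; ++i) {
        zcplx v = x[i * incx];   /* positive stride, so the start offset is 0 */
        sum += abs_d(v.re) + abs_d(v.im);
    }
    return sum;
}
```

For the vector (3 - 4i, i) the sketch returns 8.0, while a sum of moduli would give 6.0.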
Function cublasDznrm2()

double
cublasDznrm2 (int n, const cuDoubleComplex *x, int incx)

computes the Euclidean norm of double-precision complex n-vector x. This implementation uses simple scaling to avoid intermediate underflow and overflow.

Input
n       number of elements in input vector
x       double-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the Euclidean norm
(returns zero if n <= 0, incx <= 0, or if an error occurred)

Reference: http://www.netlib.org/blas/dznrm2.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasIzamax()

int
cublasIzamax (int n, const cuDoubleComplex *x, int incx)

finds the smallest index of the maximum magnitude element of double-precision complex vector x; that is, the result is the first i, i = 0 to n-1, that maximizes

abs(real(x[1 + i * incx])) + abs(imag(x[1 + i * incx])).

Input
n       number of elements in input vector
x       double-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: http://www.netlib.org/blas/izamax.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU

Function cublasIzamin()

int
cublasIzamin (int n, const cuDoubleComplex *x, int incx)

finds the smallest index of the minimum magnitude element of double-precision complex vector x; that is, the result is the first i, i = 0 to n-1, that minimizes

abs(real(x[1 + i * incx])) + abs(imag(x[1 + i * incx])).

Input
n       number of elements in input vector
x       double-precision complex vector with n elements
incx    storage spacing between elements of x

Output
returns the smallest index (returns zero if n <= 0 or incx <= 0)

Reference: analogous to Function cublasIzamax().

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
Function cublasZaxpy()

void
cublasZaxpy (int n, cuDoubleComplex alpha,
             const cuDoubleComplex *x, int incx,
             cuDoubleComplex *y, int incy)

multiplies double-precision complex vector x by double-precision complex scalar alpha and adds the result to double-precision complex vector y; that is, it overwrites double-precision complex y with double-precision complex alpha * x + y. For i = 0 to n-1, it replaces

y[ly + i * incy] with alpha * x[lx + i * incx] + y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
alpha   double-precision complex scalar multiplier
x       double-precision complex vector with n elements
incx    storage spacing between elements of x
y       double-precision complex vector with n elements
incy    storage spacing between elements of y

Output
y       double-precision complex result (unchanged if n <= 0)

Reference: http://www.netlib.org/blas/zaxpy.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
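The update performed by cublasZaxpy() combines a complex multiply with the strided indexing rule, which the following host-only sketch spells out. It is a reference for the arithmetic only: `zcplx`, `start_index`, and `ref_zaxpy` are hypothetical names, `zcplx` stands in for cuDoubleComplex, and the sketch assumes a start index of 1 for non-negative increments, as in the other routines of this chapter.

```c
/* Hypothetical stand-in for cuDoubleComplex so the sketch builds without CUDA. */
typedef struct { double re, im; } zcplx;

/* Assumed 1-based start index: 1 for inc >= 0, else 1 + (1 - n) * inc. */
static int start_index(int n, int inc)
{
    return (inc >= 0) ? 1 : 1 + (1 - n) * inc;
}

/* y[ly + i*incy] = alpha * x[lx + i*incx] + y[ly + i*incy], i = 0 .. n-1. */
static void ref_zaxpy(int n, zcplx alpha, const zcplx *x, int incx,
                      zcplx *y, int incy)
{
    if (n <= 0) return;
    int lx = start_index(n, incx);
    int ly = start_index(n, incy);
    for (int i = 0; i < n; ++i) {
        zcplx xv = x[lx + i * incx - 1];      /* -1: C arrays are 0-based */
        zcplx *yv = &y[ly + i * incy - 1];
        /* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
        yv->re += alpha.re * xv.re - alpha.im * xv.im;
        yv->im += alpha.re * xv.im + alpha.im * xv.re;
    }
}
```

With alpha = i, x = (1) and y = (2 + 2i), the result is y = (2 + 3i). A negative incy makes the routine fill y from its last element backwards.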
Function cublasZcopy()

void
cublasZcopy (int n, const cuDoubleComplex *x, int incx,
             cuDoubleComplex *y, int incy)

copies the double-precision complex vector x to the double-precision complex vector y. For i = 0 to n-1, it copies

x[lx + i * incx] to y[ly + i * incy],

where lx = 1 if incx >= 0, else lx = 1 + (1 - n) * incx; ly is defined in a similar way using incy.

Input
n       number of elements in input vectors
x       double-precision complex vector with n elements
incx    storage spacing between elements of x
y       double-precision complex vector with n elements
incy    storage spacing between elements of y

Output
y       contains double-precision complex vector x

Reference: http://www.netlib.org/blas/zcopy.f

Error status for this function can be retrieved via cublasGetError().

Error Status
CUBLAS_STATUS_NOT_INITIALIZED     if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED    if function failed to launch on GPU
CUBLAS_STATUS_ARCH_MISMATCH       if function invoked on device that does not support double precision
Function cublasZdotc()
cuDoubleComplex
cublasZdotc (int n, const cuDoubleComplex *x, int incx,
             const cuDoubleComplex *y, int incy)
computes the dot product of two double-precision complex vectors, the
first of which is conjugated (as in the referenced BLAS routine). It
returns the dot product of the double-precision complex vectors x and
y if successful, and complex zero otherwise. For i = 0 to n-1, it sums
the products
    conj(x[lx + i * incx]) * y[ly + i * incy],
where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.
Input
n
number of elements in input vectors
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
y
double-precision complex vector with n elements
incy
storage spacing between elements of y
Output
returns double-precision complex dot product (zero if n <= 0)
Reference: http://www.netlib.org/blas/zdotc.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED
if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZdotu()
cuDoubleComplex
cublasZdotu (int n, const cuDoubleComplex *x, int incx,
             const cuDoubleComplex *y, int incy)
computes the dot product of two double-precision complex vectors. It
returns the dot product of the double-precision complex vectors x and
y if successful, and complex zero otherwise. For i = 0 to n-1, it sums
the products
    x[lx + i * incx] * y[ly + i * incy],
where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx;
ly is defined in a similar way using incy.
Input
n
number of elements in input vectors
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
y
double-precision complex vector with n elements
incy
storage spacing between elements of y
Output
returns double-precision complex dot product (returns zero if n <= 0)
Reference: http://www.netlib.org/blas/zdotu.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_ALLOC_FAILED
if function could not allocate reduction buffer
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
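The difference between the unconjugated and conjugated dot products (the BLAS routines zdotu and zdotc) can be sketched on the host as follows. This is an illustration only, not the CUBLAS implementation: ref_zdot and its conjugate_x flag are hypothetical, C99 double complex stands in for cuDoubleComplex, and offsets are zero-based.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Host reference (hypothetical helper): strided complex dot product.
   conjugate_x != 0 gives the zdotc-style sum, conjugate_x == 0 the zdotu-style sum. */
static double complex ref_zdot(int n,
                               const double complex *x, int incx,
                               const double complex *y, int incy,
                               int conjugate_x)
{
    double complex sum = 0.0;
    if (n <= 0) {
        return sum;                  /* zero if n <= 0 */
    }
    assert(x != NULL && y != NULL);
    ptrdiff_t ix = (incx >= 0) ? 0 : (ptrdiff_t)(1 - n) * incx;
    ptrdiff_t iy = (incy >= 0) ? 0 : (ptrdiff_t)(1 - n) * incy;
    for (int i = 0; i < n; ++i) {
        double complex xi = conjugate_x ? conj(x[ix]) : x[ix];
        sum += xi * y[iy];
        ix += incx;
        iy += incy;
    }
    return sum;
}
```

For x = (1 + 2i) and y = (3 + 4i), the unconjugated product is -5 + 10i while the conjugated product is 11 - 2i, so the two routines are not interchangeable for complex data.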
Function cublasZdrot()
void
cublasZdrot (int n, cuDoubleComplex *x, int incx,
             cuDoubleComplex *y, int incy,
             double c, double s)
multiplies a 2×2 matrix
    (  c  s )
    ( -s  c )
with the 2×n matrix
    ( x^T )
    ( y^T ).
The elements of x are in x[lx + i * incx], i = 0 to n-1, where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx;
y is treated similarly using ly and incy.
Input
n
number of elements in input vectors
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
y
double-precision complex vector with n elements
incy
storage spacing between elements of y
c
double-precision cosine component of rotation matrix
s
double-precision sine component of rotation matrix
Output
x
rotated vector x (unchanged if n <= 0)
y
rotated vector y (unchanged if n <= 0)
Reference: http://www.netlib.org/blas/zdrot.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
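The effect of the rotation on each pair of elements can be written out as a short host-side sketch. It assumes the standard plane rotation (x' = c * x + s * y, y' = c * y - s * x); ref_zdrot is a hypothetical helper name, not a CUBLAS function, and C99 double complex stands in for cuDoubleComplex.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Host reference (hypothetical helper): apply a real plane rotation
   to two strided complex vectors, element pair by element pair. */
static void ref_zdrot(int n, double complex *x, int incx,
                      double complex *y, int incy,
                      double c, double s)
{
    if (n <= 0) {
        return;                      /* x and y are unchanged if n <= 0 */
    }
    assert(x != NULL && y != NULL);
    ptrdiff_t ix = (incx >= 0) ? 0 : (ptrdiff_t)(1 - n) * incx;
    ptrdiff_t iy = (incy >= 0) ? 0 : (ptrdiff_t)(1 - n) * incy;
    for (int i = 0; i < n; ++i) {
        double complex xi = x[ix];   /* keep old values: both outputs need them */
        double complex yi = y[iy];
        x[ix] = c * xi + s * yi;
        y[iy] = c * yi - s * xi;
        ix += incx;
        iy += incy;
    }
}
```

With c = 0 and s = 1 the rotation maps x to the old y and y to the negated old x, which is a convenient sanity check.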
Function cublasZdscal()
void
cublasZdscal (int n, double alpha, cuDoubleComplex *x,
              int incx)
replaces double-precision complex vector x with double-precision
complex alpha * x.
For i = 0 to n-1, it replaces
    x[lx + i * incx] with alpha * x[lx + i * incx],
where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx.
Input
n
number of elements in input vector
alpha
double-precision scalar multiplier
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
Output
x
double-precision complex result (unchanged if n <= 0 or incx <= 0)
Reference: http://www.netlib.org/blas/zdscal.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZrot()
void
cublasZrot (int n, cuDoubleComplex *x, int incx,
            cuDoubleComplex *y, int incy, double sc,
            cuDoubleComplex cs)
multiplies a 2×2 matrix
    (  sc        cs )
    ( -conj(cs)  sc )
with the 2×n matrix
    ( x^T )
    ( y^T ).
The elements of x are in x[lx + i * incx], i = 0 to n-1, where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx;
y is treated similarly using ly and incy.
Input
n
number of elements in input vectors
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
y
double-precision complex vector with n elements
incy
storage spacing between elements of y
sc
double-precision cosine component of rotation matrix
cs
double-precision complex sine component of rotation matrix
Output
x
rotated double-precision complex vector x (unchanged if n <= 0)
y
rotated double-precision complex vector y (unchanged if n <= 0)
Reference: http://netlib.org/lapack/explorehtml/zrot.f.html
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZrotg()
void
cublasZrotg (cuDoubleComplex *host_ca,
             cuDoubleComplex *host_cb, double *host_sc,
             double *host_cs)
constructs the complex Givens transformation
        (  sc        cs )
    G = (               ),   sc^2 + |cs|^2 = 1,
        ( -conj(cs)  sc )
which zeros the second entry of the complex 2-vector (ca, cb)^T.
The quantity ca / |ca| * ||(ca, cb)|| overwrites ca in storage. The
function cublasZrot(n, x, incx, y, incy, sc, cs) normally is called
next to apply the transformation to a 2×n matrix. Note that this
function is provided for completeness and is run exclusively on the
host.
Input
ca
double-precision complex scalar
cb
double-precision complex scalar
Output
ca
double-precision complex ca / |ca| * ||(ca, cb)||
sc
double-precision cosine component of rotation matrix
cs
double-precision complex sine component of rotation matrix
Reference: http://www.netlib.org/blas/zrotg.f
This function does not set any error status.
Function cublasZscal()
void
cublasZscal (int n, cuDoubleComplex alpha,
             cuDoubleComplex *x, int incx)
replaces double-precision complex vector x with double-precision
complex alpha * x.
For i = 0 to n-1, it replaces
    x[lx + i * incx] with alpha * x[lx + i * incx],
where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx.
Input
n
number of elements in input vector
alpha
double-precision complex scalar multiplier
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
Output
x
double-precision complex result (unchanged if n <= 0 or incx <= 0)
Reference: http://www.netlib.org/blas/zscal.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
Function cublasZswap()
void
cublasZswap (int n, cuDoubleComplex *x, int incx,
             cuDoubleComplex *y, int incy)
interchanges double-precision complex vector x with double-precision
complex vector y. For i = 0 to n-1, it interchanges
    x[lx + i * incx] with y[ly + i * incy],
where
    lx = 1 if incx >= 0, else
    lx = 1 + (1 - n) * incx;
ly is defined in a similar manner using incy.
Input
n
number of elements in input vectors
x
double-precision complex vector with n elements
incx
storage spacing between elements of x
y
double-precision complex vector with n elements
incy
storage spacing between elements of y
Output
x
double-precision complex vector y (unchanged from input if n <= 0)
y
double-precision complex vector x (unchanged from input if n <= 0)
Reference: http://www.netlib.org/blas/zswap.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CHAPTER 3
Single-Precision BLAS2 Functions
The Level 2 Basic Linear Algebra Subprograms (BLAS2) are functions
that perform matrix-vector operations. The CUBLAS implementations
of single-precision BLAS2 functions are described in these sections:
Single-Precision BLAS2 Functions
Single-Precision Complex BLAS2 Functions
Single-Precision BLAS2 Functions
The single-precision BLAS2 functions are as follows:
Function cublasSgbmv()
Function cublasSgemv()
Function cublasSger()
Function cublasSsbmv()
Function cublasSspmv()
Function cublasSspr()
Function cublasSspr2()
Function cublasSsymv()
Function cublasSsyr()
Function cublasSsyr2()
Function cublasStbmv()
Function cublasStbsv()
Function cublasStpmv()
Function cublasStpsv()
Function cublasStrmv()
Function cublasStrsv()
Function cublasSgbmv()
void
cublasSgbmv (char trans, int m, int n, int kl, int ku,
             float alpha, const float *A, int lda,
             const float *x, int incx, float beta,
             float *y, int incy)
performs one of the matrix-vector operations
    y = alpha * op(A) * x + beta * y,
where op(A) = A or op(A) = A^T,
alpha and beta are single-precision scalars, and x and y are
single-precision vectors. A is an m×n band matrix consisting of
single-precision elements with kl subdiagonals and ku superdiagonals.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
m
the number of rows of matrix A; m must be at least zero.
n
the number of columns of matrix A; n must be at least zero.
kl
the number of subdiagonals of matrix A; kl must be at least zero.
ku
the number of superdiagonals of matrix A; ku must be at least zero.
alpha
single-precision scalar multiplier applied to op(A).
A
single-precision array of dimensions (lda, n). The leading
(kl + ku + 1)×n part of array A must contain the band matrix A,
supplied column by column, with the leading diagonal of the matrix in
row ku+1 of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row ku+2, and so
on. Elements in the array A that do not correspond to elements in the
band matrix (such as the top left ku×ku triangle) are not referenced.
lda
leading dimension of A; lda must be at least kl + ku + 1.
x
single-precision array of length at least (1 + (n - 1) * abs(incx))
when trans == 'N' or 'n', and at least (1 + (m - 1) * abs(incx))
otherwise.
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y. If beta is zero, y
is not read.
y
single-precision array of length at least (1 + (m - 1) * abs(incy))
when trans == 'N' or 'n' and at least (1 + (n - 1) * abs(incy))
otherwise. If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/sgbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, kl < 0, ku < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSgemv()
void
cublasSgemv (char trans, int m, int n, float alpha,
             const float *A, int lda, const float *x,
             int incx, float beta, float *y, int incy)
performs one of the matrix-vector operations
    y = alpha * op(A) * x + beta * y,
where op(A) = A or op(A) = A^T,
alpha and beta are single-precision scalars, and x and y are
single-precision vectors. A is an m×n matrix consisting of
single-precision elements. Matrix A is stored in column-major format,
and lda is the leading dimension of the two-dimensional array in which
A is stored.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to op(A).
A
single-precision array of dimensions (lda, n) if trans == 'N' or
'n', of dimensions (lda, m) otherwise; lda must be at least
max(1, m) if trans == 'N' or 'n' and at least max(1, n) otherwise.
lda
leading dimension of two-dimensional array used to store matrix A.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)) if
trans == 'N' or 'n', else at least (1 + (m - 1) * abs(incx)).
incx
specifies the storage spacing for elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y. If beta is zero, y
is not read.
y
single-precision array of length at least (1 + (m - 1) * abs(incy)) if
trans == 'N' or 'n', else at least (1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/sgemv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSger()
void
cublasSger (int m, int n, float alpha, const float *x,
            int incx, const float *y, int incy, float *A,
            int lda)
performs the rank 1 operation
    A = alpha * x * y^T + A,
where alpha is a single-precision scalar, x is an m-element
single-precision vector, y is an n-element single-precision vector, and
A is an m×n matrix consisting of single-precision elements. Matrix A is
stored in column-major format, and lda is the leading dimension of the
two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * y^T.
x
single-precision array of length at least (1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
single-precision array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^T + A.
Reference: http://www.netlib.org/blas/sger.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsbmv()
void
cublasSsbmv (char uplo, int n, int k, float alpha,
             const float *A, int lda, const float *x,
             int incx, float beta, float *y, int incy)
performs the matrix-vector operation
    y = alpha * A * x + beta * y,
where alpha and beta are single-precision scalars, and x and y are
n-element single-precision vectors. A is an n×n symmetric band matrix
consisting of single-precision elements, with k superdiagonals and the
same number of subdiagonals.
Input
uplo
specifies whether the upper or lower triangular part of the symmetric
band matrix A is being supplied. If uplo == 'U' or 'u', the upper
triangular part is being supplied. If uplo == 'L' or 'l', the lower
triangular part is being supplied.
n
specifies the number of rows and the number of columns of the
symmetric matrix A; n must be at least zero.
k
specifies the number of superdiagonals of matrix A. Since the matrix is
symmetric, this is also the number of subdiagonals; k must be at least
zero.
alpha
single-precision scalar multiplier applied to A * x.
A
single-precision array of dimensions (lda, n). When uplo == 'U' or
'u', the leading (k+1)×n part of array A must contain the upper
triangular band of the symmetric matrix, supplied column by column,
with the leading diagonal of the matrix in row k+1 of the array, the
first superdiagonal starting at position 2 in row k, and so on. The top
left k×k triangle of the array A is not referenced. When uplo == 'L'
or 'l', the leading (k+1)×n part of the array A must contain the
lower triangular band part of the symmetric matrix, supplied column
by column, with the leading diagonal of the matrix in row 1 of the
array, the first subdiagonal starting at position 1 in row 2, and so on.
The bottom right k×k triangle of the array A is not referenced.
lda
leading dimension of A; lda must be at least k+1.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y. If beta is zero, y
is not read.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/ssbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if k < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSspmv()
void
cublasSspmv (char uplo, int n, float alpha,
             const float *AP, const float *x, int incx,
             float beta, float *y, int incy)
performs the matrix-vector operation
    y = alpha * A * x + beta * y,
where alpha and beta are single-precision scalars, and x and y are
n-element single-precision vectors. A is a symmetric n×n matrix that
consists of single-precision elements and is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to A * x.
AP
single-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2].
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y. If beta is zero, y
is not read.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/sspmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSspr()
void
cublasSspr (char uplo, int n, float alpha,
            const float *x, int incx, float *AP)
performs the symmetric rank 1 operation
    A = alpha * x * x^T + A,
where alpha is a single-precision scalar, and x is an n-element
single-precision vector. A is a symmetric n×n matrix that consists of
single-precision elements and is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * x^T.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
AP
single-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2].
Output
A
updated according to A = alpha * x * x^T + A.
Reference: http://www.netlib.org/blas/sspr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSspr2()
void
cublasSspr2 (char uplo, int n, float alpha,
             const float *x, int incx, const float *y,
             int incy, float *AP)
performs the symmetric rank 2 operation
    A = alpha * x * y^T + alpha * y * x^T + A,
where alpha is a single-precision scalar, and x and y are n-element
single-precision vectors. A is a symmetric n×n matrix that consists of
single-precision elements and is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * y^T + alpha * y * x^T.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
storage spacing between elements of y; incy must not be zero.
AP
single-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2].
Output
A
updated according to A = alpha * x * y^T + alpha * y * x^T + A.
Reference: http://www.netlib.org/blas/sspr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
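The packed storage used by the cublasSspmv(), cublasSspr(), and cublasSspr2() functions can be checked with a small host-side sketch. It assumes zero-based row and column indices i and j and column-by-column packing; the helper names packed_index_upper, packed_index_lower, and packing_is_contiguous are hypothetical. For the lower triangle the column offset is written as ((2 * n - j - 1) * j) / 2, which is the form that makes consecutive columns tile the array without gaps under these assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Upper triangle (i <= j): column j starts after 1 + 2 + ... + j entries. */
static size_t packed_index_upper(size_t i, size_t j)
{
    assert(i <= j);
    return i + (j * (j + 1)) / 2;
}

/* Lower triangle (i >= j): column c holds n - c entries, so column j
   starts after n + (n - 1) + ... + (n - j + 1) entries. */
static size_t packed_index_lower(size_t n, size_t i, size_t j)
{
    assert(j <= i && i < n);
    return i + ((2 * n - j - 1) * j) / 2;
}

/* Returns 1 if visiting the chosen triangle column by column yields the
   packed indices 0, 1, ..., n * (n + 1) / 2 - 1 in order, else 0. */
static int packing_is_contiguous(size_t n, int upper)
{
    size_t expected = 0;
    for (size_t j = 0; j < n; ++j) {
        size_t first = upper ? 0 : j;
        size_t last = upper ? j : n - 1;
        for (size_t i = first; i <= last; ++i) {
            size_t k = upper ? packed_index_upper(i, j)
                             : packed_index_lower(n, i, j);
            if (k != expected) {
                return 0;
            }
            ++expected;
        }
    }
    return expected == (n * (n + 1)) / 2;
}
```

For n = 3 the lower triangle is packed as (0,0), (1,0), (2,0), (1,1), (2,1), (2,2), so element (1,1) lands at packed index 3.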
Function cublasSsymv()
void
cublasSsymv (char uplo, int n, float alpha,
             const float *A, int lda, const float *x,
             int incx, float beta, float *y, int incy)
performs the matrix-vector operation
    y = alpha * A * x + beta * y,
where alpha and beta are single-precision scalars, and x and y are
n-element single-precision vectors. A is a symmetric n×n matrix that
consists of single-precision elements and is stored in either upper or
lower storage mode.
Input
uplo
specifies whether the upper or lower triangular part of the array A is
referenced. If uplo == 'U' or 'u', the symmetric matrix A is stored in
upper storage mode; that is, only the upper triangular part of A is
referenced while the lower triangular part of A is inferred. If uplo ==
'L' or 'l', the symmetric matrix A is stored in lower storage mode;
that is, only the lower triangular part of A is referenced while the upper
triangular part of A is inferred.
n
specifies the number of rows and the number of columns of the
symmetric matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to A * x.
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
the leading n×n upper triangular part of the array A must contain the
upper triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced. If uplo == 'L' or 'l', the
leading n×n lower triangular part of the array A must contain the lower
triangular part of the symmetric matrix, and the strictly upper
triangular part of A is not referenced.
lda
leading dimension of A; lda must be at least max(1, n).
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/ssymv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsyr()
void
cublasSsyr (char uplo, int n, float alpha,
            const float *x, int incx, float *A, int lda)
performs the symmetric rank 1 operation
    A = alpha * x * x^T + A,
where alpha is a single-precision scalar, x is an n-element
single-precision vector, and A is an n×n symmetric matrix consisting of
single-precision elements. A is stored in column-major format, and lda
is the leading dimension of the two-dimensional array containing A.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced. If uplo == 'L' or 'l', only the
lower triangular part of A is referenced.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * x^T.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
A contains the upper triangular part of the symmetric matrix, and the
strictly lower triangular part is not referenced. If uplo == 'L' or 'l',
A contains the lower triangular part of the symmetric matrix, and the
strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * x^T + A.
Reference: http://www.netlib.org/blas/ssyr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsyr2()
void
cublasSsyr2 (char uplo, int n, float alpha,
             const float *x, int incx, const float *y,
             int incy, float *A, int lda)
performs the symmetric rank 2 operation
    A = alpha * x * y^T + alpha * y * x^T + A,
where alpha is a single-precision scalar, x and y are n-element
single-precision vectors, and A is an n×n symmetric matrix consisting
of single-precision elements.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced and the lower triangular part of A is
inferred. If uplo == 'L' or 'l', only the lower triangular part of A is
referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * y^T + y * x^T.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
y
single-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
storage spacing between elements of y; incy must not be zero.
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
A contains the upper triangular part of the symmetric matrix, and the
strictly lower triangular part is not referenced. If uplo == 'L' or 'l',
A contains the lower triangular part of the symmetric matrix, and the
strictly upper triangular part is not referenced.
lda
leading dimension of A; lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * y^T + alpha * y * x^T + A.
Reference: http://www.netlib.org/blas/ssyr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
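The role of uplo and lda in the symmetric rank 2 update can be illustrated with a host-side sketch. It is a simplified reference under stated assumptions, not the CUBLAS implementation: ref_ssyr2 is a hypothetical helper, x and y use unit stride (incx == incy == 1), indices are zero-based, and A is column-major with leading dimension lda.

```c
#include <assert.h>
#include <stddef.h>

/* Host reference (hypothetical helper) for
   A = alpha * x * y^T + alpha * y * x^T + A,
   touching only the triangle selected by uplo. */
static void ref_ssyr2(char uplo, int n, float alpha,
                      const float *x, const float *y,
                      float *A, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int upper = (uplo == 'U' || uplo == 'u');
    for (int j = 0; j < n; ++j) {
        int first = upper ? 0 : j;       /* rows 0..j for the upper triangle */
        int last = upper ? j : n - 1;    /* rows j..n-1 for the lower triangle */
        for (int i = first; i <= last; ++i) {
            /* column-major: element (i, j) lives at A[i + j * lda] */
            A[i + (size_t)j * (size_t)lda] +=
                alpha * x[i] * y[j] + alpha * y[i] * x[j];
        }
    }
}
```

Entries in the other strict triangle are never read or written, so a caller may leave them holding unrelated data.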
Funct ion cublasSt bmv( )
void
cublasStbmv (char uplo, char trans, char diag, int n,
int k, const float *A, int lda, float *x,
int incx)
performsoneofthematrixvectoroperations
xisannelementsingleprecisionvector,andAisannn,unitornon
unit,upperorlower,triangularbandmatrixconsistingofsingle
precisionelements.
,
where or ,
I nput
uplo
specifies whether the matrix A is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not matrix A is unit triangular. If diag == 'U' or
'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
k
specifies the number of superdiagonals or subdiagonals. If uplo ==
'U' or 'u', k specifies the number of superdiagonals. If uplo == 'L'
or 'l', k specifies the number of subdiagonals; k must be at least zero.
A
single-precision array of dimension (lda, n). If uplo == 'U' or 'u',
the leading (k+1)×n part of the array A must contain the upper
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row k+1 of the array, the first superdiagonal
starting at position 2 in row k, and so on. The top left k×k triangle of
the array A is not referenced. If uplo == 'L' or 'l', the leading
(k+1)×n part of the array A must contain the lower triangular band
matrix, supplied column by column, with the leading diagonal of the
matrix in row 1 of the array, the first subdiagonal starting at position 1
in row 2, and so on. The bottom right k×k triangle of the array is not
referenced.
lda
is the leading dimension of A; lda must be at least k+1.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, k < 0, or incx == 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/stbmv.f
Error status for this function can be retrieved via cublasGetError().
Function cublasStbsv()
void
cublasStbsv (char uplo, char trans, char diag, int n,
int k, const float *A, int lda, float *x,
int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element vectors, and A is an n×n, unit or non-unit, upper
or lower, triangular band matrix with k+1 diagonals.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Reference: http://www.netlib.org/blas/stbsv.f
Input
uplo
specifies whether the matrix is an upper or lower triangular band
matrix: If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
the number of rows and columns of matrix A; n must be at least zero.
k
specifies the number of superdiagonals or subdiagonals.
If uplo == 'U' or 'u', k specifies the number of superdiagonals. If
uplo == 'L' or 'l', k specifies the number of subdiagonals; k must
be at least zero.
A
single-precision array of dimension (lda, n). If uplo == 'U' or 'u',
the leading (k+1)×n part of the array A must contain the upper
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row k+1 of the array, the first superdiagonal
starting at position 2 in row k, and so on. The top left k×k triangle of
the array A is not referenced. If uplo == 'L' or 'l', the leading
(k+1)×n part of the array A must contain the lower triangular band
matrix, supplied column by column, with the leading diagonal of the
matrix in row 1 of the array, the first subdiagonal starting at position
1 in row 2, and so on. The bottom right k×k triangle of the array is not
referenced.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the n-element right-hand side vector b. On exit, it
is overwritten with the solution vector x.
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
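The band layout described for A can be made concrete with a host-side sketch. This is plain C for illustration only, not the CUBLAS implementation: the helper names are hypothetical, indices are 0-based (so "row 1 of the array" becomes band row 0), and the solver covers only the lower-triangular, non-unit, op(A) = A, incx == 1 case.

```c
#include <assert.h>

/* Offset of A(i,j) inside the band array for the lower-triangular layout:
   diagonal in band row 0, first subdiagonal in band row 1, and so on. */
static int band_lower_index(int i, int j, int lda)
{
    return (i - j) + j * lda;
}

/* Offset for the upper-triangular layout: diagonal in band row k,
   first superdiagonal in band row k - 1, and so on. */
static int band_upper_index(int i, int j, int k, int lda)
{
    return (k + i - j) + j * lda;
}

/* Solve A * x = b in place by forward substitution for a lower-triangular
   band matrix with k subdiagonals. On entry x holds b. */
static void ref_stbsv_lower(int n, int k, const float *band, int lda, float *x)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    for (int j = 0; j < n; ++j) {
        x[j] /= band[band_lower_index(j, j, lda)];   /* no singularity test */
        int last = (j + k < n - 1) ? j + k : n - 1;
        for (int i = j + 1; i <= last; ++i)
            x[i] -= x[j] * band[band_lower_index(i, j, lda)];
    }
}
```

Each column touches at most k entries below the diagonal, so the work is proportional to n * k rather than n * n, which is the reason for storing only the band.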
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Error status for this function can be retrieved via cublasGetError().
Function cublasStpmv()
void
cublasStpmv (char uplo, char trans, char diag, int n,
const float *AP, float *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A or op(A) = A^T,
x is an n-element single-precision vector, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of single-precision
elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not matrix A is unit triangular.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
single-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', the array AP contains the upper triangular matrix A,
packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
AP[i + ((2 * n - j - 1) * j) / 2].
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/stpmv.f
Error status for this function can be retrieved via cublasGetError().
Function cublasStpsv()
void
cublasStpsv (char uplo, char trans, char diag, int n,
const float *AP, float *x, int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element single-precision vectors, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Reference: http://www.netlib.org/blas/stpsv.f
Error status for this function can be retrieved via cublasGetError().
Input
uplo
specifies whether the matrix is an upper or lower triangular matrix. If
uplo == 'U' or 'u', A is an upper triangular matrix. If uplo == 'L'
or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
single-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular matrix A,
packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
AP[i + ((2 * n - j - 1) * j) / 2]. When diag == 'U' or 'u', the
diagonal elements of A are not referenced and are assumed to be unity.
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the n-element right-hand side vector b. On exit, it
is overwritten with the solution vector x.
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
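Packed storage is easiest to follow with the index arithmetic written out. The sketch below is host-side C for illustration, not the CUBLAS implementation; the function names are hypothetical and i, j are 0-based. The lower-triangle offset used here, i + ((2 * n - j - 1) * j) / 2, is the one under which the columns pack contiguously with 0-based indices. The solver handles only the upper-triangular, op(A) = A, incx == 1 case.

```c
#include <assert.h>

/* 0-based offset of A(i,j), i <= j, in upper packed storage. */
static int packed_upper_index(int i, int j)
{
    assert(0 <= i && i <= j);
    return i + (j * (j + 1)) / 2;
}

/* 0-based offset of A(i,j), i >= j, in lower packed storage of order n. */
static int packed_lower_index(int i, int j, int n)
{
    assert(0 <= j && j <= i && i < n);
    return i + ((2 * n - j - 1) * j) / 2;
}

/* Solve A * x = b in place by back substitution for an upper-triangular
   packed matrix. On entry x holds b. A nonzero unit_diag means the
   diagonal is assumed to be unity and is never read. */
static void ref_stpsv_upper(int n, int unit_diag, const float *AP, float *x)
{
    assert(n >= 0);
    for (int j = n - 1; j >= 0; --j) {
        if (!unit_diag)
            x[j] /= AP[packed_upper_index(j, j)];   /* no singularity test */
        for (int i = 0; i < j; ++i)
            x[i] -= x[j] * AP[packed_upper_index(i, j)];
    }
}
```

For n = 3 the upper packed array holds A(0,0), A(0,1), A(1,1), A(0,2), A(1,2), A(2,2) in that order, and the lower packed array holds A(0,0), A(1,0), A(2,0), A(1,1), A(2,1), A(2,2).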
Function cublasStrmv()
void
cublasStrmv (char uplo, char trans, char diag, int n,
const float *A, int lda, float *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A or op(A) = A^T,
x is an n-element single-precision vector, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of single-precision
elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
the leading n×n upper triangular part of the array A must contain the
upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading n×n lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
either, but are assumed to be unity.
lda
leading dimension of A; lda must be at least max(1, n).
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/strmv.f
Error status for this function can be retrieved via cublasGetError().
Function cublasStrsv()
void
cublasStrsv (char uplo, char trans, char diag, int n,
const float *A, int lda, float *x, int incx)
solves a system of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element single-precision vectors, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of single-precision
elements. Matrix A is stored in column-major format, and lda is the
leading dimension of the two-dimensional array containing A.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Reference: http://www.netlib.org/blas/strsv.f
Error status for this function can be retrieved via cublasGetError().
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced. If uplo == 'L' or 'l', only
the lower triangular part of A may be referenced.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
A contains the upper triangular matrix, and the strictly lower
triangular part is not referenced. If uplo == 'L' or 'l', A contains
the lower triangular matrix, and the strictly upper triangular part
is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
x
single-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the n-element, right-hand-side vector b. On exit,
it is overwritten with the solution vector x.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
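How uplo, trans, and diag interact in a triangular solve can be seen in a compact host-side reference. The sketch below is plain C under simplifying assumptions (incx == 1, column-major A, 0-based indices); `ref_strsv` is a hypothetical name and the code is not the CUBLAS implementation. The key observation is that transposing a lower-triangular matrix yields an upper-triangular system and vice versa, so the substitution direction depends on both uplo and trans.

```c
#include <assert.h>

/* Solve op(A) * x = b in place for a dense triangular matrix.
   On entry x holds b. Only the triangle selected by uplo is read. */
static void ref_strsv(char uplo, char trans, char diag, int n,
                      const float *A, int lda, float *x)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int upper = (uplo == 'U' || uplo == 'u');
    int tr = !(trans == 'N' || trans == 'n');
    int unit = (diag == 'U' || diag == 'u');
    int eff_upper = (upper != tr);   /* transposing swaps the triangle */

    for (int s = 0; s < n; ++s) {
        int i = eff_upper ? n - 1 - s : s;   /* row solved in this step */
        int lo = eff_upper ? i + 1 : 0;      /* unknowns already solved: */
        int hi = eff_upper ? n : i;          /* columns lo .. hi - 1      */
        float sum = x[i];
        for (int j = lo; j < hi; ++j) {
            /* Element (i,j) of op(A): A(j,i) when transposed. */
            float a = tr ? A[j + i * lda] : A[i + j * lda];
            sum -= a * x[j];
        }
        /* A unit diagonal is assumed, never read; no singularity test. */
        x[i] = unit ? sum : sum / A[i + i * lda];
    }
}
```

As the text notes, a zero on the diagonal is not detected; in this sketch it would simply produce an infinity or NaN in x.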
Single-Precision Complex BLAS2 Functions
The single-precision complex BLAS2 functions are as follows:
Function cublasCgbmv()
Function cublasCgemv()
Function cublasCgerc()
Function cublasCgeru()
Function cublasChbmv()
Function cublasChemv()
Function cublasCher()
Function cublasCher2()
Function cublasChpmv()
Function cublasChpr()
Function cublasChpr2()
Function cublasCtbmv()
Function cublasCtbsv()
Function cublasCtpmv()
Function cublasCtpsv()
Function cublasCtrmv()
Function cublasCtrsv()
Function cublasCgbmv()
void
cublasCgbmv (char trans, int m, int n, int kl, int ku,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *x, int incx,
cuComplex beta, cuComplex *y, int incy)
performs one of the matrix-vector operations
y = alpha * op(A) * x + beta * y,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
alpha and beta are single-precision complex scalars, and x and y are
single-precision complex vectors. A is an m×n band matrix consisting
of single-precision complex elements with kl subdiagonals and ku
superdiagonals.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
kl
specifies the number of subdiagonals of matrix A; kl must be at least
zero.
ku
specifies the number of superdiagonals of matrix A; ku must be at
least zero.
alpha
single-precision complex scalar multiplier applied to op(A).
A
single-precision complex array of dimensions (lda, n). The leading
(kl+ku+1)×n part of the array A must contain the band matrix A,
supplied column by column, with the leading diagonal of the matrix in
row (ku+1) of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row (ku+2), and
so on. Elements in the array A that do not correspond to elements in
the band matrix (such as the top left ku×ku triangle) are not
referenced.
lda
leading dimension of A; lda must be at least (kl+ku+1).
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)) if trans == 'N' or 'n', else at least
(1 + (m - 1) * abs(incx)).
incx
specifies the increment for the elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y. If beta
is zero, y is not read.
y
single-precision complex array of length at least
(1 + (m - 1) * abs(incy)) if trans == 'N' or 'n', else at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
specifies the increment for the elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/cgbmv.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCgemv()
void
cublasCgemv (char trans, int m, int n,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *x, int incx,
cuComplex beta, cuComplex *y, int incy)
performs one of the matrix-vector operations
y = alpha * op(A) * x + beta * y,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
alpha and beta are single-precision complex scalars; and x and y are
single-precision complex vectors. A is an m×n matrix consisting of
single-precision complex elements. Matrix A is stored in column-major
format, and lda is the leading dimension of the two-dimensional array
in which A is stored.
Reference: http://www.netlib.org/blas/cgemv.f
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to op(A).
A
single-precision complex array of dimensions (lda, n) if trans ==
'N' or 'n', of dimensions (lda, m) otherwise; lda must be at least
max(1, m) if trans == 'N' or 'n' and at least max(1, n) otherwise.
lda
leading dimension of two-dimensional array used to store matrix A.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)) if trans == 'N' or 'n', else at least
(1 + (m - 1) * abs(incx)).
incx
specifies the storage spacing for elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y. If beta
is zero, y is not read.
y
single-precision complex array of length at least
(1 + (m - 1) * abs(incy)) if trans == 'N' or 'n', else at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Error status for this function can be retrieved via cublasGetError().
Function cublasCgerc()
void
cublasCgerc (int m, int n, cuComplex alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
performs the rank 1 operation
A = alpha * x * y^H + A,
where alpha is a single-precision complex scalar, x is an m-element
single-precision complex vector, y is an n-element single-precision
complex vector, and A is an m×n matrix consisting of single-precision
complex elements. Matrix A is stored in column-major format, and lda
is the leading dimension of the two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to x * y^H.
x
single-precision complex array of length at least
(1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
single-precision complex array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^H + A.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/cgerc.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCgeru()
void
cublasCgeru (int m, int n, cuComplex alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
performs the rank 1 operation
A = alpha * x * y^T + A,
where alpha is a single-precision complex scalar, x is an m-element
single-precision complex vector, y is an n-element single-precision
complex vector, and A is an m×n matrix consisting of single-precision
complex elements. Matrix A is stored in column-major format, and lda
is the leading dimension of the two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to x * y^T.
x
single-precision complex array of length at least
(1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
single-precision complex array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^T + A.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/cgeru.f
Error status for this function can be retrieved via cublasGetError().
Function cublasChbmv()
void
cublasChbmv (char uplo, int n, int k, cuComplex alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
cuComplex beta, cuComplex *y, int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are single-precision complex scalars, and x and
y are n-element single-precision complex vectors. A is a Hermitian n×n
band matrix that consists of single-precision complex elements, with k
superdiagonals and the same number of subdiagonals.
Reference: http://www.netlib.org/blas/chbmv.f
Error status for this function can be retrieved via cublasGetError().
Input
uplo
specifies whether the upper or lower triangular part of the Hermitian
band matrix A is being supplied. If uplo == 'U' or 'u', the upper
triangular part is being supplied. If uplo == 'L' or 'l', the lower
triangular part is being supplied.
n
specifies the number of rows and the number of columns of the
Hermitian matrix A; n must be at least zero.
k
specifies the number of superdiagonals of matrix A. Since the matrix is
Hermitian, this is also the number of subdiagonals; k must be at least
zero.
alpha
single-precision complex scalar multiplier applied to A * x.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading (k + 1)×n part of array A must contain the
upper triangular band of the Hermitian matrix, supplied column by
column, with the leading diagonal of the matrix in row k + 1 of the
array, the first superdiagonal starting at position 2 in row k, and so on.
The top left k×k triangle of array A is not referenced. When uplo ==
'L' or 'l', the leading (k + 1)×n part of array A must contain the
lower triangular band part of the Hermitian matrix, supplied column
by column, with the leading diagonal of the matrix in row 1 of the
array, the first subdiagonal starting at position 1 in row 2, and so on.
The bottom right k×k triangle of array A is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero.
lda
leading dimension of A; lda must be at least k + 1.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if k < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
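The Hermitian band layout stores only one triangle, and the other is inferred by conjugation. A host-side sketch of the upper-storage case is given below. It is illustrative C only, not the CUBLAS implementation: it uses the C99 `float complex` type in place of cuComplex, the name `ref_chbmv_upper` is hypothetical, indices are 0-based, and incx == incy == 1 is assumed.

```c
#include <assert.h>
#include <complex.h>

/* y = alpha*A*x + beta*y for a Hermitian band matrix in the upper layout:
   A(i,j) with j - k <= i <= j is stored at band[(k + i - j) + j*lda]. */
static void ref_chbmv_upper(int n, int k, float complex alpha,
                            const float complex *band, int lda,
                            const float complex *x,
                            float complex beta, float complex *y)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    /* When beta is zero, y is overwritten without being read. */
    for (int i = 0; i < n; ++i)
        y[i] = (beta == 0.0f) ? 0.0f : beta * y[i];

    for (int j = 0; j < n; ++j) {
        /* Diagonal: the stored imaginary part is ignored (assumed zero). */
        y[j] += alpha * crealf(band[k + j * lda]) * x[j];
        int first = (j - k > 0) ? j - k : 0;
        for (int i = first; i < j; ++i) {
            float complex a = band[(k + i - j) + j * lda];  /* stored A(i,j) */
            y[i] += alpha * a * x[j];
            y[j] += alpha * conjf(a) * x[i];  /* inferred A(j,i) = conj(A(i,j)) */
        }
    }
}
```

Each stored off-diagonal element contributes twice, once directly and once through its conjugate, so the lower band never needs to be materialized.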
Function cublasChemv()
void
cublasChemv (char uplo, int n, cuComplex alpha,
const cuComplex *A, int lda,
const cuComplex *x, int incx,
cuComplex beta, cuComplex *y, int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are single-precision complex scalars, and x and
y are n-element single-precision complex vectors. A is a Hermitian n×n
matrix that consists of single-precision complex elements and is stored
in either upper or lower storage mode.
Input
uplo
specifies whether the upper or lower triangular part of the array A is
referenced. If uplo == 'U' or 'u', the Hermitian matrix A is stored in
upper storage mode; that is, only the upper triangular part of A is
referenced while the lower triangular part of A is inferred. If uplo ==
'L' or 'l', the Hermitian matrix A is stored in lower storage mode;
that is, only the lower triangular part of A is referenced while the upper
triangular part of A is inferred.
n
specifies the number of rows and the number of columns of the
Hermitian matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to A * x.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading n×n upper triangular part of the array A must
contain the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced. If uplo == 'L' or
'l', the leading n×n lower triangular part of the array A must contain
the lower triangular part of the Hermitian matrix, and the strictly
upper triangular part of A is not referenced. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero.
lda
leading dimension of A; lda must be at least max(1, n).
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/chemv.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCher()
void
cublasCher (char uplo, int n, float alpha,
const cuComplex *x, int incx, cuComplex *A,
int lda)
performs the Hermitian rank 1 operation
A = alpha * x * x^H + A,
where alpha is a single-precision scalar, x is an n-element
single-precision complex vector, and A is an n×n Hermitian matrix
consisting of single-precision complex elements. A is stored in
column-major format, and lda is the leading dimension of the
two-dimensional array containing A.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced. If uplo == 'L' or 'l', only the
lower triangular part of A is referenced.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * x^H.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the Hermitian
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the Hermitian
matrix, and the strictly upper triangular part is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * x^H + A.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/cher.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCher2()
void
cublasCher2 (char uplo, int n, cuComplex alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
performs the Hermitian rank 2 operation
A = alpha * x * y^H + conj(alpha) * y * x^H + A,
where alpha is a single-precision complex scalar, x and y are n-element
single-precision complex vectors, and A is an n×n Hermitian
matrix consisting of single-precision complex elements.
Reference: http://www.netlib.org/blas/cher2.f
Error status for this function can be retrieved via cublasGetError().
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to x * y^H
and whose conjugate is applied to y * x^H.
x
single-precision array of length at least .
incx
the storage spacing between elements of x; incx must not be zero.
y
single-precision array of length at least .
incy
the storage spacing between elements of y; incy must not be zero.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the Hermitian
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the Hermitian
matrix, and the strictly upper triangular part is not referenced. The
imaginary parts of the diagonal elements need not be set, they are
assumed to be zero, and on exit they are set to zero.
lda
leading dimension of A; lda must be at least max(1, n).
Out put
A
updated according to
Error St at us
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
x y
H
* y x
H
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
H
alpha y x
H
A + * * + * * =
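
The Hermitian rank 2 update can be spelled out element by element on the host, which is handy for checking GPU results on small inputs. The sketch below is an illustration only: cher2_ref is a hypothetical name that is not part of CUBLAS, it uses C99 float complex in place of cuComplex, and it assumes incx and incy are both positive.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical host-side reference (not a CUBLAS function) for
   A = alpha * x * y^H + conj(alpha) * y * x^H + A.
   A is n-by-n, column-major, with leading dimension lda. Only the
   triangle selected by uplo is read or written.
   Assumption of this sketch: incx > 0 and incy > 0. */
static void cher2_ref(char uplo, int n, float complex alpha,
                      const float complex *x, int incx,
                      const float complex *y, int incy,
                      float complex *A, int lda)
{
    int upper = (uplo == 'U' || uplo == 'u');
    for (int j = 0; j < n; ++j) {
        int i_begin = upper ? 0 : j;
        int i_end   = upper ? j : n - 1;
        float complex xj = x[(size_t)j * incx];
        float complex yj = y[(size_t)j * incy];
        for (int i = i_begin; i <= i_end; ++i) {
            float complex xi = x[(size_t)i * incx];
            float complex yi = y[(size_t)i * incy];
            float complex *a = &A[i + (size_t)j * lda];
            /* (x * y^H)[i,j] = x_i * conj(y_j); (y * x^H)[i,j] = y_i * conj(x_j) */
            *a += alpha * xi * conjf(yj) + conjf(alpha) * yi * conjf(xj);
            if (i == j) {
                /* the imaginary part of each diagonal element is zero on exit */
                *a = crealf(*a);
            }
        }
    }
}
```

Comparing the selected triangle of the device result against this routine for a small n is one way to confirm that uplo, lda, and the increments were passed as intended.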

Function cublasChpmv()
void
cublasChpmv (char uplo, int n, cuComplex alpha,
const cuComplex *AP, const cuComplex *x,
int incx, cuComplex beta, cuComplex *y,
int incy)
performs the matrix-vector operation

y = alpha * A * x + beta * y,

where alpha and beta are single-precision complex scalars, and x and
y are n-element single-precision complex vectors. A is a Hermitian n×n
matrix that consists of single-precision complex elements and is
supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to A * x.
AP
single-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/chpmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasChpr()
void
cublasChpr (char uplo, int n, float alpha,
const cuComplex *x, int incx, cuComplex *AP)
performs the Hermitian rank 1 operation

A = alpha * x * x^H + A,

where alpha is a single-precision scalar, x is an n-element
single-precision complex vector, and A is an n×n Hermitian matrix consisting
of single-precision complex elements that is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision scalar multiplier applied to x * x^H.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
Input (continued)
incx
the storage spacing between elements of x; incx must not be zero.
AP
single-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
Output
A
updated according to A = alpha * x * x^H + A.
Reference: http://www.netlib.org/blas/chpr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasChpr2()
void
cublasChpr2 (char uplo, int n, cuComplex alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy, cuComplex *AP)
performs the Hermitian rank 2 operation

A = alpha * x * y^H + conj(alpha) * y * x^H + A,

where alpha is a single-precision complex scalar, x and y are n-element
single-precision complex vectors, and A is an n×n Hermitian
matrix consisting of single-precision complex elements that is supplied
in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to x * y^H,
and whose conjugate is applied to y * x^H.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
single-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
AP
single-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
Output
A
updated according to A = alpha * x * y^H + conj(alpha) * y * x^H + A.
Reference: http://www.netlib.org/blas/chpr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
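
The packed layout used by the cublasChp* functions can be captured in a few index helpers. The sketch below is illustrative: the function names are hypothetical (not part of CUBLAS), and i, j are zero-based row and column indices. With zero-based indices, column j of the lower triangle starts at offset (j * (2 * n - j + 1)) / 2 and element A[i,j] sits (i - j) entries further on, which simplifies to i + ((2 * n - j - 1) * j) / 2.

```c
#include <stddef.h>

/* Hypothetical helpers (not CUBLAS functions) for packed triangular storage. */

/* Number of elements AP must hold for an n-by-n matrix: (n * (n + 1)) / 2. */
static size_t packed_size(int n)
{
    return ((size_t)n * (size_t)(n + 1)) / 2;
}

/* Zero-based offset of A[i,j] in AP when the upper triangle is packed
   column by column; requires i <= j. */
static size_t packed_index_upper(int i, int j)
{
    return (size_t)i + ((size_t)j * (size_t)(j + 1)) / 2;
}

/* Zero-based offset of A[i,j] in AP when the lower triangle is packed
   column by column; requires i >= j. */
static size_t packed_index_lower(int i, int j, int n)
{
    return (size_t)i + ((size_t)(2 * n - j - 1) * (size_t)j) / 2;
}
```

For n = 3 both mappings enumerate the offsets 0 through 5 exactly once over their triangle, which is a quick sanity check when filling AP on the host before copying it to the GPU.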

Function cublasCtbmv()
void
cublasCtbmv (char uplo, char trans, char diag, int n,
int k, const cuComplex *A, int lda,
cuComplex *x, int incx)
performs one of the matrix-vector operations

x = op(A) * x,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element single-precision complex vector, and A is an n×n, unit
or non-unit, upper or lower, triangular band matrix consisting of
single-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not matrix A is unit triangular. If diag == 'U' or
'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
k
specifies the number of superdiagonals or subdiagonals. If uplo ==
'U' or 'u', k specifies the number of superdiagonals. If uplo == 'L'
or 'l', k specifies the number of subdiagonals; k must be at least zero.
Input (continued)
A
single-precision complex array of dimension (lda, n). If uplo ==
'U' or 'u', the leading (k+1)×n part of the array A must contain the
upper triangular band matrix, supplied column by column, with the
leading diagonal of the matrix in row k+1 of the array, the first
superdiagonal starting at position 2 in row k, and so on. The top left
k×k triangle of the array A is not referenced. If uplo == 'L' or 'l',
the leading (k+1)×n part of the array A must contain the lower
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row 1 of the array, the first subdiagonal
starting at position 1 in row 2, and so on. The bottom right k×k
triangle of the array is not referenced.
lda
is the leading dimension of A; lda must be at least k+1.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ctbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0, k < 0, or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
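
The band layout described for array A can be illustrated with a small host-side packing routine. This is a sketch under stated assumptions: pack_tri_band is a hypothetical helper (not a CUBLAS function), it uses float elements for brevity where the function above takes cuComplex, and all indices are zero-based. In zero-based terms, element [i,j] of an upper band matrix lands in band row k + i - j of column j, and element [i,j] of a lower band matrix lands in band row i - j of column j.

```c
#include <stddef.h>

/* Hypothetical helper (not a CUBLAS function): copy the triangular band of a
   dense column-major n-by-n matrix into the (k+1)-by-n band layout.
   dense has leading dimension ldd; band has leading dimension ldb >= k + 1.
   Band entries that do not correspond to matrix elements are left untouched. */
static void pack_tri_band(char uplo, int n, int k,
                          const float *dense, int ldd,
                          float *band, int ldb)
{
    int upper = (uplo == 'U' || uplo == 'u');
    for (int j = 0; j < n; ++j) {
        /* rows of column j that lie inside the band */
        int i_lo = upper ? (j - k > 0 ? j - k : 0) : j;
        int i_hi = upper ? j : (j + k < n - 1 ? j + k : n - 1);
        for (int i = i_lo; i <= i_hi; ++i) {
            int row = upper ? (k + i - j) : (i - j);
            band[(size_t)row + (size_t)j * ldb] = dense[(size_t)i + (size_t)j * ldd];
        }
    }
}
```

With uplo == 'U' the main diagonal ends up in the last band row (row k+1 in the one-based wording above) and the top left k×k triangle of the band array is never written, matching the description of the A argument.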

Function cublasCtbsv()
void
cublasCtbsv (char uplo, char trans, char diag, int n,
int k, const cuComplex *A, int lda,
cuComplex *X, int incx)
solves one of the systems of equations

op(A) * x = b,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element vectors, and A is an n×n, unit or non-unit, upper
or lower, triangular band matrix with k+1 diagonals.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
the number of rows and columns of matrix A; n must be at least zero.
k
specifies the number of superdiagonals or subdiagonals.
If uplo == 'U' or 'u', k specifies the number of superdiagonals. If
uplo == 'L' or 'l', k specifies the number of subdiagonals; k must
be at least zero.
Input (continued)
A
single-precision complex array of dimension (lda, n). If uplo ==
'U' or 'u', the leading (k+1)×n part of the array A must contain the
upper triangular band matrix, supplied column by column, with the
leading diagonal of the matrix in row k+1 of the array, the first
superdiagonal starting at position 2 in row k, and so on. The top left
k×k triangle of the array A is not referenced. If uplo == 'L' or 'l',
the leading (k+1)×n part of the array A must contain the lower
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row 1 of the array, the first subdiagonal
starting at position 1 in row 2, and so on. The bottom right k×k
triangle of the array is not referenced.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ctbsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasCtpmv()
void
cublasCtpmv (char uplo, char trans, char diag, int n,
const cuComplex *AP, cuComplex *x, int incx)
performs one of the matrix-vector operations

x = op(A) * x,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element single-precision complex vector, and A is an n×n, unit
or non-unit, upper or lower, triangular matrix consisting of
single-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not matrix A is unit triangular.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
single-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', the array AP contains the upper triangular
matrix A, packed sequentially, column by column;
that is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If
uplo == 'L' or 'l', array AP contains the lower triangular
matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j - 1) * j) / 2].
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the source vector.
On exit, x is overwritten with the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ctpmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
Error Status (continued)
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasCtpsv()
void
cublasCtpsv (char uplo, char trans, char diag, int n,
const cuComplex *AP, cuComplex *X, int incx)
solves one of the systems of equations

op(A) * x = b,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element complex vectors, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix is an upper or lower triangular matrix. If
uplo == 'U' or 'u', A is an upper triangular matrix. If uplo == 'L'
or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
Input (continued)
AP
single-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular matrix
A, packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
AP[i + ((2 * n - j - 1) * j) / 2]. When diag == 'U' or 'u', the
diagonal elements of A are not referenced and are assumed to be unity.
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ctpsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasCtrmv()
void
cublasCtrmv (char uplo, char trans, char diag, int n,
const cuComplex *A, int lda, cuComplex *x,
int incx)
performs one of the matrix-vector operations

x = op(A) * x,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element single-precision complex vector, and A is an n×n, unit
or non-unit, upper or lower, triangular matrix consisting of
single-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading n×n upper triangular part of the array A must
contain the upper triangular matrix, and the strictly lower triangular
part of A is not referenced. If uplo == 'L' or 'l', the leading n×n
lower triangular part of the array A must contain the lower triangular
matrix, and the strictly upper triangular part of A is not referenced.
When diag == 'U' or 'u', the diagonal elements of A are not
referenced either, but are assumed to be unity.
lda
leading dimension of A; lda must be at least max(1, n).
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the source vector.
On exit, x is overwritten with the result vector.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ctrmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
Error Status (continued)
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasCtrsv()
void
cublasCtrsv (char uplo, char trans, char diag, int n,
const cuComplex *A, int lda, cuComplex *x,
int incx)
solves a system of equations

op(A) * x = b,

where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element single-precision complex vectors, and A is an
n×n, unit or non-unit, upper or lower, triangular matrix consisting of
single-precision complex elements. Matrix A is stored in column-major
format, and lda is the leading dimension of the two-dimensional array
containing A.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced. If uplo == 'L' or 'l', only
the lower triangular part of A may be referenced.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
Input (continued)
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the triangular
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the triangular
matrix, and the strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
x
single-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the n-element,
right-hand-side vector b. On exit, it is overwritten with solution vector x.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ctrsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
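
Because the triangular solvers perform no singularity test, a caller may want to inspect the diagonal on the host before the matrix is sent to the GPU. The sketch below shows one minimal check and is an assumption-laden illustration: first_zero_diagonal is a hypothetical helper (not part of CUBLAS), it operates on a host copy stored as C99 float complex in column-major order, and it only detects exactly-zero diagonal entries, not near-singularity.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not a CUBLAS function): return the zero-based index of
   the first exactly-zero diagonal entry of the n-by-n column-major matrix A
   (leading dimension lda), or -1 if there is none. When diag is 'U' or 'u'
   the diagonal is implicitly unity and is not inspected. */
static int first_zero_diagonal(char diag, int n,
                               const float complex *A, int lda)
{
    if (diag == 'U' || diag == 'u') {
        return -1;
    }
    for (int j = 0; j < n; ++j) {
        float complex d = A[(size_t)j + (size_t)j * lda];
        if (crealf(d) == 0.0f && cimagf(d) == 0.0f) {
            return j;
        }
    }
    return -1;
}
```

A return value other than -1 means the system op(A) * x = b has no unique solution. A stricter test, for example comparing each diagonal magnitude against a tolerance, would be needed to catch near-singular matrices.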

CHAPTER 4
Double-Precision BLAS2 Functions
The Level 2 Basic Linear Algebra Subprograms (BLAS2) are functions
that perform matrix-vector operations. The CUBLAS implementations
of double-precision BLAS2 functions are described in these sections:
Double-Precision BLAS2 Functions
Double-Precision Complex BLAS2 functions

Double-Precision BLAS2 Functions
Note: Double-precision functions are only supported on GPUs with
double-precision hardware.
The double-precision BLAS2 functions are as follows:
Function cublasDgbmv()
Function cublasDgemv()
Function cublasDger()
Function cublasDsbmv()
Function cublasDspmv()
Function cublasDspr()
Function cublasDspr2()
Function cublasDsymv()
Function cublasDsyr()
Function cublasDsyr2()
Function cublasDtbmv()
Function cublasDtbsv()
Function cublasDtpmv()
Function cublasDtpsv()
Function cublasDtrmv()
Function cublasDtrsv()

Function cublasDgbmv()
void
cublasDgbmv (char trans, int m, int n, int kl, int ku,
double alpha, const double *A, int lda,
const double *x, int incx, double beta,
double *y, int incy)
performs one of the matrix-vector operations

y = alpha * op(A) * x + beta * y,

where op(A) = A or op(A) = A^T,
alpha and beta are double-precision scalars, and x and y are
double-precision vectors. A is an m×n band matrix consisting of
double-precision elements with kl subdiagonals and ku superdiagonals.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
kl
specifies the number of subdiagonals of matrix A; kl must be at least
zero.
ku
specifies the number of superdiagonals of matrix A; ku must be at
least zero.
alpha
double-precision scalar multiplier applied to op(A).
A
double-precision array of dimensions (lda, n). The leading
(kl+ku+1)×n part of the array A must contain the band matrix A,
supplied column by column, with the leading diagonal of the matrix in
row (ku+1) of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row (ku+2), and
so on. Elements in the array A that do not correspond to elements in
the band matrix (such as the top left ku×ku triangle) are not
referenced.
lda
leading dimension of A; lda must be at least (kl+ku+1).
x
double-precision array of length at least (1 + (n - 1) * abs(incx)) if
trans == 'N' or 'n', else at least (1 + (m - 1) * abs(incx)).
incx
specifies the increment for the elements of x; incx must not be zero.
Input (continued)
beta
double-precision scalar multiplier applied to vector y. If beta is zero,
y is not read.
y
double-precision array of length at least (1 + (m - 1) * abs(incy)) if
trans == 'N' or 'n', else at least (1 + (n - 1) * abs(incy)). If
beta is zero, y is not read.
incy
on entry, incy specifies the increment for the elements of y; incy
must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/dgbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasDgemv()
void
cublasDgemv (char trans, int m, int n, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)
performs one of the matrix-vector operations

y = alpha * op(A) * x + beta * y,

where op(A) = A or op(A) = A^T,
alpha and beta are double-precision scalars, and x and y are
double-precision vectors. A is an m×n matrix consisting of double-precision
elements. Matrix A is stored in column-major format, and lda is the
leading dimension of the two-dimensional array in which A is stored.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to op(A).
A
double-precision array of dimensions (lda, n) that holds the m×n
matrix A.
lda
leading dimension of two-dimensional array used to store matrix A;
lda must be at least max(1, m).
x
double-precision array of length at least (1 + (n - 1) * abs(incx)) if
trans == 'N' or 'n', else at least (1 + (m - 1) * abs(incx)).
incx
specifies the storage spacing for elements of x; incx must not be zero.
beta
double-precision scalar multiplier applied to vector y. If beta is zero,
y is not read.
y
double-precision array of length at least (1 + (m - 1) * abs(incy)) if
trans == 'N' or 'n', else at least (1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/dgemv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
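
How trans swaps the roles of m and n for the vector lengths is easiest to see in a plain host implementation. The following is a minimal sketch, not the library's implementation: dgemv_ref is a hypothetical name, it runs on the CPU, and it assumes incx and incy are positive.

```c
#include <stddef.h>

/* Hypothetical host-side reference (not a CUBLAS function) for
   y = alpha * op(A) * x + beta * y.
   A is m-by-n, column-major, with leading dimension lda >= max(1, m).
   Assumption of this sketch: incx > 0 and incy > 0. */
static void dgemv_ref(char trans, int m, int n, double alpha,
                      const double *A, int lda,
                      const double *x, int incx,
                      double beta, double *y, int incy)
{
    int no_trans = (trans == 'N' || trans == 'n');
    int len_y = no_trans ? m : n;   /* number of rows of op(A) */
    int len_x = no_trans ? n : m;   /* number of columns of op(A) */
    for (int r = 0; r < len_y; ++r) {
        double sum = 0.0;
        for (int c = 0; c < len_x; ++c) {
            /* op(A)[r,c] is A[r,c] for 'N' and A[c,r] for the transpose */
            double a = no_trans ? A[(size_t)r + (size_t)c * lda]
                                : A[(size_t)c + (size_t)r * lda];
            sum += a * x[(size_t)c * incx];
        }
        double *yr = &y[(size_t)r * incy];
        /* when beta is zero, y is not read */
        *yr = (beta == 0.0) ? alpha * sum : alpha * sum + beta * (*yr);
    }
}
```

With trans == 'N' the routine reads n elements of x and writes m elements of y; with the transpose it reads m elements of x and writes n elements of y, which is why the minimum lengths of x and y depend on trans.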

Function cublasDger()
void
cublasDger (int m, int n, double alpha, const double *x,
int incx, const double *y, int incy,
double *A, int lda)
performs the rank 1 operation

A = alpha * x * y^T + A,

where alpha is a double-precision scalar, x is an m-element
double-precision vector, y is an n-element double-precision vector, and A is an
m×n matrix consisting of double-precision elements. Matrix A is stored
in column-major format, and lda is the leading dimension of the
two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * y^T.
x
double-precision array of length at least (1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
double-precision array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^T + A.
Reference: http://www.netlib.org/blas/dger.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or
incy == 0
Error Status (continued)
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Function cublasDsbmv()
void
cublasDsbmv (char uplo, int n, int k, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)
performs the matrix-vector operation

y = alpha * A * x + beta * y,

where alpha and beta are double-precision scalars, and x and y are
n-element double-precision vectors. A is an n×n symmetric band matrix
consisting of double-precision elements, with k superdiagonals and
the same number of subdiagonals.
Input
uplo
specifies whether the upper or lower triangular part of the symmetric
band matrix A is being supplied. If uplo == 'U' or 'u', the upper
triangular part is being supplied. If uplo == 'L' or 'l', the lower
triangular part is being supplied.
n
specifies the number of rows and the number of columns of the
symmetric matrix A; n must be at least zero.
k
specifies the number of superdiagonals of matrix A. Since the matrix is
symmetric, this is also the number of subdiagonals; k must be at least
zero.
alpha
double-precision scalar multiplier applied to A * x.
Input (continued)
A
double-precision array of dimensions (lda, n). When uplo == 'U'
or 'u', the leading (k+1)×n part of array A must contain the upper
triangular band of the symmetric matrix, supplied column by column,
with the leading diagonal of the matrix in row k+1 of the array, the
first superdiagonal starting at position 2 in row k, and so on. The top
left k×k triangle of the array A is not referenced. When uplo == 'L'
or 'l', the leading (k+1)×n part of the array A must contain the
lower triangular band part of the symmetric matrix, supplied column
by column, with the leading diagonal of the matrix in row 1 of the
array, the first subdiagonal starting at position 1 in row 2, and so on.
The bottom right k×k triangle of the array A is not referenced.
lda
leading dimension of A; lda must be at least k+1.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision scalar multiplier applied to vector y.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/dsbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if k < 0, n < 0, incx == 0, or
incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
PG- 05326- 032_V02 141
NVI DI A
CHAPTER 4 Double- Precision BLAS2 Funct ions
Funct ion cublasDspmv( )
void
cublasDspmv (char uplo, int n, double alpha,
const double *AP, const double *x, int incx,
double beta, double *y, int incy)
performsthematrixvectoroperation
wherealphaandbetaaredoubleprecisionscalars,andxandyare
nelementdoubleprecisionvectors.Aisasymmetricnnmatrixthat
consistsofdoubleprecisionelementsandissuppliedinpackedform.
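Packed storage, as used by cublasDspmv() and the other packed routines, keeps only one triangle of the matrix, column by column, in (n * (n + 1)) / 2 elements. The following host-side sketch shows one way to compute offsets into such an array. The helper names are hypothetical, indices are assumed to be 0-based, and the lower-triangle offset is obtained by counting the entries of the preceding columns rather than from a closed-form expression.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of A(i,j), i <= j, in upper packed storage (0-based i and j). */
static size_t packed_upper_index(int i, int j)
{
    assert(0 <= i && i <= j);
    return (size_t)i + ((size_t)j * ((size_t)j + 1)) / 2;
}

/* Offset of A(i,j), i >= j, in lower packed storage of order n (0-based).
 * Columns 0..j-1 hold n, n-1, ..., n-j+1 entries; they are simply counted. */
static size_t packed_lower_index(int n, int i, int j)
{
    assert(0 <= j && j <= i && i < n);
    size_t before = 0;
    for (int c = 0; c < j; ++c)
        before += (size_t)(n - c);
    return before + (size_t)(i - j);
}

/* Copy the upper triangle of a column-major n x n matrix into packed form. */
static void pack_upper(int n, const double *a, int lda, double *ap)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i <= j; ++i)
            ap[packed_upper_index(i, j)] = a[(size_t)i + (size_t)j * (size_t)lda];
}
```

A packed array built this way on the host would still have to be copied to device memory before being passed as AP.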
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to A * x.
AP
double-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2].
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision scalar multiplier applied to vector y.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/dspmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDspr()
void
cublasDspr (char uplo, int n, double alpha,
const double *x, int incx, double *AP)
performs the symmetric rank 1 operation
A = alpha * x * x^T + A,
where alpha is a double-precision scalar, and x is an n-element
double-precision vector. A is a symmetric n×n matrix that consists of
double-precision elements and is supplied in packed form.
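The rank 1 update performed by cublasDspr() can be sketched on the host as follows. The helper ref_dspr_upper is a hypothetical name, not a CUBLAS call. It assumes upper packed storage and 0-based indexing, and it assumes the reference BLAS convention for strides: a vector occupies 1 + (n - 1) * abs(incx) elements, and a negative incx traverses that storage from the end.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for A = alpha*x*x^T + A, A in upper packed storage.
 * Hypothetical helper for illustration; not a CUBLAS entry point. */
static void ref_dspr_upper(int n, double alpha,
                           const double *x, int incx, double *ap)
{
    assert(n >= 0 && incx != 0);
    /* Storage offset of the first logical element of x. */
    ptrdiff_t x0 = (incx > 0) ? 0 : (ptrdiff_t)(1 - n) * incx;
    size_t col = 0;                       /* offset of column j inside ap */
    for (int j = 0; j < n; ++j) {
        double xj = x[x0 + (ptrdiff_t)j * incx];
        for (int i = 0; i <= j; ++i)
            ap[col + (size_t)i] += alpha * x[x0 + (ptrdiff_t)i * incx] * xj;
        col += (size_t)j + 1;             /* column j holds j+1 entries */
    }
}
```

With n = 2 and incx = 2, for example, x occupies three elements and only the first and third are read.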
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * x^T.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
AP
double-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2].
Output
A
updated according to A = alpha * x * x^T + A.
Reference: http://www.netlib.org/blas/dspr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDspr2()
void
cublasDspr2 (char uplo, int n, double alpha,
const double *x, int incx, const double *y,
int incy, double *AP)
performs the symmetric rank 2 operation
A = alpha * x * y^T + alpha * y * x^T + A,
where alpha is a double-precision scalar, and x and y are n-element
double-precision vectors. A is a symmetric n×n matrix that consists of
double-precision elements and is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * y^T and y * x^T.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
storage spacing between elements of y; incy must not be zero.
AP
double-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2].
Output
A
updated according to A = alpha * x * y^T + alpha * y * x^T + A.
Reference: http://www.netlib.org/blas/dspr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDsymv()
void
cublasDsymv (char uplo, int n, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are double-precision scalars, and x and y are
n-element double-precision vectors. A is a symmetric n×n matrix that
consists of double-precision elements and is stored in either upper or
lower storage mode.
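In upper storage mode, as accepted by cublasDsymv(), only the upper triangle of the column-major array holds meaningful data, and the lower triangle is inferred by symmetry. The host-side sketch below makes that inference explicit by expanding upper storage into a full symmetric array. The name fill_lower_from_upper is hypothetical and 0-based indexing is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Expand upper storage mode into a full symmetric column-major matrix by
 * copying A(i,j) to A(j,i) for every i < j. Hypothetical helper. */
static void fill_lower_from_upper(int n, double *a, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < j; ++i)
            a[(size_t)j + (size_t)i * (size_t)lda] =
                a[(size_t)i + (size_t)j * (size_t)lda];
}
```

Such an expansion is not needed before calling the symmetric routines; it is mainly useful when the same data is later passed to a routine that expects a general matrix.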
Input
uplo
specifies whether the upper or lower triangular part of the array A is
referenced. If uplo == 'U' or 'u', the symmetric matrix A is stored in
upper storage mode; that is, only the upper triangular part of A is
referenced while the lower triangular part of A is inferred. If uplo ==
'L' or 'l', the symmetric matrix A is stored in lower storage mode;
that is, only the lower triangular part of A is referenced while the upper
triangular part of A is inferred.
n
specifies the number of rows and the number of columns of the
symmetric matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to A * x.
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', the leading n×n upper triangular part of the array A must contain
the upper triangular part of the symmetric matrix, and the strictly
lower triangular part of A is not referenced. If uplo == 'L' or 'l',
the leading n×n lower triangular part of the array A must contain the
lower triangular part of the symmetric matrix, and the strictly upper
triangular part of A is not referenced.
lda
leading dimension of A; lda must be at least max(1, n).
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision scalar multiplier applied to vector y.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/dsymv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDsyr()
void
cublasDsyr (char uplo, int n, double alpha,
const double *x, int incx, double *A,
int lda)
performs the symmetric rank 1 operation
A = alpha * x * x^T + A,
where alpha is a double-precision scalar, x is an n-element
double-precision vector, and A is an n×n symmetric matrix consisting of
double-precision elements. A is stored in column-major format, and
lda is the leading dimension of the two-dimensional array
containing A.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced. If uplo == 'L' or 'l', only the
lower triangular part of A is referenced.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * x^T.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', A contains the upper triangular part of the symmetric matrix, and
the strictly lower triangular part is not referenced. If uplo == 'L' or
'l', A contains the lower triangular part of the symmetric matrix, and
the strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * x^T + A.
Reference: http://www.netlib.org/blas/dsyr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDsyr2()
void
cublasDsyr2 (char uplo, int n, double alpha,
const double *x, int incx, const double *y,
int incy, double *A, int lda)
performs the symmetric rank 2 operation
A = alpha * x * y^T + alpha * y * x^T + A,
where alpha is a double-precision scalar, x and y are n-element
double-precision vectors, and A is an n×n symmetric matrix consisting
of double-precision elements.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced and the lower triangular part of A is
inferred. If uplo == 'L' or 'l', only the lower triangular part of A is
referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * y^T and y * x^T.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
y
double-precision array of length at least (1 + (n - 1) * abs(incy)).
incy
storage spacing between elements of y; incy must not be zero.
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', A contains the upper triangular part of the symmetric matrix, and
the strictly lower triangular part is not referenced. If uplo == 'L' or
'l', A contains the lower triangular part of the symmetric matrix, and
the strictly upper triangular part is not referenced.
lda
leading dimension of A; lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * y^T + alpha * y * x^T + A.
Reference: http://www.netlib.org/blas/dsyr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtbmv()
void
cublasDtbmv (char uplo, char trans, char diag, int n,
int k, const double *A, int lda, double *x,
int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A or op(A) = A^T,
x is an n-element double-precision vector, and A is an n×n, unit or
non-unit, upper or lower, triangular band matrix consisting of
double-precision elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not matrix A is unit triangular. If diag == 'U' or
'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
k
specifies the number of superdiagonals or subdiagonals. If uplo ==
'U' or 'u', k specifies the number of superdiagonals. If uplo == 'L'
or 'l', k specifies the number of subdiagonals; k must be at least zero.
A
double-precision array of dimension (lda, n). If uplo == 'U' or
'u', the leading (k+1)×n part of the array A must contain the upper
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row k+1 of the array, the first superdiagonal
starting at position 2 in row k, and so on. The top left k×k triangle of
the array A is not referenced. If uplo == 'L' or 'l', the leading
(k+1)×n part of the array A must contain the lower triangular band
matrix, supplied column by column, with the leading diagonal of the
matrix in row 1 of the array, the first subdiagonal starting at position 1
in row 2, and so on. The bottom right k×k triangle of the array is not
referenced.
lda
leading dimension of A; lda must be at least k+1.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/dtbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0, k < 0, or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtbsv()
void
cublasDtbsv (char uplo, char trans, char diag, int n,
int k, const double *A, int lda, double *X,
int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element vectors, and A is an n×n, unit or non-unit, upper
or lower, triangular band matrix with k+1 diagonals.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
the number of rows and columns of matrix A; n must be at least zero.
k
specifies the number of superdiagonals or subdiagonals.
If uplo == 'U' or 'u', k specifies the number of superdiagonals. If
uplo == 'L' or 'l', k specifies the number of subdiagonals; k must
be at least zero.
A
double-precision array of dimension (lda, n). If uplo == 'U' or
'u', the leading (k+1)×n part of the array A must contain the upper
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row k+1 of the array, the first superdiagonal
starting at position 2 in row k, and so on. The top left k×k triangle of
the array A is not referenced. If uplo == 'L' or 'l', the leading
(k+1)×n part of the array A must contain the lower triangular band
matrix, supplied column by column, with the leading diagonal of the
matrix in row 1 of the array, the first subdiagonal starting at position
1 in row 2, and so on. The bottom right k×k triangle of the array is not
referenced.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/dtbsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtpmv()
void
cublasDtpmv (char uplo, char trans, char diag, int n,
const double *AP, double *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A or op(A) = A^T,
x is an n-element double-precision vector, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of
double-precision elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not matrix A is unit triangular.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
double-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', the array AP contains the upper triangular
matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', array AP contains the lower triangular
matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2].
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/dtpmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtpsv()
void
cublasDtpsv (char uplo, char trans, char diag, int n,
const double *AP, double *X, int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element vectors, and A is an n×n, unit or non-unit, upper
or lower, triangular matrix.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
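Because cublasDtpsv() performs no singularity test, a caller may want to inspect the diagonal of the packed matrix on the host before solving. The sketch below shows one possible pre-check for upper packed storage. The function name and the tolerance parameter are assumptions made for illustration, and the check is only meaningful when diag is 'N' or 'n', since a unit diagonal is never read.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when every diagonal entry of an upper packed triangular matrix
 * has magnitude above tol, and 0 otherwise. Hypothetical pre-check; the
 * choice of tol is left to the caller. */
static int packed_upper_diag_ok(int n, const double *ap, double tol)
{
    assert(n >= 0 && tol >= 0.0);
    size_t d = 0;                              /* offset of A(0,0) */
    for (int j = 0; j < n; ++j) {
        double mag = (ap[d] < 0.0) ? -ap[d] : ap[d];
        if (!(mag > tol))                      /* also rejects NaN entries */
            return 0;
        d += (size_t)j + 2;                    /* A(j+1,j+1) lies j+2 entries later */
    }
    return 1;
}
```

A diagonal check of this kind detects exact or near-zero pivots only; it says nothing about the conditioning of the matrix as a whole.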
Input
uplo
specifies whether the matrix is an upper or lower triangular matrix. If
uplo == 'U' or 'u', A is an upper triangular matrix. If uplo == 'L'
or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
double-precision array with at least (n * (n + 1)) / 2 elements. If
uplo == 'U' or 'u', array AP contains the upper triangular matrix A,
packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
AP[i + ((2 * n - j + 1) * j) / 2]. When diag == 'U' or 'u', the
diagonal elements of A are not referenced and are assumed to be unity.
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/dtpsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtrmv()
void
cublasDtrmv (char uplo, char trans, char diag, int n,
const double *A, int lda, double *x,
int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A or op(A) = A^T,
x is an n-element double-precision vector, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of
double-precision elements.
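Since cublasDtrmv() overwrites x with op(A) * x, the order in which elements are updated matters. The host-side sketch below covers one case only: upper triangular A, op(A) = A, unit increment, 0-based column-major indexing. The name ref_dtrmv_upper_n is hypothetical and is not a CUBLAS call.

```c
#include <assert.h>
#include <stddef.h>

/* In-place x = A*x for an upper triangular, column-major A.
 * Hypothetical helper for illustration; not a CUBLAS entry point.
 * unit_diag != 0 mirrors diag == 'U': the stored diagonal is not read. */
static void ref_dtrmv_upper_n(int unit_diag, int n,
                              const double *a, int lda, double *x)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    /* Row i of the result needs only x[i..n-1], so rows are processed top
     * to bottom and entries not yet overwritten still hold the source. */
    for (int i = 0; i < n; ++i) {
        double s = unit_diag ? x[i]
                             : a[(size_t)i + (size_t)i * (size_t)lda] * x[i];
        for (int j = i + 1; j < n; ++j)
            s += a[(size_t)i + (size_t)j * (size_t)lda] * x[j];
        x[i] = s;
    }
}
```

For a lower triangular matrix the same idea would apply with the rows processed from bottom to top.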
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', the leading n×n upper triangular part of the array A must contain
the upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading n×n lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
either, but are assumed to be unity.
lda
leading dimension of A; lda must be at least max(1, n).
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/dtrmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasDtrsv()
void
cublasDtrsv (char uplo, char trans, char diag, int n,
const double *A, int lda, double *x,
int incx)
solves a system of equations
op(A) * x = b,
where op(A) = A or op(A) = A^T,
b and x are n-element double-precision vectors, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix consisting of
double-precision elements. Matrix A is stored in column-major format, and
lda is the leading dimension of the two-dimensional array
containing A.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced. If uplo == 'L' or 'l', only
the lower triangular part of A may be referenced.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', A contains the upper triangular part of the triangular matrix, and
the strictly lower triangular part is not referenced. If uplo == 'L' or
'l', A contains the lower triangular part of the triangular matrix, and
the strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
x
double-precision array of length at least (1 + (n - 1) * abs(incx)).
On entry, x contains the n-element, right-hand-side vector b. On exit,
it is overwritten with the solution vector x.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/dtrsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
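The triangular solve that cublasDtrsv() performs can be sketched on the host as plain back substitution. The helper below handles one case only: upper triangular A, op(A) = A, unit increment, 0-based column-major indexing. Its name is hypothetical, and like the library routine it performs no singularity test.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A*x = b in place for an upper triangular, column-major A.
 * On entry x holds b; on exit it holds the solution.
 * Hypothetical helper for illustration; not a CUBLAS entry point.
 * unit_diag != 0 mirrors diag == 'U': the stored diagonal is not read. */
static void ref_dtrsv_upper_n(int unit_diag, int n,
                              const double *a, int lda, double *x)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int i = n - 1; i >= 0; --i) {         /* last row first */
        double s = x[i];
        for (int j = i + 1; j < n; ++j)
            s -= a[(size_t)i + (size_t)j * (size_t)lda] * x[j];
        x[i] = unit_diag ? s : s / a[(size_t)i + (size_t)i * (size_t)lda];
    }
}
```

A zero on the diagonal would lead to a division by zero here, which is the situation the caller is expected to rule out beforehand.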
Double-Precision Complex BLAS2 Functions
Note: Double-precision functions are only supported on GPUs with
double-precision hardware.
The following double-precision complex BLAS2 functions are implemented:
Function cublasZgbmv()
Function cublasZgemv()
Function cublasZgerc()
Function cublasZgeru()
Function cublasZhbmv()
Function cublasZhemv()
Function cublasZher()
Function cublasZher2()
Function cublasZhpmv()
Function cublasZhpr()
Function cublasZhpr2()
Function cublasZtbmv()
Function cublasZtbsv()
Function cublasZtpmv()
Function cublasZtpsv()
Function cublasZtrmv()
Function cublasZtrsv()
Funct ion cublasZgbmv( )
void
cublasZgbmv (char trans, int m, int n, int kl, int ku,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
cuDoubleComplex beta, cuDoubleComplex *y,
int incy)
performsoneofthematrixvectoroperations
alphaandbetaaredoubleprecisioncomplexscalars,andxandyare
doubleprecisioncomplexvectors.Aisan mnbandmatrixconsisting
ofdoubleprecisioncomplexelementswithklsubdiagonalsandku
superdiagonals.
,where
, ,or ;
I nput
trans
specifies op(A). If trans == 'N' or 'n', .
If trans == 'T' or 't', .
If trans == 'C', or 'c', .
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
kl
specifies the number of subdiagonals of matrix A; kl must be at least
zero.
ku
specifies the number of superdiagonals of matrix A; ku must be at
least zero.
alpha
double-precision complex scalar multiplier applied to op(A).
A
double-precision complex array of dimensions (lda, n). The leading
(kl+ku+1)n part of the array A must contain the band matrix A,
supplied column by column, with the leading diagonal of the matrix in
row (ku+1) of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row (ku+2), and
so on. Elements in the array A that do not correspond to elements in
the band matrix (such as the top left kuku triangle) are not
referenced.
lda
leading dimension A; lda must be at least (kl+ku+1).
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)) if trans == 'N' or 'n', else at least
(1 + (m - 1) * abs(incx)).
incx
specifies the increment for the elements of x; incx must not be zero.
beta
double-precision complex scalar multiplier applied to vector y. If beta
is zero, y is not read.
y
double-precision complex array of length at least
(1 + (m - 1) * abs(incy)) if trans == 'N' or 'n', else at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
on entry, incy specifies the increment for the elements of y; incy
must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/zgbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
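The band layout that cublasZgbmv() expects for A can be restated as an index mapping. The following is a host-only sketch in C, not CUBLAS code: band_offset is a hypothetical helper name, and it assumes zero-based row and column indices, so the manual's row (ku+1) becomes row ku.

```c
#include <assert.h>

/* Hypothetical helper (not part of CUBLAS). Returns the element offset of
 * band-matrix entry A(i, j) inside the column-major band array, or -1 when
 * (i, j) lies outside the band. i and j are zero-based, which is an
 * assumption of this sketch; the manual text counts rows from 1. */
long band_offset(int i, int j, int kl, int ku, int lda)
{
    assert(kl >= 0 && ku >= 0 && lda >= kl + ku + 1);
    if (i - j > kl || j - i > ku)
        return -1;                          /* entry is not stored */
    return (long)(ku + i - j) + (long)j * lda;
}
```

A caller filling the array on the host would write each in-band entry A(i, j) to the returned offset and leave all other positions untouched, since they are not referenced.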
Function cublasZgemv()
void
cublasZgemv (char trans, int m, int n,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
cuDoubleComplex beta, cuDoubleComplex *y,
int incy)
performs one of the matrix-vector operations
y = alpha * op(A) * x + beta * y,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
alpha and beta are double-precision complex scalars, and x and y are
double-precision complex vectors. A is an m×n matrix consisting of
double-precision complex elements. Matrix A is stored in column-major
format, and lda is the leading dimension of the two-dimensional array
in which A is stored.
Input
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to op(A).
A
double-precision complex array of dimensions (lda, n) if trans ==
'N' or 'n', of dimensions (lda, m) otherwise; lda must be at least
max(1, m) if trans == 'N' or 'n' and at least max(1, n) otherwise.
lda
leading dimension of two-dimensional array used to store matrix A.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)) if trans == 'N' or 'n', else at least
(1 + (m - 1) * abs(incx)).
incx
specifies the storage spacing for elements of x; incx must not be zero.
beta
double-precision complex scalar multiplier applied to vector y. If beta
is zero, y is not read.
y
double-precision complex array of length at least
(1 + (m - 1) * abs(incy)) if trans == 'N' or 'n', else at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * op(A) * x + beta * y.
Reference: http://www.netlib.org/blas/zgemv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZgerc()
void
cublasZgerc (int m, int n, cuDoubleComplex alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
performs the rank 1 operation
A = alpha * x * y^H + A,
where alpha is a double-precision complex scalar, x is an m-element
double-precision complex vector, y is an n-element double-precision
complex vector, and A is an m×n matrix consisting of double-precision
complex elements. Matrix A is stored in column-major format, and lda
is the leading dimension of the two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to x * y^H.
x
double-precision complex array of length at least
(1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
double-precision complex array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^H + A.
Reference: http://www.netlib.org/blas/zgerc.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZgeru()
void
cublasZgeru (int m, int n, cuDoubleComplex alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
performs the rank 1 operation
A = alpha * x * y^T + A,
where alpha is a double-precision complex scalar, x is an m-element
double-precision complex vector, y is an n-element double-precision
complex vector, and A is an m×n matrix consisting of double-precision
complex elements. Matrix A is stored in column-major format, and lda
is the leading dimension of the two-dimensional array used to store A.
Input
m
specifies the number of rows of the matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to x * y^T.
x
double-precision complex array of length at least
(1 + (m - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
double-precision complex array of dimensions (lda, n).
lda
leading dimension of two-dimensional array used to store matrix A.
Output
A
updated according to A = alpha * x * y^T + A.
Reference: http://www.netlib.org/blas/zgeru.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
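The only difference between cublasZgerc() and cublasZgeru() is whether y is conjugated before the outer product is formed. A host-only reference loop makes this concrete. This is a sketch rather than the library's implementation: rank1_update_ref is a hypothetical name, C99 double complex stands in for cuDoubleComplex, and x and y are assumed to have unit stride (incx == incy == 1).

```c
#include <assert.h>
#include <complex.h>

/* Host-only reference, not CUBLAS code. A is column-major with leading
 * dimension lda. A nonzero conjugate_y computes A += alpha * x * y^H
 * (the cublasZgerc operation); zero computes A += alpha * x * y^T
 * (the cublasZgeru operation). Unit strides are assumed for brevity. */
void rank1_update_ref(int m, int n, double complex alpha,
                      const double complex *x, const double complex *y,
                      double complex *A, int lda, int conjugate_y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 1 ? m : 1));
    for (int j = 0; j < n; ++j) {
        double complex yj = conjugate_y ? conj(y[j]) : y[j];
        for (int i = 0; i < m; ++i)
            A[i + j * lda] += alpha * x[i] * yj;
    }
}
```

A loop like this can serve as a small-size check against results copied back from the device.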
Function cublasZhbmv()
void
cublasZhbmv (char uplo, int n, int k,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
cuDoubleComplex beta, cuDoubleComplex *y,
int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are double-precision complex scalars, and x
and y are n-element double-precision complex vectors. A is a
Hermitian n×n band matrix that consists of double-precision complex
elements, with k superdiagonals and the same number of
subdiagonals.
Input
uplo
specifies whether the upper or lower triangular part of the Hermitian
band matrix A is being supplied. If uplo == 'U' or 'u', the upper
triangular part is being supplied. If uplo == 'L' or 'l', the lower
triangular part is being supplied.
n
specifies the number of rows and the number of columns of the
Hermitian matrix A; n must be at least zero.
k
specifies the number of superdiagonals of matrix A. Since the matrix is
Hermitian, this is also the number of subdiagonals; k must be at least
zero.
alpha
double-precision complex scalar multiplier applied to A * x.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading (k + 1)×n part of array A must contain the
upper triangular band of the Hermitian matrix, supplied column by
column, with the leading diagonal of the matrix in row k + 1 of the
array, the first superdiagonal starting at position 2 in row k, and so on.
The top left k×k triangle of array A is not referenced. When uplo ==
'L' or 'l', the leading (k + 1)×n part of array A must contain the
lower triangular band part of the Hermitian matrix, supplied column
by column, with the leading diagonal of the matrix in row 1 of the
array, the first subdiagonal starting at position 1 in row 2, and so on.
The bottom right k×k triangle of array A is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero.
lda
leading dimension of A; lda must be at least k + 1.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision complex scalar multiplier applied to vector y.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/zhbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if k < 0, n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
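For uplo == 'U', the band array that cublasZhbmv() reads can be filled from a dense Hermitian matrix on the host before it is copied to the device. The routine below is a sketch under stated assumptions: pack_hermitian_band_upper is a hypothetical name, C99 double complex stands in for cuDoubleComplex, indices are zero-based (so the diagonal sits in row k rather than row k + 1), and dense is a column-major n×n array with leading dimension ldd.

```c
#include <assert.h>
#include <complex.h>

/* Host-only sketch, not CUBLAS code. Copies the upper triangular band
 * (k superdiagonals) of the dense column-major matrix into band storage.
 * Positions of band[] outside the band, such as the top left k-by-k
 * triangle, are left untouched because they are not referenced. */
void pack_hermitian_band_upper(int n, int k,
                               const double complex *dense, int ldd,
                               double complex *band, int lda)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1 && ldd >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        int first = (j > k) ? j - k : 0;    /* first row inside the band */
        for (int i = first; i <= j; ++i)
            band[(k + i - j) + j * lda] = dense[i + j * ldd];
    }
}
```

The lower-mode layout follows the same pattern with the diagonal in row 0 and entry (i, j), i >= j, in row i - j of column j.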
Function cublasZhemv()
void
cublasZhemv (char uplo, int n, cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *x, int incx,
cuDoubleComplex beta, cuDoubleComplex *y,
int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are double-precision complex scalars, and x
and y are n-element double-precision complex vectors. A is a
Hermitian n×n matrix that consists of double-precision complex
elements and is stored in either upper or lower storage mode.
Input
uplo
specifies whether the upper or lower triangular part of the array A is
referenced. If uplo == 'U' or 'u', the Hermitian matrix A is stored in
upper storage mode; that is, only the upper triangular part of A is
referenced while the lower triangular part of A is inferred. If uplo ==
'L' or 'l', the Hermitian matrix A is stored in lower storage mode;
that is, only the lower triangular part of A is referenced while the upper
triangular part of A is inferred.
n
specifies the number of rows and the number of columns of the
Hermitian matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to A * x.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading n×n upper triangular part of the array A must
contain the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced. If uplo == 'L' or
'l', the leading n×n lower triangular part of the array A must contain
the lower triangular part of the Hermitian matrix, and the strictly
upper triangular part of A is not referenced. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero.
lda
leading dimension of A; lda must be at least max(1, n).
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision complex scalar multiplier applied to vector y.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/zhemv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZher()
void
cublasZher (char uplo, int n, double alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *A, int lda)
performs the Hermitian rank 1 operation
A = alpha * x * x^H + A,
where alpha is a double-precision scalar, x is an n-element
double-precision complex vector, and A is an n×n Hermitian matrix
consisting of double-precision complex elements. A is stored in
column-major format, and lda is the leading dimension of the
two-dimensional array containing A.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A is referenced. If uplo == 'L' or 'l', only the
lower triangular part of A is referenced.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * x^H.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the Hermitian
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the Hermitian
matrix, and the strictly upper triangular part is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * x^H + A.
Reference: http://www.netlib.org/blas/zher.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
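The treatment of uplo and of the diagonal in cublasZher() can be illustrated with a host-only reference loop. This is a sketch, not the library's implementation: her_ref is a hypothetical name, C99 double complex stands in for cuDoubleComplex, and x is assumed to have unit stride. Only the selected triangle is touched, and each diagonal entry is written back as a purely real value, matching the statement that the imaginary parts are assumed zero on entry and set to zero on exit.

```c
#include <assert.h>
#include <complex.h>

/* Host-only reference for A = alpha * x * x^H + A (not CUBLAS code).
 * A is column-major with leading dimension lda; x has unit stride. */
void her_ref(char uplo, int n, double alpha,
             const double complex *x, double complex *A, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int upper = (uplo == 'U' || uplo == 'u');
    for (int j = 0; j < n; ++j) {
        /* Strictly upper rows 0..j-1, or strictly lower rows j+1..n-1. */
        int first = upper ? 0 : j + 1;
        int last  = upper ? j - 1 : n - 1;
        for (int i = first; i <= last; ++i)
            A[i + j * lda] += alpha * x[i] * conj(x[j]);
        /* Diagonal: x[j] * conj(x[j]) is real, so store a real value. */
        double xr = creal(x[j]), xi = cimag(x[j]);
        A[j + j * lda] = creal(A[j + j * lda]) + alpha * (xr * xr + xi * xi);
    }
}
```

Entries in the opposite triangle keep whatever values they held before the call.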
Function cublasZher2()
void
cublasZher2 (char uplo, int n, cuDoubleComplex alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
performs the Hermitian rank 2 operation
A = alpha * x * y^H + conj(alpha) * y * x^H + A,
where alpha is a double-precision complex scalar, x and y are
n-element double-precision complex vectors, and A is an n×n Hermitian
matrix consisting of double-precision complex elements.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to x * y^H
and whose conjugate is applied to y * x^H.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the Hermitian
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the Hermitian
matrix, and the strictly upper triangular part is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
lda
leading dimension of A; lda must be at least max(1, n).
Output
A
updated according to A = alpha * x * y^H + conj(alpha) * y * x^H + A.
Reference: http://www.netlib.org/blas/zher2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZhpmv()
void
cublasZhpmv (char uplo, int n, cuDoubleComplex alpha,
const cuDoubleComplex *AP,
const cuDoubleComplex *x, int incx,
cuDoubleComplex beta,
cuDoubleComplex *y, int incy)
performs the matrix-vector operation
y = alpha * A * x + beta * y,
where alpha and beta are double-precision complex scalars, and x
and y are n-element double-precision complex vectors. A is a
Hermitian n×n matrix that consists of double-precision complex
elements and is supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to A * x.
AP
double-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
beta
double-precision complex scalar multiplier applied to vector y.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)). If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Output
y
updated according to y = alpha * A * x + beta * y.
Reference: http://www.netlib.org/blas/zhpmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
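The upper-mode packed layout used by cublasZhpmv() can be sketched as a pair of host-side helpers. These are illustrative only and assume zero-based indices: packed_size, packed_upper_index, and pack_upper are hypothetical names, C99 double complex stands in for cuDoubleComplex, and dense is a column-major n×n array with leading dimension ldd. Only uplo == 'U' is shown.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Number of elements in the packed array for an n-by-n matrix. */
size_t packed_size(int n)
{
    assert(n >= 0);
    return (size_t)n * (size_t)(n + 1) / 2;
}

/* Position of A[i,j], i <= j, in upper packed storage (zero-based). */
size_t packed_upper_index(int i, int j)
{
    assert(0 <= i && i <= j);
    return (size_t)i + (size_t)j * (size_t)(j + 1) / 2;
}

/* Copies the upper triangle of a dense column-major matrix into AP. */
void pack_upper(int n, const double complex *dense, int ldd,
                double complex *AP)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i <= j; ++i)
            AP[packed_upper_index(i, j)] = dense[i + j * ldd];
}
```

Column j of the upper triangle holds j + 1 entries, so the columns before it occupy j * (j + 1) / 2 positions, which is where the index expression comes from.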
Function cublasZhpr()
void
cublasZhpr (char uplo, int n, double alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *AP)
performs the Hermitian rank 1 operation
A = alpha * x * x^H + A,
where alpha is a double-precision scalar, x is an n-element
double-precision complex vector, and A is an n×n Hermitian matrix
consisting of double-precision complex elements that is supplied in
packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision scalar multiplier applied to x * x^H.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
AP
double-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
Output
A
updated according to A = alpha * x * x^H + A.
Reference: http://www.netlib.org/blas/zhpr.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or incx == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZhpr2()
void
cublasZhpr2 (char uplo, int n, cuDoubleComplex alpha,
const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *AP)
performs the Hermitian rank 2 operation
A = alpha * x * y^H + conj(alpha) * y * x^H + A,
where alpha is a double-precision complex scalar, x and y are
n-element double-precision complex vectors, and A is an n×n Hermitian
matrix consisting of double-precision complex elements that is
supplied in packed form.
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to x * y^H
and whose conjugate is applied to y * x^H.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
the storage spacing between elements of x; incx must not be zero.
y
double-precision complex array of length at least
(1 + (n - 1) * abs(incy)).
incy
the storage spacing between elements of y; incy must not be zero.
AP
double-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If uplo ==
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2]. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero, and on exit they are set to zero.
Output
A
updated according to A = alpha * x * y^H + conj(alpha) * y * x^H + A.
Reference: http://www.netlib.org/blas/zhpr2.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0, incx == 0, or incy == 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZtbmv()
void
cublasZtbmv (char uplo, char trans, char diag, int n,
int k, const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element double-precision complex vector, and A is an n×n,
unit or non-unit, upper or lower, triangular band matrix consisting of
double-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not matrix A is unit triangular. If diag == 'U' or
'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
k
specifies the number of superdiagonals or subdiagonals. If uplo ==
'U' or 'u', k specifies the number of superdiagonals. If uplo == 'L'
or 'l', k specifies the number of subdiagonals; k must be at least zero.
A
double-precision complex array of dimension (lda, n). If uplo ==
'U' or 'u', the leading (k+1)×n part of the array A must contain the
upper triangular band matrix, supplied column by column, with the
leading diagonal of the matrix in row k+1 of the array, the first
superdiagonal starting at position 2 in row k, and so on. The top left
k×k triangle of the array A is not referenced. If uplo == 'L' or 'l',
the leading (k+1)×n part of the array A must contain the lower
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row 1 of the array, the first subdiagonal
starting at position 1 in row 2, and so on. The bottom right k×k
triangle of the array is not referenced.
lda
is the leading dimension of A; lda must be at least k+1.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ztbmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0, k < 0, or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough internal scratch vector memory
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZtbsv()
void
cublasZtbsv (char uplo, char trans, char diag, int n,
int k, const cuDoubleComplex *A, int lda,
cuDoubleComplex *X, int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element vectors, and A is an n×n, unit or non-unit, upper
or lower, triangular band matrix with k+1 diagonals.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix is an upper or lower triangular band
matrix: If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
the number of rows and columns of matrix A; n must be at least zero.
k
specifies the number of superdiagonals or subdiagonals.
If uplo == 'U' or 'u', k specifies the number of superdiagonals. If
uplo == 'L' or 'l', k specifies the number of subdiagonals; k must
be at least zero.
A
double-precision complex array of dimension (lda, n). If uplo ==
'U' or 'u', the leading (k+1)×n part of the array A must contain the
upper triangular band matrix, supplied column by column, with the
leading diagonal of the matrix in row k+1 of the array, the first
superdiagonal starting at position 2 in row k, and so on. The top left
k×k triangle of the array A is not referenced. If uplo == 'L' or 'l',
the leading (k+1)×n part of the array A must contain the lower
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row 1 of the array, the first subdiagonal
starting at position 1 in row 2, and so on. The bottom right k×k
triangle of the array is not referenced.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ztbsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
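Because cublasZtbsv() performs no singularity test, a caller working with a non-unit triangular band matrix may want to inspect the diagonal on the host before the solve. The check below is a sketch under stated assumptions: band_triangular_has_zero_diagonal is a hypothetical name, C99 double complex stands in for cuDoubleComplex, A is a host copy of the band array, and rows are zero-based (diagonal in row k for upper storage, row 0 for lower storage). It detects exact zeros only; a near-singularity test needs a problem-specific threshold that this sketch does not attempt to choose.

```c
#include <assert.h>
#include <complex.h>

/* Host-only sketch, not CUBLAS code. Returns 1 if any diagonal entry of
 * the triangular band matrix stored in A is exactly zero, else 0.
 * Only meaningful when diag == 'N'; with diag == 'U' the diagonal is
 * not read by the solver and is assumed to be unity. */
int band_triangular_has_zero_diagonal(char uplo, int n, int k,
                                      const double complex *A, int lda)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    int diag_row = (uplo == 'U' || uplo == 'u') ? k : 0;
    for (int j = 0; j < n; ++j)
        if (A[diag_row + j * lda] == 0.0)
            return 1;
    return 0;
}
```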
Function cublasZtpmv()
void
cublasZtpmv (char uplo, char trans, char diag, int n,
const cuDoubleComplex *AP,
cuDoubleComplex *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element double-precision complex vector, and A is an n×n,
unit or non-unit, upper or lower, triangular matrix consisting of
double-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not matrix A is unit triangular.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
double-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', the array AP contains the upper triangular part
of the triangular matrix A, packed sequentially, column by column;
that is, if i <= j, A[i,j] is stored in AP[i + (j * (j + 1)) / 2]. If
uplo == 'L' or 'l', array AP contains the lower triangular part of the
triangular matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in AP[i + ((2 * n - j + 1) * j) / 2].
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the source vector.
On exit, x is overwritten with the result vector.
incx
specifies the storage spacing for elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ztpmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough internal scratch vector memory
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZtpsv()
void
cublasZtpsv (char uplo, char trans, char diag, int n,
const cuDoubleComplex *AP,
cuDoubleComplex *X, int incx)
solves one of the systems of equations
op(A) * x = b,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element complex vectors, and A is an n×n, unit or
non-unit, upper or lower, triangular matrix.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
Input
uplo
specifies whether the matrix is an upper or lower triangular matrix. If
uplo == 'U' or 'u', A is an upper triangular matrix. If uplo == 'L'
or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', .
If trans == 'T' or 't', .
If trans == 'C', or 'c', .
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
AP
double-precision complex array with at least (n * (n + 1)) / 2 elements.
If uplo == 'U' or 'u', array AP contains the upper triangular matrix
A, packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in AP[i + (j * (j + 1)) / 2]. If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
AP[i + ((2 * n - j + 1) * j) / 2]. When diag == 'U' or 'u', the
diagonal elements of A are not referenced and are assumed to be unity.
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)).
incx
storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ztpsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
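Because cublasZtpsv() performs no singularity test, the caller has to inspect the matrix first. The following host-side sketch shows one possible check for exactly zero diagonal entries in a packed array. The helper name packed_diag_has_zero is hypothetical, the C99 type double complex stands in for cuDoubleComplex, and a test for near-singularity would need an application-specific threshold that is not shown here.

```c
#include <complex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any diagonal entry of the packed triangular matrix
 * AP is exactly zero. The columns are walked in storage order, so no
 * closed-form index formula is needed. */
static bool packed_diag_has_zero(char uplo, int n, const double complex *AP)
{
    bool upper = (uplo == 'U' || uplo == 'u');
    size_t pos = 0; /* offset of the first stored element of column j */

    for (int j = 0; j < n; ++j) {
        /* Upper: column j stores rows 0..j, diagonal entry last.
         * Lower: column j stores rows j..n-1, diagonal entry first. */
        size_t len = upper ? (size_t)j + 1 : (size_t)(n - j);
        size_t diag = upper ? pos + len - 1 : pos;

        if (creal(AP[diag]) == 0.0 && cimag(AP[diag]) == 0.0)
            return true;
        pos += len;
    }
    return false;
}
```

When diag == 'U' or 'u' the diagonal is not referenced by the solver, so a check of this kind only matters for the non-unit case.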
Function cublasZtrmv()
void
cublasZtrmv (char uplo, char trans, char diag, int n,
const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
performs one of the matrix-vector operations
x = op(A) * x,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
x is an n-element double-precision complex vector; and A is an n×n,
unit or non-unit, upper or lower, triangular matrix consisting of
double-precision complex elements.
Input
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading n×n upper triangular part of the array A must
contain the upper triangular matrix, and the strictly lower triangular
part of A is not referenced. If uplo == 'L' or 'l', the leading n×n
lower triangular part of the array A must contain the lower triangular
matrix, and the strictly upper triangular part of A is not referenced.
When diag == 'U' or 'u', the diagonal elements of A are not
referenced either, but are assumed to be unity.
lda
leading dimension of A; lda must be at least max(1, n).
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the source vector.
On exit, x is overwritten with the result vector.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated according to x = op(A) * x.
Reference: http://www.netlib.org/blas/ztrmv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
CUBLAS_STATUS_ALLOC_FAILED
if function cannot allocate enough
internal scratch vector memory
Function cublasZtrsv()
void
cublasZtrsv (char uplo, char trans, char diag, int n,
const cuDoubleComplex *A, int lda,
cuDoubleComplex *x, int incx)
solves a system of equations
op(A) * x = b,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
b and x are n-element double-precision complex vectors, and A is an
n×n, unit or non-unit, upper or lower, triangular matrix consisting of
double-precision complex elements. Matrix A is stored in column-major
format, and lda is the leading dimension of the two-dimensional array
containing A.
No test for singularity or near-singularity is included in this function.
Such tests must be performed before calling this function.
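Two addressing rules recur throughout these function descriptions: column-major storage with a leading dimension lda, and the minimum length 1 + (n - 1) * abs(incx) required of a strided vector x. A small host-side sketch of both follows. The helper names colmajor_index and min_vector_length are hypothetical, and indices are assumed to be 0-based.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Offset of element A[i,j] inside a column-major array whose
 * consecutive columns are lda elements apart. */
static size_t colmajor_index(int i, int j, int lda)
{
    assert(i >= 0 && j >= 0 && lda > i);
    return (size_t)i + (size_t)j * (size_t)lda;
}

/* Smallest array length that can hold n elements spaced incx apart:
 * 1 + (n - 1) * abs(incx). */
static size_t min_vector_length(int n, int incx)
{
    assert(n >= 1 && incx != 0);
    return 1 + (size_t)(n - 1) * (size_t)abs(incx);
}
```

For example, a 5-element vector with incx == -2 needs an array of at least 9 elements, and element A[1,2] of a matrix with lda == 4 lies at offset 9.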
Input
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced. If uplo == 'L' or 'l', only
the lower triangular part of A may be referenced.
trans
specifies op(A). If trans == 'N' or 'n', op(A) = A.
If trans == 'T' or 't', op(A) = A^T.
If trans == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', A contains the upper triangular part of the triangular
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the triangular
matrix, and the strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
x
double-precision complex array of length at least
(1 + (n - 1) * abs(incx)). On entry, x contains the n-element
right-hand-side vector b. On exit, it is overwritten with the solution
vector x.
incx
the storage spacing between elements of x; incx must not be zero.
Output
x
updated to contain the solution vector x that solves op(A) * x = b.
Reference: http://www.netlib.org/blas/ztrsv.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if incx == 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
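To make the in-place semantics of the triangular solve concrete, here is a host-side reference sketch for a single case: uplo == 'U', trans == 'N', diag == 'N', and positive incx. It is an illustration under those assumptions, not the CUBLAS implementation. The function name ref_trsv_upper_notrans is hypothetical, and the C99 type double complex stands in for cuDoubleComplex.

```c
#include <complex.h>
#include <stddef.h>

/* Solves A * x = b by back substitution. A is upper triangular,
 * non-unit, column-major with leading dimension lda. On entry x holds
 * b; on exit it holds the solution. Assumes incx > 0 and a nonzero
 * diagonal (no singularity test is made). */
static void ref_trsv_upper_notrans(int n, const double complex *A, int lda,
                                   double complex *x, int incx)
{
    for (int j = n - 1; j >= 0; --j) {
        double complex *xj = &x[(size_t)j * (size_t)incx];

        /* x[j] is final once divided by the diagonal entry A[j,j]. */
        *xj /= A[(size_t)j + (size_t)j * (size_t)lda];

        /* Eliminate x[j] from the remaining rows 0..j-1. */
        for (int i = 0; i < j; ++i)
            x[(size_t)i * (size_t)incx] -=
                *xj * A[(size_t)i + (size_t)j * (size_t)lda];
    }
}
```

Only the upper triangle of A is read, which matches the statement that the strictly lower triangular part is not referenced.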
CHAPTER 5
BLAS3 Functions
Level 3 Basic Linear Algebra Subprograms (BLAS3) perform
matrix-matrix operations. The CUBLAS implementations are described in
the following sections:
Single-Precision BLAS3 Functions
Single-Precision Complex BLAS3 Functions
Double-Precision BLAS3 Functions
Double-Precision Complex BLAS3 Functions
Single-Precision BLAS3 Functions
The single-precision BLAS3 functions are listed below:
Function cublasSgemm()
Function cublasSsymm()
Function cublasSsyrk()
Function cublasSsyr2k()
Function cublasStrmm()
Function cublasStrsm()
Function cublasSgemm()
void
cublasSgemm (char transa, char transb, int m, int n,
int k, float alpha, const float *A, int lda,
const float *B, int ldb, float beta,
float *C, int ldc)
computes the product of matrix A and matrix B, multiplies the result
by scalar alpha, and adds the result to the product of matrix C and
scalar beta. It performs one of the matrix-matrix operations:
C = alpha * op(A) * op(B) + beta * C,
where op(X) = X or op(X) = X^T,
and alpha and beta are single-precision scalars. A, B, and C are
matrices consisting of single-precision elements, with op(A) an m×k
matrix, op(B) a k×n matrix, and C an m×n matrix. Matrices A, B, and C
are stored in column-major format, and lda, ldb, and ldc are the
leading dimensions of the two-dimensional arrays containing A, B,
and C.
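The operation C = alpha * op(A) * op(B) + beta * C can be written out as a plain triple loop. The following host-side reference sketch is meant only to make the roles of transa, transb, and the leading dimensions explicit. The names ref_sgemm, is_trans, and op_elem are hypothetical, and the code makes no claim about how CUBLAS computes the result on the GPU.

```c
#include <stddef.h>

/* Any value other than 'N' or 'n' selects the transposed operand. */
static int is_trans(char t)
{
    return !(t == 'N' || t == 'n');
}

/* Element (i, j) of op(X) for a column-major array X with leading
 * dimension ldx; for the transposed case this is X[j, i]. */
static float op_elem(const float *X, int ldx, int trans, int i, int j)
{
    return trans ? X[(size_t)j + (size_t)i * (size_t)ldx]
                 : X[(size_t)i + (size_t)j * (size_t)ldx];
}

static void ref_sgemm(char transa, char transb, int m, int n, int k,
                      float alpha, const float *A, int lda,
                      const float *B, int ldb, float beta,
                      float *C, int ldc)
{
    int ta = is_trans(transa);
    int tb = is_trans(transb);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += op_elem(A, lda, ta, i, p) * op_elem(B, ldb, tb, p, j);

            float *c = &C[(size_t)i + (size_t)j * (size_t)ldc];
            /* With beta == 0 the old contents of C are never read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```

A reference of this kind is mainly useful for checking small GPU results on the host.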
Input
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
transb
specifies op(B). If transb == 'N' or 'n', op(B) = B.
If transb == 'T', 't', 'C', or 'c', op(B) = B^T.
m
number of rows of matrix op(A) and rows of matrix C; m must be at
least zero.
n
number of columns of matrix op(B) and number of columns of C;
n must be at least zero.
k
number of columns of matrix op(A) and number of rows of op(B);
k must be at least zero.
alpha
single-precision scalar multiplier applied to op(A) * op(B).
A
single-precision array of dimensions (lda, k) if transa == 'N' or
'n', and of dimensions (lda, m) otherwise. If transa == 'N' or
'n', lda must be at least max(1, m); otherwise, lda must be at least
max(1, k).
lda
leading dimension of two-dimensional array used to store matrix A.
B
single-precision array of dimensions (ldb, n) if transb == 'N' or
'n', and of dimensions (ldb, k) otherwise. If transb == 'N' or
'n', ldb must be at least max(1, k); otherwise, ldb must be at least
max(1, n).
ldb
leading dimension of two-dimensional array used to store matrix B.
beta
single-precision scalar multiplier applied to C. If zero, C does not have
to be a valid input.
C
single-precision array of dimensions (ldc, n); ldc must be at least
max(1, m).
ldc
leading dimension of two-dimensional array used to store matrix C.
Output
C
updated based on C = alpha * op(A) * op(B) + beta * C.
Reference: http://www.netlib.org/blas/sgemm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsymm()
void
cublasSsymm (char side, char uplo, int m, int n,
float alpha, const float *A, int lda,
const float *B, int ldb, float beta,
float *C, int ldc)
performs one of the matrix-matrix operations
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are single-precision scalars, A is a symmetric
matrix consisting of single-precision elements and is stored in either
lower or upper storage mode. B and C are m×n matrices consisting of
single-precision elements.
Input
side
specifies whether the symmetric matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', C = alpha * A * B + beta * C.
If side == 'R' or 'r', C = alpha * B * A + beta * C.
uplo
specifies whether the symmetric matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of symmetric matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of symmetric
matrix A when side == 'R' or 'r'; n must be at least zero.
alpha
single-precision scalar multiplier applied to A * B or B * A.
A
single-precision array of dimensions (lda, ka), where ka is m when
side == 'L' or 'l' and is n otherwise. If side == 'L' or 'l', the
leading m×m part of array A must contain the symmetric matrix such
that when uplo == 'U' or 'u', the leading m×m part stores the upper
triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced; and when uplo == 'L' or 'l',
the leading m×m part stores the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced. If
side == 'R' or 'r', the leading n×n part of array A must contain the
symmetric matrix such that when uplo == 'U' or 'u', the leading
n×n part stores the upper triangular part of the symmetric matrix, and
the strictly lower triangular part of A is not referenced; and when
uplo == 'L' or 'l', the leading n×n part stores the lower triangular
part of the symmetric matrix, and the strictly upper triangular part is
not referenced.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
B
single-precision array of dimensions (ldb, n). On entry, the leading
m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
single-precision scalar multiplier applied to C. If beta is zero, C does
not have to be a valid input.
C
single-precision array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * A * B + beta * C or
C = alpha * B * A + beta * C.
Reference: http://www.netlib.org/blas/ssymm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsyrk()
void
cublasSsyrk (char uplo, char trans, int n, int k,
float alpha, const float *A, int lda,
float beta, float *C, int ldc)
performs one of the symmetric rank-k operations
C = alpha * A * A^T + beta * C or C = alpha * A^T * A + beta * C,
where alpha and beta are single-precision scalars. C is an n×n
symmetric matrix consisting of single-precision elements and is stored
in either lower or upper storage mode. A is a matrix consisting of
single-precision elements with dimensions of n×k in the first case and
k×n in the second case.
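The symmetric rank-k update only touches the triangle of C selected by uplo. A host-side reference sketch of that behavior is given below. The name ref_ssyrk is hypothetical, indices are 0-based, and the loop structure is chosen for readability rather than to mirror the CUBLAS kernels.

```c
#include <stddef.h>

/* Reference for C = alpha * A * A^T + beta * C (trans == 'N' or 'n')
 * or C = alpha * A^T * A + beta * C (any other trans value). Only the
 * triangle of C selected by uplo is read or written. */
static void ref_ssyrk(char uplo, char trans, int n, int k, float alpha,
                      const float *A, int lda, float beta,
                      float *C, int ldc)
{
    int upper = (uplo == 'U' || uplo == 'u');
    int notrans = (trans == 'N' || trans == 'n');

    for (int j = 0; j < n; ++j) {
        int ibeg = upper ? 0 : j;     /* first row of column j to update */
        int iend = upper ? j : n - 1; /* last row of column j to update */

        for (int i = ibeg; i <= iend; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* notrans: A is n x k, term is A[i,p] * A[j,p].
                 * trans:   A is k x n, term is A[p,i] * A[p,j]. */
                float a = notrans ? A[(size_t)i + (size_t)p * (size_t)lda]
                                  : A[(size_t)p + (size_t)i * (size_t)lda];
                float b = notrans ? A[(size_t)j + (size_t)p * (size_t)lda]
                                  : A[(size_t)p + (size_t)j * (size_t)lda];
                acc += a * b;
            }
            float *c = &C[(size_t)i + (size_t)j * (size_t)ldc];
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```

Elements in the opposite strict triangle of C keep whatever value they had before the call.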
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * A^T + beta * C. If trans == 'T', 't', 'C', or 'c',
C = alpha * A^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
single-precision scalar multiplier applied to A * A^T or A^T * A.
A
single-precision array of dimensions (lda, ka), where ka is k when
trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A contains the matrix A; otherwise,
the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
single-precision scalar multiplier applied to C.
If beta is zero, C is not read.
C
single-precision array of dimensions (ldc, n). If uplo == 'U' or 'u',
the leading n×n triangular part of the array C must contain the upper
triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * A^T + beta * C or
C = alpha * A^T * A + beta * C.
Reference: http://www.netlib.org/blas/ssyrk.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasSsyr2k()
void
cublasSsyr2k (char uplo, char trans, int n, int k,
float alpha, const float *A, int lda,
const float *B, int ldb, float beta,
float *C, int ldc)
performs one of the symmetric rank-2k operations
C = alpha * A * B^T + alpha * B * A^T + beta * C or
C = alpha * A^T * B + alpha * B^T * A + beta * C,
where alpha and beta are single-precision scalars. C is an n×n
symmetric matrix consisting of single-precision elements and is stored
in either lower or upper storage mode. A and B are matrices consisting
of single-precision elements with dimensions of n×k in the first case and
k×n in the second case.
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * B^T + alpha * B * A^T + beta * C. If trans == 'T',
't', 'C', or 'c', C = alpha * A^T * B + alpha * B^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of matrix
A. If trans == 'T', 't', 'C', or 'c', k specifies the number of rows
of matrix A; k must be at least zero.
alpha
single-precision scalar multiplier.
A
single-precision array of dimensions (lda, ka), where ka is k when
trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A must contain the matrix A,
otherwise the leading k×n part of the array must contain the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
single-precision array of dimensions (ldb, kb), where kb is k when
trans == 'N' or 'n', and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array B must contain the matrix B,
otherwise the leading k×n part of the array must contain the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
single-precision scalar multiplier applied to C. If beta is zero, C does
not have to be a valid input.
C
single-precision array of dimensions (ldc, n). If uplo == 'U' or 'u',
the leading n×n triangular part of the array C must contain the upper
triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to
C = alpha * A * B^T + alpha * B * A^T + beta * C or
C = alpha * A^T * B + alpha * B^T * A + beta * C.
Reference: http://www.netlib.org/blas/ssyr2k.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasStrmm()
void
cublasStrmm (char side, char uplo, char transa,
char diag, int m, int n, float alpha,
const float *A, int lda, float *B,
int ldb)
performs one of the matrix-matrix operations
B = alpha * op(A) * B or B = alpha * B * op(A),
where op(A) = A or op(A) = A^T,
alpha is a single-precision scalar, B is an m×n matrix consisting of
single-precision elements, and A is a unit or non-unit, upper or lower
triangular matrix consisting of single-precision elements.
Matrices A and B are stored in column-major format, and lda and ldb
are the leading dimensions of the two-dimensional arrays that contain
A and B, respectively.
Input
side
specifies whether op(A) multiplies B from the left or right.
If side == 'L' or 'l', B = alpha * op(A) * B.
If side == 'R' or 'r', B = alpha * B * op(A).
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
transa
specifies the form of op(A) to be used in the matrix multiplication.
If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
m
the number of rows of matrix B; m must be at least zero.
n
the number of columns of matrix B; n must be at least zero.
alpha
single-precision scalar multiplier applied to op(A) * B or B * op(A),
respectively. If alpha is zero, no accesses are made to matrix A, and
no read accesses are made to matrix B.
A
single-precision array of dimensions (lda, k). If side == 'L' or 'l',
k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or 'u', the
leading k×k upper triangular part of the array A must contain the
upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading k×k lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
and are assumed to be unity.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
B
single-precision array of dimensions (ldb, n). On entry, the leading
m×n part of the array contains the matrix B. It is overwritten with the
transformed matrix on exit.
ldb
leading dimension of B; ldb must be at least max(1, m).
Output
B
updated according to B = alpha * op(A) * B or
B = alpha * B * op(A).
Reference: http://www.netlib.org/blas/strmm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasStrsm()
void
cublasStrsm (char side, char uplo, char transa,
char diag, int m, int n, float alpha,
const float *A, int lda, float *B, int ldb)
solves one of the matrix equations
op(A) * X = alpha * B or X * op(A) = alpha * B,
where op(A) = A or op(A) = A^T,
alpha is a single-precision scalar, and X and B are m×n matrices that
consist of single-precision elements. A is a unit or non-unit, upper or
lower, triangular matrix.
The result matrix X overwrites input matrix B; that is, on exit the result
is stored in B. Matrices A and B are stored in column-major format, and
lda and ldb are the leading dimensions of the two-dimensional arrays
that contain A and B, respectively.
Input
side
specifies whether op(A) appears on the left or right of X:
side == 'L' or 'l' indicates solve op(A) * X = alpha * B;
side == 'R' or 'r' indicates solve X * op(A) = alpha * B.
uplo
specifies whether the matrix A is an upper or lower triangular matrix:
uplo == 'U' or 'u' indicates A is an upper triangular matrix;
uplo == 'L' or 'l' indicates A is a lower triangular matrix.
transa
specifies the form of op(A) to be used in matrix multiplication.
If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
m
specifies the number of rows of B; m must be at least zero.
n
specifies the number of columns of B; n must be at least zero.
alpha
single-precision scalar multiplier applied to B. When alpha is zero, A is
not referenced and B does not have to be a valid input.
A
single-precision array of dimensions (lda, k), where k is m when
side == 'L' or 'l' and is n when side == 'R' or 'r'. If uplo ==
'U' or 'u', the leading k×k upper triangular part of the array A must
contain the upper triangular matrix, and the strictly lower triangular
part of A is not referenced. When uplo == 'L' or 'l', the leading
k×k lower triangular part of the array A must contain the lower
triangular matrix, and the strictly upper triangular part of A is not
referenced. Note that when diag == 'U' or 'u', the diagonal
elements of A are not referenced and are assumed to be unity.
lda
leading dimension of the two-dimensional array containing A.
When side == 'L' or 'l', lda must be at least max(1, m).
When side == 'R' or 'r', lda must be at least max(1, n).
B
single-precision array of dimensions (ldb, n); ldb must be at least
max(1, m). The leading m×n part of the array B must contain the
right-hand side matrix B. On exit B is overwritten by the solution
matrix X.
ldb
leading dimension of the two-dimensional array containing B; ldb
must be at least max(1, m).
Output
B
contains the solution matrix X satisfying op(A) * X = alpha * B or
X * op(A) = alpha * B.
Reference: http://www.netlib.org/blas/strsm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
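The effect of the diag and alpha parameters on the solve can be seen in a host-side reference for one configuration: side == 'L', uplo == 'L', transa == 'N'. This is a sketch under those assumptions only. The name ref_strsm_left_lower is hypothetical, and the code is not the CUBLAS implementation.

```c
#include <stddef.h>

/* Solves A * X = alpha * B by forward substitution, where A is an
 * m x m lower triangular matrix and B is m x n, both column-major.
 * On exit B holds X. With a unit diagonal the diagonal of A is never
 * read. No singularity test is made. */
static void ref_strsm_left_lower(char diag, int m, int n, float alpha,
                                 const float *A, int lda,
                                 float *B, int ldb)
{
    int unit = (diag == 'U' || diag == 'u');

    for (int j = 0; j < n; ++j) {
        float *b = &B[(size_t)j * (size_t)ldb];

        if (alpha == 0.0f) {
            /* The solution is zero; neither A nor the old B is read. */
            for (int i = 0; i < m; ++i)
                b[i] = 0.0f;
            continue;
        }
        for (int i = 0; i < m; ++i)
            b[i] *= alpha;

        for (int c = 0; c < m; ++c) {
            if (!unit)
                b[c] /= A[(size_t)c + (size_t)c * (size_t)lda];
            /* Remove the contribution of X[c,j] from the rows below. */
            for (int r = c + 1; r < m; ++r)
                b[r] -= b[c] * A[(size_t)r + (size_t)c * (size_t)lda];
        }
    }
}
```

Each column of B is solved independently, which is why the right-hand sides can be processed one column at a time.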
Single-Precision Complex BLAS3 Functions
These are the single-precision complex BLAS3 functions:
Function cublasCgemm()
Function cublasChemm()
Function cublasCherk()
Function cublasCher2k()
Function cublasCsymm()
Function cublasCsyrk()
Function cublasCsyr2k()
Function cublasCtrmm()
Function cublasCtrsm()
Function cublasCgemm()
void
cublasCgemm (char transa, char transb, int m, int n,
int k, cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
cuComplex beta, cuComplex *C, int ldc)
performs one of the matrix-matrix operations
C = alpha * op(A) * op(B) + beta * C,
where op(X) = X, op(X) = X^T, or op(X) = X^H;
and alpha and beta are single-precision complex scalars. A, B, and C
are matrices consisting of single-precision complex elements, with
op(A) an m×k matrix, op(B) a k×n matrix, and C an m×n matrix.
Input
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T' or 't', op(A) = A^T.
If transa == 'C' or 'c', op(A) = A^H.
transb
specifies op(B). If transb == 'N' or 'n', op(B) = B.
If transb == 'T' or 't', op(B) = B^T.
If transb == 'C' or 'c', op(B) = B^H.
m
number of rows of matrix op(A) and rows of matrix C;
m must be at least zero.
n
number of columns of matrix op(B) and number of columns of C;
n must be at least zero.
k
number of columns of matrix op(A) and number of rows of op(B);
k must be at least zero.
alpha
single-precision complex scalar multiplier applied to op(A) * op(B).
A
single-precision complex array of dimension (lda, k) if transa ==
'N' or 'n', and of dimension (lda, m) otherwise.
lda
leading dimension of A. When transa == 'N' or 'n', it must be at
least max(1, m) and at least max(1, k) otherwise.
B
single-precision complex array of dimension (ldb, n) if transb ==
'N' or 'n', and of dimension (ldb, k) otherwise.
ldb
leading dimension of B. When transb == 'N' or 'n', it must be at
least max(1, k) and at least max(1, n) otherwise.
beta
single-precision complex scalar multiplier applied to C. If beta is zero,
C does not have to be a valid input.
C
single-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * op(A) * op(B) + beta * C.
Reference: http://www.netlib.org/blas/cgemm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasChemm()
void
cublasChemm (char side, char uplo, int m, int n,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
cuComplex beta, cuComplex *C, int ldc)
performs one of the matrix-matrix operations
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are single-precision complex scalars, A is a
Hermitian matrix consisting of single-precision complex elements and
is stored in either lower or upper storage mode. B and C are m×n
matrices consisting of single-precision complex elements.
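Because a Hermitian matrix is stored in only one triangle, any element from the other triangle has to be reconstructed as the complex conjugate of its mirror image. The sketch below shows one way to read such an element on the host. The helper name herm_elem is hypothetical, the C99 type float complex stands in for cuComplex, and the diagonal is treated as real on the assumption that its imaginary parts are taken to be zero.

```c
#include <complex.h>
#include <stddef.h>

/* Returns element (i, j) of a Hermitian matrix of which only the
 * triangle selected by uplo is stored in the column-major array A. */
static float complex herm_elem(char uplo, const float complex *A, int lda,
                               int i, int j)
{
    int upper = (uplo == 'U' || uplo == 'u');

    if (i == j) /* diagonal: use the real part only */
        return crealf(A[(size_t)i + (size_t)i * (size_t)lda]);

    /* Is (i, j) inside the stored triangle? */
    int stored = upper ? (i < j) : (i > j);

    return stored ? A[(size_t)i + (size_t)j * (size_t)lda]
                  : conjf(A[(size_t)j + (size_t)i * (size_t)lda]);
}
```

With an accessor of this kind, a reference for C = alpha * A * B + beta * C reduces to the ordinary matrix product loop with A[i,p] replaced by herm_elem(uplo, A, lda, i, p).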
I nput
side
specifies whether the Hermitian matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', .
If side == 'R' or 'r', .
uplo
specifies whether the Hermitian matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of Hermitian matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of Hermitian
matrix A when side == 'R' or 'r'; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to A * B or B * A.
A
single-precision complex array of dimensions (lda, ka), where ka is
m when side == 'L' or 'l' and is n otherwise. If side == 'L' or
'l', the leading m×m part of array A must contain the Hermitian
matrix such that when uplo == 'U' or 'u', the leading m×m part
stores the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced; and when uplo ==
'L' or 'l', the leading m×m part stores the lower triangular part of the
Hermitian matrix, and the strictly upper triangular part is not
referenced. If side == 'R' or 'r', the leading n×n part of array A
must contain the Hermitian matrix such that when uplo == 'U' or
'u', the leading n×n part stores the upper triangular part of the
Hermitian matrix, and the strictly lower triangular part of A is not
referenced; and when uplo == 'L' or 'l', the leading n×n part stores
the lower triangular part of the Hermitian matrix, and the strictly
upper triangular part is not referenced. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
Reference: http://www.netlib.org/blas/chemm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCherk()
void
cublasCherk (char uplo, char trans, int n, int k,
float alpha, const cuComplex *A,
int lda, float beta, cuComplex *C,
int ldc)
performs one of the Hermitian rank-k operations
C = alpha * A * A^H + beta * C or C = alpha * A^H * A + beta * C,
where alpha and beta are single-precision real scalars. C is an n×n
Hermitian matrix consisting of single-precision complex elements and
is stored in either lower or upper storage mode. A is a matrix consisting
of single-precision complex elements with dimensions of n×k in the
first case and k×n in the second case.
Input (continued)
B
single-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
single-precision complex scalar multiplier applied to C. If beta is zero,
C does not have to be a valid input.
C
single-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * A * B + beta * C or
C = alpha * B * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
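The cublasChemm() parameter description can be illustrated with a small CPU reference. The sketch below is not the library's implementation: the names ref_chemm_left_upper and IDX are hypothetical, C99 float complex stands in for cuComplex, and only the case side == 'L', uplo == 'U' is covered. It shows how the strictly lower triangle of A is inferred from the upper triangle and how the imaginary parts of the diagonal are ignored.

```c
#include <complex.h>
#include <stddef.h>

/* Column-major element (i, j) of a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for C = alpha*A*B + beta*C, A Hermitian (m x m),
 * side == 'L', uplo == 'U'. Only the upper triangle of A is read. */
void ref_chemm_left_upper(int m, int n, float complex alpha,
                          const float complex *A, int lda,
                          const float complex *B, int ldb,
                          float complex beta,
                          float complex *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float complex acc = 0.0f;
            for (int p = 0; p < m; ++p) {
                float complex a;
                if (i < p)
                    a = A[IDX(i, p, lda)];          /* stored upper element */
                else if (i > p)
                    a = conjf(A[IDX(p, i, lda)]);   /* inferred lower element */
                else
                    a = crealf(A[IDX(i, i, lda)]);  /* diagonal treated as real */
                acc += a * B[IDX(p, j, ldb)];
            }
            /* When beta is zero, C is not read, so it need not be valid input. */
            float complex old = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + old;
        }
    }
}
```

A reference of this kind is mainly useful for checking small cases against the GPU result; it makes no attempt at performance.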

Input
uplo
specifies whether the Hermitian matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * A^H + beta * C. If trans == 'T', 't', 'C', or 'c',
C = alpha * A^H * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
single-precision real scalar multiplier applied to A * A^H or A^H * A.
A
single-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
single-precision real scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
Reference: http://www.netlib.org/blas/cherk.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCher2k()
void
cublasCher2k (char uplo, char trans, int n, int k,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
float beta, cuComplex *C, int ldc)
performs one of the Hermitian rank-2k operations
Input (continued)
C
single-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the Hermitian matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
Hermitian matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero, and on
exit they are set to zero.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * A^H + beta * C or
C = alpha * A^H * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
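The storage rules for cublasCherk() (only one triangle of C referenced, diagonal imaginary parts assumed zero on entry and set to zero on exit) can be sketched on the CPU as follows. This is an illustrative reference under stated assumptions, not the library's code: ref_cherk_upper_n and IDX are hypothetical names, C99 float complex stands in for cuComplex, and only uplo == 'U', trans == 'N' is handled.

```c
#include <complex.h>
#include <stddef.h>

/* Column-major element (i, j) of a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for C = alpha*A*A^H + beta*C with
 * uplo == 'U', trans == 'N'. A is n x k; only the upper triangle
 * of the n x n matrix C is read and written. */
void ref_cherk_upper_n(int n, int k, float alpha,
                       const float complex *A, int lda,
                       float beta, float complex *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i <= j; ++i) {            /* upper triangle only */
            float complex acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[IDX(i, p, lda)] * conjf(A[IDX(j, p, lda)]);
            float complex old = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            float complex val = alpha * acc + old;
            if (i == j)
                val = crealf(val);                /* diagonal is real on exit */
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

Because alpha and beta are real, dropping the imaginary part of a diagonal entry after the update is equivalent to ignoring it on entry.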
C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C or
C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C,
where alpha is a single-precision complex scalar and beta is a
single-precision real scalar. C is an n×n Hermitian matrix consisting of
single-precision complex elements and is stored in either lower or upper
storage mode. A and B are matrices consisting of single-precision
complex elements with dimensions of n×k in the first case and k×n in
the second case.

Input
uplo
specifies whether the Hermitian matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
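specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C. If trans == 'T',
't', 'C', or 'c', C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If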
trans == 'N' or 'n', n specifies the number of rows of matrices A
and B. If trans == 'T', 't', 'C', or 'c', n specifies the number of
columns of matrices A and B; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrices A and B. If trans == 'T', 't', 'C', or 'c', k specifies the
number of rows of matrices A and B; k must be at least zero.
alpha
single-precision complex scalar multiplier.
A
single-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
single-precision complex array of dimensions (ldb, kb), where kb is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array B contains the matrix B;
otherwise, the leading k×n part of the array contains the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
Reference: http://www.netlib.org/blas/cher2k.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCsymm()
void
cublasCsymm (char side, char uplo, int m, int n,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
cuComplex beta, cuComplex *C, int ldc)
performs one of the matrix-matrix operations
Input (continued)
beta
single-precision real scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
C
single-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the Hermitian matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
Hermitian matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero, and on
exit they are set to zero.
ldc
leading dimension of C; ldc must be at least max(1, n).
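Output
C
updated according to C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C
or C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C.
Error Status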
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are single-precision complex scalars, A is a
symmetric matrix consisting of single-precision complex elements and
is stored in either lower or upper storage mode. B and C are m×n
matrices consisting of single-precision complex elements.
Input
side
specifies whether the symmetric matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', C = alpha * A * B + beta * C.
If side == 'R' or 'r', C = alpha * B * A + beta * C.
uplo
specifies whether the symmetric matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of symmetric matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of symmetric
matrix A when side == 'R' or 'r'; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to A * B or B * A.
A
single-precision complex array of dimensions (lda, ka), where ka is
m when side == 'L' or 'l' and is n otherwise. If side == 'L' or
'l', the leading m×m part of array A must contain the symmetric
matrix such that when uplo == 'U' or 'u', the leading m×m part
stores the upper triangular part of the symmetric matrix, and the
strictly lower triangular part of A is not referenced; and when uplo ==
'L' or 'l', the leading m×m part stores the lower triangular part of the
symmetric matrix, and the strictly upper triangular part is not
referenced. If side == 'R' or 'r', the leading n×n part of array A
must contain the symmetric matrix such that when uplo == 'U' or
'u', the leading n×n part stores the upper triangular part of the
symmetric matrix, and the strictly lower triangular part of A is not
referenced; and when uplo == 'L' or 'l', the leading n×n part stores
the lower triangular part of the symmetric matrix, and the strictly
upper triangular part is not referenced.
Reference: http://www.netlib.org/blas/csymm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCsyrk()
void
cublasCsyrk (char uplo, char trans, int n, int k,
cuComplex alpha, const cuComplex *A,
int lda, cuComplex beta, cuComplex *C,
int ldc)
performs one of the symmetric rank-k operations
C = alpha * A * A^T + beta * C or C = alpha * A^T * A + beta * C,
where alpha and beta are single-precision complex scalars. C is an
n×n symmetric matrix consisting of single-precision complex elements
and is stored in either lower or upper storage mode. A is a matrix
consisting of single-precision complex elements with dimensions of
n×k in the first case and k×n in the second case.
Input (continued)
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
B
single-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
single-precision complex scalar multiplier applied to C. If beta is zero,
C does not have to be a valid input.
C
single-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * A * B + beta * C or
C = alpha * B * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU

Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * A^T + beta * C. If trans == 'T', 't', 'C', or 'c',
C = alpha * A^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
single-precision complex scalar multiplier applied to A * A^T or A^T * A.
A
single-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
single-precision complex scalar multiplier applied to C.
If beta is zero, C is not read.
Reference: http://www.netlib.org/blas/csyrk.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCsyr2k()
void
cublasCsyr2k (char uplo, char trans, int n, int k,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
cuComplex beta, cuComplex *C, int ldc)
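performs one of the symmetric rank-2k operations
C = alpha * A * B^T + alpha * B * A^T + beta * C or
C = alpha * A^T * B + alpha * B^T * A + beta * C,
where alpha and beta are single-precision complex scalars. C is an
n×n symmetric matrix consisting of single-precision complex elements
and is stored in either lower or upper storage mode. A and B are
Input (continued)
C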
single-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the symmetric matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
symmetric matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * A^T + beta * C or
C = alpha * A^T * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
matrices consisting of single-precision complex elements with
dimensions of n×k in the first case and k×n in the second case.

Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * B^T + alpha * B * A^T + beta * C. If trans == 'T',
't', 'C', or 'c', C = alpha * A^T * B + alpha * B^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrices A
and B. If trans == 'T', 't', 'C', or 'c', n specifies the number of
columns of matrices A and B; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrices A and B. If trans == 'T', 't', 'C', or 'c', k specifies the
number of rows of matrices A and B; k must be at least zero.
alpha
single-precision complex scalar multiplier.
A
single-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
single-precision complex array of dimensions (ldb, kb), where kb is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array B contains the matrix B;
otherwise, the leading k×n part of the array contains the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
single-precision complex scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
Reference: http://www.netlib.org/blas/csyr2k.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCtrmm()
void
cublasCtrmm (char side, char uplo, char transa,
char diag, int m, int n, cuComplex alpha,
const cuComplex *A, int lda,
const cuComplex *B, int ldb)
performs one of the matrix-matrix operations
B = alpha * op(A) * B or B = alpha * B * op(A),
where op(A) = A, op(A) = A^T, or op(A) = A^H;
alpha is a single-precision complex scalar; B is an m×n matrix
consisting of single-precision complex elements; and A is a unit or
non-unit, upper or lower triangular matrix consisting of single-precision
complex elements.
Matrices A and B are stored in column-major format, and lda and ldb
are the leading dimensions of the two-dimensional arrays that contain
A and B, respectively.
Input (continued)
C
single-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the symmetric matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
symmetric matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * B^T + alpha * B * A^T + beta * C
or C = alpha * A^T * B + alpha * B^T * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Input
side
specifies whether op(A) multiplies B from the left or right.
If side == 'L' or 'l', B = alpha * op(A) * B.
If side == 'R' or 'r', B = alpha * B * op(A).
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T' or 't', op(A) = A^T.
If transa == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
m
the number of rows of matrix B; m must be at least zero.
n
the number of columns of matrix B; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to op(A)*B or
B*op(A), respectively. If alpha is zero, no accesses are made to
matrix A, and no read accesses are made to matrix B.
A
single-precision complex array of dimensions (lda, k). If side ==
'L' or 'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or
'u', the leading k×k upper triangular part of the array A must contain
the upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading k×k lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
and are assumed to be unity.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
Reference: http://www.netlib.org/blas/ctrmm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasCtrsm()
void
cublasCtrsm (char side, char uplo, char transa,
char diag, int m, int n, cuComplex alpha,
const cuComplex *A, int lda, cuComplex *B,
int ldb)
solves one of the matrix equations
op(A) * X = alpha * B or X * op(A) = alpha * B,
where op(A) = A, op(A) = A^T, or op(A) = A^H;
alpha is a single-precision complex scalar, and X and B are m×n
matrices that consist of single-precision complex elements. A is a unit
or non-unit, upper or lower, triangular matrix.
The result matrix X overwrites input matrix B; that is, on exit the result
is stored in B. Matrices A and B are stored in column-major format, and
Input (continued)
B
single-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B. It is overwritten
with the transformed matrix on exit.
ldb
leading dimension of B; ldb must be at least max(1, m).
Output
B
updated according to B = alpha * op(A) * B or
B = alpha * B * op(A).
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
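The in-place update performed by cublasCtrmm() can be sketched with a CPU reference. The code below is only an illustration, not the library's implementation: ref_ctrmm_lun and IDX are hypothetical names, C99 float complex stands in for cuComplex, and only side == 'L', uplo == 'U', transa == 'N' is handled. The unit_diag flag plays the role of diag == 'U'.

```c
#include <complex.h>
#include <stddef.h>

/* Column-major element (i, j) of a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for B = alpha * A * B with A upper triangular
 * (m x m), side == 'L', transa == 'N'. B (m x n) is overwritten in place. */
void ref_ctrmm_lun(int unit_diag, int m, int n, float complex alpha,
                   const float complex *A, int lda,
                   float complex *B, int ldb)
{
    for (int j = 0; j < n; ++j) {
        /* Top-down order: row i only needs rows p >= i, which are still
         * the original values when row i is computed. */
        for (int i = 0; i < m; ++i) {
            float complex acc = unit_diag
                ? B[IDX(i, j, ldb)]                      /* diagonal taken as 1 */
                : A[IDX(i, i, lda)] * B[IDX(i, j, ldb)];
            for (int p = i + 1; p < m; ++p)
                acc += A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            B[IDX(i, j, ldb)] = alpha * acc;
        }
    }
}
```

The strictly lower triangle of A is never read, and with unit_diag set the diagonal of A is not read either, matching the description of the A and diag parameters.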
lda and ldb are the leading dimensions of the two-dimensional arrays
that contain A and B, respectively.
Input
side
specifies whether op(A) appears on the left or right of X:
side == 'L' or 'l' indicates solve op(A) * X = alpha * B;
side == 'R' or 'r' indicates solve X * op(A) = alpha * B.
uplo
specifies whether the matrix A is an upper or lower triangular matrix:
uplo == 'U' or 'u' indicates A is an upper triangular matrix;
uplo == 'L' or 'l' indicates A is a lower triangular matrix.
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T' or 't', op(A) = A^T.
If transa == 'C' or 'c', op(A) = A^H.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
m
specifies the number of rows of B; m must be at least zero.
n
specifies the number of columns of B; n must be at least zero.
alpha
single-precision complex scalar multiplier applied to B. When alpha is
zero, A is not referenced and B does not have to be a valid input.
A
single-precision complex array of dimensions (lda, k), where k is m
when side == 'L' or 'l' and is n when side == 'R' or 'r'. If
uplo == 'U' or 'u', the leading k×k upper triangular part of the array
A must contain the upper triangular matrix, and the strictly lower
triangular part of A is not referenced. When uplo == 'L' or 'l',
the leading k×k lower triangular part of the array A must contain the
lower triangular matrix, and the strictly upper triangular part of A is
not referenced. Note that when diag == 'U' or 'u', the diagonal
elements of A are not referenced and are assumed to be unity.
lda
leading dimension of the two-dimensional array containing A.
When side == 'L' or 'l', lda must be at least max(1, m).
When side == 'R' or 'r', lda must be at least max(1, n).
B
single-precision complex array of dimensions (ldb, n); ldb must be
at least max(1, m). The leading m×n part of the array B must contain
the right-hand side matrix B. On exit, B is overwritten by the solution
matrix X.
ldb
leading dimension of the two-dimensional array containing B; ldb
must be at least max(1, m).
Reference: http://www.netlib.org/blas/ctrsm.f
Error status for this function can be retrieved via cublasGetError().
Output
B
contains the solution matrix X satisfying op(A) * X = alpha * B or
X * op(A) = alpha * B.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
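The triangular solve described for cublasCtrsm() amounts to a substitution per column of B. The following CPU sketch is offered as an illustration only: ref_ctrsm_lun and IDX are hypothetical names, C99 float complex stands in for cuComplex, and only side == 'L', uplo == 'U', transa == 'N' is covered (back substitution). The unit_diag flag plays the role of diag == 'U'.

```c
#include <complex.h>
#include <stddef.h>

/* Column-major element (i, j) of a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference that solves A * X = alpha * B for X, with A
 * upper triangular (m x m). The solution X overwrites B (m x n). */
void ref_ctrsm_lun(int unit_diag, int m, int n, float complex alpha,
                   const float complex *A, int lda,
                   float complex *B, int ldb)
{
    for (int j = 0; j < n; ++j) {
        if (alpha == 0.0f) {
            /* alpha == 0: A is not referenced and B is not read. */
            for (int i = 0; i < m; ++i)
                B[IDX(i, j, ldb)] = 0.0f;
            continue;
        }
        for (int i = 0; i < m; ++i)
            B[IDX(i, j, ldb)] *= alpha;
        /* Back substitution: the last row is solved first. */
        for (int i = m - 1; i >= 0; --i) {
            float complex x = B[IDX(i, j, ldb)];
            for (int p = i + 1; p < m; ++p)
                x -= A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            if (!unit_diag)
                x /= A[IDX(i, i, lda)];
            B[IDX(i, j, ldb)] = x;
        }
    }
}
```

The sketch does not check for zero diagonal entries; a singular A produces non-finite values in the result.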
Double-Precision BLAS3 Functions
Note: Double-precision functions are only supported on GPUs with
double-precision hardware.
The double-precision BLAS3 functions are listed below:
Function cublasDgemm()
Function cublasDsymm()
Function cublasDsyrk()
Function cublasDsyr2k()
Function cublasDtrmm()
Function cublasDtrsm()
Function cublasDgemm()
void
cublasDgemm (char transa, char transb, int m, int n,
int k, double alpha, const double *A,
int lda, const double *B, int ldb,
double beta, double *C, int ldc)
computes the product of matrix A and matrix B, multiplies the result
by scalar alpha, and adds the result to the product of matrix C and
scalar beta. It performs one of the matrix-matrix operations:
C = alpha * op(A) * op(B) + beta * C,
where op(X) = X or op(X) = X^T,
and alpha and beta are double-precision scalars. A, B, and C are
matrices consisting of double-precision elements, with op(A) an m×k
matrix, op(B) a k×n matrix, and C an m×n matrix. Matrices A, B, and C
are stored in column-major format, and lda, ldb, and ldc are the
leading dimensions of the two-dimensional arrays containing A, B,
and C.
Input
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
transb
specifies op(B). If transb == 'N' or 'n', op(B) = B.
If transb == 'T', 't', 'C', or 'c', op(B) = B^T.
m
number of rows of matrix op(A) and rows of matrix C; m must be at
least zero.
n
number of columns of matrix op(B) and number of columns of C;
n must be at least zero.
k
number of columns of matrix op(A) and number of rows of op(B);
k must be at least zero.
alpha
double-precision scalar multiplier applied to op(A) * op(B).
A
double-precision array of dimensions (lda, k) if transa == 'N' or
'n', and of dimensions (lda, m) otherwise. If transa == 'N' or
'n', lda must be at least max(1, m); otherwise, lda must be at least
max(1, k).
lda
leading dimension of two-dimensional array used to store matrix A.
Reference: http://www.netlib.org/blas/dgemm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasDsymm()
void
cublasDsymm (char side, char uplo, int m, int n,
double alpha, const double *A, int lda,
const double *B, int ldb, double beta,
double *C, int ldc)
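performs one of the matrix-matrix operations
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are double-precision scalars, A is a symmetric
matrix consisting of double-precision elements and is stored in either
Input (continued)
B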
double-precision array of dimensions (ldb, n) if transb == 'N' or
'n', and of dimensions (ldb, k) otherwise. If transb == 'N' or
'n', ldb must be at least max(1, k); otherwise, ldb must be at least
max(1, n).
ldb
leading dimension of two-dimensional array used to store matrix B.
beta
double-precision scalar multiplier applied to C. If zero, C does not have
to be a valid input.
C
double-precision array of dimensions (ldc, n); ldc must be at least
max (1, m).
ldc
leading dimension of two-dimensional array used to store matrix C.
Output
C
updated based on C = alpha * op(A) * op(B) + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
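The roles of transa, transb, and the leading dimensions in cublasDgemm() can be illustrated with a plain CPU reference. The sketch below is not the library's implementation and performs no argument validation; ref_dgemm and IDX are hypothetical names. It follows the column-major convention described for this function, where element (i, j) of a matrix with leading dimension ld is stored at offset j * ld + i.

```c
#include <stddef.h>

/* Column-major element (i, j) of a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for C = alpha * op(A) * op(B) + beta * C,
 * where op(A) is m x k, op(B) is k x n, and C is m x n. */
void ref_dgemm(char transa, char transb, int m, int n, int k,
               double alpha, const double *A, int lda,
               const double *B, int ldb,
               double beta, double *C, int ldc)
{
    int ta = !(transa == 'N' || transa == 'n');  /* nonzero: op(A) = A^T */
    int tb = !(transb == 'N' || transb == 'n');  /* nonzero: op(B) = B^T */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p) {
                double a = ta ? A[IDX(p, i, lda)] : A[IDX(i, p, lda)];
                double b = tb ? B[IDX(j, p, ldb)] : B[IDX(p, j, ldb)];
                acc += a * b;
            }
            /* When beta is zero, C is not read, so it need not be valid input. */
            double old = (beta == 0.0) ? 0.0 : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + old;
        }
    }
}
```

Transposition is handled purely by swapping the index order when reading A or B, which is why the required minimum for lda and ldb depends on transa and transb.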
lower or upper storage mode. B and C are m×n matrices consisting of
double-precision elements.
Input
side
specifies whether the symmetric matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', C = alpha * A * B + beta * C.
If side == 'R' or 'r', C = alpha * B * A + beta * C.
uplo
specifies whether the symmetric matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of symmetric matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of symmetric
matrix A when side == 'R' or 'r''; n must be at least zero.
alpha
double-precision scalar multiplier applied to A * B or B * A.
A
double-precision array of dimensions (lda, ka), where ka is m when
side == 'L' or 'l' and is n otherwise. If side == 'L' or 'l', the
leading mm part of array A must contain the symmetric matrix such
that when uplo == 'U' or 'u', the leading mm part stores the upper
triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced; and when uplo == 'L' or 'l',
the leading mm part stores the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced. If
side == 'R' or 'r', the leading nn part of array A must contain the
symmetric matrix such that when uplo == 'U' or 'u', the leading
nn part stores the upper triangular part of the symmetric matrix, and
the strictly lower triangular part of A is not referenced; and when
uplo == 'L' or 'l', the leading nn part stores the lower triangular
part of the symmetric matrix, and the strictly upper triangular part is
not referenced.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
C alpha A B beta C * + * * =
C alpha B A beta C * + * * =
B
double-precision array of dimensions (ldb, n). On entry, the leading
m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
double-precision scalar multiplier applied to C. If beta is zero, C does
not have to be a valid input.
C
double-precision array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * A * B + beta * C or
C = alpha * B * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/dsymm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasDsyrk()
void
cublasDsyrk (char uplo, char trans, int n, int k,
double alpha, const double *A, int lda,
double beta, double *C, int ldc)
performs one of the symmetric rank k operations
C = alpha * A * A^T + beta * C or C = alpha * A^T * A + beta * C,
where alpha and beta are double-precision scalars. C is an n×n
symmetric matrix consisting of double-precision elements and is
stored in either lower or upper storage mode. A is a matrix consisting
of double-precision elements with dimensions of n×k in the first case
and k×n in the second case.
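As a rough illustration of the rank-k update for cublasDsyrk(), the following plain C sketch computes the same result on the CPU. It is not the CUBLAS implementation: `ref_dsyrk` and `IDX` are hypothetical names, and the code assumes column-major storage and that only the triangle of C selected by uplo is read or written.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference (not the library code):
   trans 'N' or 'n': C = alpha * A * A^T + beta * C, with A of size n x k;
   otherwise:        C = alpha * A^T * A + beta * C, with A of size k x n. */
void ref_dsyrk(char uplo, char trans, int n, int k, double alpha,
               const double *A, int lda, double beta, double *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    int upper = (uplo == 'U' || uplo == 'u');
    int notrans = (trans == 'N' || trans == 'n');
    for (int j = 0; j < n; ++j) {
        /* Rows of column j that belong to the selected triangle. */
        int i_begin = upper ? 0 : j;
        int i_end = upper ? j + 1 : n;
        for (int i = i_begin; i < i_end; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                double a_ip = notrans ? A[IDX(i, p, lda)] : A[IDX(p, i, lda)];
                double a_jp = notrans ? A[IDX(j, p, lda)] : A[IDX(p, j, lda)];
                sum += a_ip * a_jp;
            }
            /* When beta is zero, C is not read. */
            double old = (beta == 0.0) ? 0.0 : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * sum + old;
        }
    }
}
```

Entries of C outside the selected triangle are left untouched, which matches the storage convention stated for uplo.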
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * A^T + beta * C. If trans == 'T', 't', 'C', or 'c',
C = alpha * A^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
double-precision scalar multiplier applied to A * A^T or A^T * A.
A
double-precision array of dimensions (lda, ka), where ka is k when
trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A contains the matrix A; otherwise,
the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
double-precision scalar multiplier applied to C.
If beta is zero, C is not read.
C
double-precision array of dimensions (ldc, n). If uplo == 'U' or
'u', the leading n×n triangular part of the array C must contain the
upper triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * A^T + beta * C or
C = alpha * A^T * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/dsyrk.f
Error status for this function can be retrieved via cublasGetError().
Function cublasDsyr2k()
void
cublasDsyr2k (char uplo, char trans, int n, int k,
double alpha, const double *A, int lda,
const double *B, int ldb, double beta,
double *C, int ldc)
performs one of the symmetric rank 2k operations
C = alpha * A * B^T + alpha * B * A^T + beta * C or
C = alpha * A^T * B + alpha * B^T * A + beta * C,
where alpha and beta are double-precision scalars. C is an n×n
symmetric matrix consisting of double-precision elements and is
stored in either lower or upper storage mode. A and B are matrices
consisting of double-precision elements with dimension of n×k in the
first case and k×n in the second case.
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * B^T + alpha * B * A^T + beta * C. If trans == 'T',
't', 'C', or 'c', C = alpha * A^T * B + alpha * B^T * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of matrix
A. If trans == 'T', 't', 'C', or 'c', k specifies the number of rows
of matrix A; k must be at least zero.
alpha
double-precision scalar multiplier.
A
double-precision array of dimensions (lda, ka), where ka is k when
trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A must contain the matrix A,
otherwise the leading k×n part of the array must contain the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
double-precision array of dimensions (ldb, kb), where kb = k when
trans == 'N' or 'n', and kb = n otherwise. When trans == 'N' or
'n', the leading n×k part of array B must contain the matrix B,
otherwise the leading k×n part of the array must contain the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
double-precision scalar multiplier applied to C. If beta is zero, C does
not have to be a valid input.
C
double-precision array of dimensions (ldc, n). If uplo == 'U' or
'u', the leading n×n triangular part of the array C must contain the
upper triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to
C = alpha * A * B^T + alpha * B * A^T + beta * C or
C = alpha * A^T * B + alpha * B^T * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/dsyr2k.f
Error status for this function can be retrieved via cublasGetError().
Function cublasDtrmm()
void
cublasDtrmm (char side, char uplo, char transa,
char diag, int m, int n, double alpha,
const double *A, int lda, double *B,
int ldb)
performs one of the matrix-matrix operations
B = alpha * op(A) * B or B = alpha * B * op(A),
where op(A) = A or op(A) = A^T,
alpha is a double-precision scalar, B is an m×n matrix consisting of
double-precision elements, and A is a unit or non-unit, upper or lower
triangular matrix consisting of double-precision elements.
Matrices A and B are stored in column-major format, and lda and ldb
are the leading dimensions of the two-dimensional arrays that contain
A and B, respectively.
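The in-place nature of this operation is easiest to see in code. The sketch below is a CPU illustration in plain C restricted to the left-side case, B = alpha * op(A) * B. It is not the CUBLAS implementation: `ref_dtrmm_left` and `IDX` are hypothetical names, and the row ordering is simply one way to overwrite B without a temporary buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for side 'L' only (not the library code):
   B = alpha * op(A) * B, where A is m x m triangular and B is m x n. */
void ref_dtrmm_left(char uplo, char transa, char diag, int m, int n,
                    double alpha, const double *A, int lda,
                    double *B, int ldb)
{
    assert(m >= 0 && n >= 0);
    int upper = (uplo == 'U' || uplo == 'u');
    int notrans = (transa == 'N' || transa == 'n');
    int unit = (diag == 'U' || diag == 'u');
    if (alpha == 0.0) {
        /* With alpha equal to zero, A is not accessed and B is not read. */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                B[IDX(i, j, ldb)] = 0.0;
        return;
    }
    /* op(A) is upper triangular when A is upper and not transposed,
       or when A is lower and transposed. */
    int op_upper = (upper == notrans);
    for (int j = 0; j < n; ++j) {
        double *b = &B[IDX(0, j, ldb)];
        for (int s = 0; s < m; ++s) {
            /* Visit rows in an order that never overwrites an entry still needed. */
            int i = op_upper ? s : m - 1 - s;
            int p_begin = op_upper ? i + 1 : 0;
            int p_end = op_upper ? m : i;
            /* For a unit triangular matrix the diagonal of A is not referenced. */
            double sum = unit ? b[i] : A[IDX(i, i, lda)] * b[i];
            for (int p = p_begin; p < p_end; ++p) {
                double a = notrans ? A[IDX(i, p, lda)] : A[IDX(p, i, lda)];
                sum += a * b[p];
            }
            b[i] = alpha * sum;
        }
    }
}
```

The right-side case, B = alpha * B * op(A), follows the same pattern with the roles of rows and columns exchanged.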
Input
side
specifies whether op(A) multiplies B from the left or right.
If side == 'L' or 'l', B = alpha * op(A) * B.
If side == 'R' or 'r', B = alpha * B * op(A).
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
transa
specifies the form of op(A) to be used in the matrix multiplication.
If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
m
the number of rows of matrix B; m must be at least zero.
n
the number of columns of matrix B; n must be at least zero.
alpha
double-precision scalar multiplier applied to op(A)*B or B*op(A),
respectively. If alpha is zero, no accesses are made to matrix A, and
no read accesses are made to matrix B.
A
double-precision array of dimensions (lda, k). If side == 'L' or
'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or 'u', the
leading k×k upper triangular part of the array A must contain the
upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading k×k lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
and are assumed to be unity.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
B
double-precision array of dimensions (ldb, n). On entry, the leading
m×n part of the array contains the matrix B. It is overwritten with the
transformed matrix on exit.
ldb
leading dimension of B; ldb must be at least max(1, m).
Output
B
updated according to B = alpha * op(A) * B or
B = alpha * B * op(A).
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/dtrmm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasDtrsm()
void
cublasDtrsm (char side, char uplo, char transa,
char diag, int m, int n, double alpha,
const double *A, int lda, double *B,
int ldb)
solves one of the matrix equations
op(A) * X = alpha * B or X * op(A) = alpha * B,
where op(A) = A or op(A) = A^T,
alpha is a double-precision scalar, and X and B are m×n matrices that
consist of double-precision elements. A is a unit or non-unit, upper or
lower, triangular matrix.
The result matrix X overwrites input matrix B; that is, on exit the result
is stored in B. Matrices A and B are stored in column-major format, and
lda and ldb are the leading dimensions of the two-dimensional arrays
that contain A and B, respectively.
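For the left-side equation op(A) * X = alpha * B, the solve amounts to forward or back substitution applied to each column of B. The following plain C sketch shows that idea on the CPU. It is not the CUBLAS implementation: `ref_dtrsm_left` and `IDX` are hypothetical names, and the code assumes A is nonsingular when diag selects a non-unit diagonal.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference for side 'L' only (not the library code):
   solves op(A) * X = alpha * B and stores X in B. */
void ref_dtrsm_left(char uplo, char transa, char diag, int m, int n,
                    double alpha, const double *A, int lda,
                    double *B, int ldb)
{
    assert(m >= 0 && n >= 0);
    int upper = (uplo == 'U' || uplo == 'u');
    int notrans = (transa == 'N' || transa == 'n');
    int unit = (diag == 'U' || diag == 'u');
    if (alpha == 0.0) {
        /* With alpha equal to zero, A is not referenced and X is zero. */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                B[IDX(i, j, ldb)] = 0.0;
        return;
    }
    /* op(A) is upper triangular when A is upper and not transposed,
       or when A is lower and transposed. */
    int op_upper = (upper == notrans);
    for (int j = 0; j < n; ++j) {
        double *x = &B[IDX(0, j, ldb)];
        for (int s = 0; s < m; ++s) {
            /* Back substitution for upper op(A), forward substitution for lower. */
            int i = op_upper ? m - 1 - s : s;
            int p_begin = op_upper ? i + 1 : 0;
            int p_end = op_upper ? m : i;
            double rhs = alpha * x[i];
            for (int p = p_begin; p < p_end; ++p) {
                double a = notrans ? A[IDX(i, p, lda)] : A[IDX(p, i, lda)];
                rhs -= a * x[p]; /* x[p] already holds a solved entry */
            }
            /* For a unit triangular matrix the diagonal of A is not referenced. */
            x[i] = unit ? rhs : rhs / A[IDX(i, i, lda)];
        }
    }
}
```

Each column of B is solved independently of the others.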
Input
side
specifies whether op(A) appears on the left or right of X:
side == 'L' or 'l' indicates solve op(A) * X = alpha * B;
side == 'R' or 'r' indicates solve X * op(A) = alpha * B.
uplo
specifies whether the matrix A is an upper or lower triangular matrix:
uplo == 'U' or 'u' indicates A is an upper triangular matrix;
uplo == 'L' or 'l' indicates A is a lower triangular matrix.
transa
specifies the form of op(A) to be used in matrix multiplication.
If transa == 'N' or 'n', op(A) = A.
If transa == 'T', 't', 'C', or 'c', op(A) = A^T.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
m
specifies the number of rows of B; m must be at least zero.
n
specifies the number of columns of B; n must be at least zero.
alpha
double-precision scalar multiplier applied to B. When alpha is zero, A
is not referenced and B does not have to be a valid input.
A
double-precision array of dimensions (lda, k), where k is m when
side == 'L' or 'l' and is n when side == 'R' or 'r'. If uplo ==
'U' or 'u', the leading k×k upper triangular part of the array A must
contain the upper triangular matrix, and the strictly lower triangular
part of A is not referenced. When uplo == 'L' or 'l', the leading
k×k lower triangular part of the array A must contain the lower
triangular matrix, and the strictly upper triangular part of A is not
referenced. Note that when diag == 'U' or 'u', the diagonal
elements of A are not referenced and are assumed to be unity.
lda
leading dimension of the two-dimensional array containing A.
When side == 'L' or 'l', lda must be at least max(1, m).
When side == 'R' or 'r', lda must be at least max(1, n).
B
double-precision array of dimensions (ldb, n); ldb must be at least
max(1, m). The leading m×n part of the array B must contain the
right-hand side matrix B. On exit, B is overwritten by the solution matrix X.
ldb
leading dimension of the two-dimensional array containing B; ldb
must be at least max(1, m).
Output
B
contains the solution matrix X satisfying op(A) * X = alpha * B or
X * op(A) = alpha * B.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/dtrsm.f
Error status for this function can be retrieved via cublasGetError().
Double-Precision Complex BLAS3 Functions
Note: Double-precision functions are only supported on GPUs with
double-precision hardware.
Nine double-precision complex BLAS3 functions are implemented:
Function cublasZgemm()
Function cublasZhemm()
Function cublasZherk()
Function cublasZher2k()
Function cublasZsymm()
Function cublasZsyrk()
Function cublasZsyr2k()
Function cublasZtrmm()
Function cublasZtrsm()
Function cublasZgemm()
void
cublasZgemm (char transa, char transb, int m, int n,
int k, cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
cuDoubleComplex beta, cuDoubleComplex *C,
int ldc)
performs one of the matrix-matrix operations
C = alpha * op(A) * op(B) + beta * C,
where op(X) = X, op(X) = X^T, or op(X) = X^H;
and alpha and beta are double-precision complex scalars. A, B, and C
are matrices consisting of double-precision complex elements, with
op(A) an m×k matrix, op(B) a k×n matrix, and C an m×n matrix.
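The three forms of op(X) can be made concrete with a small CPU sketch in plain C. It is not the CUBLAS implementation: `ref_zgemm`, `op_get`, and `IDX` are hypothetical names, and C99 `double complex` stands in for cuDoubleComplex purely for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major offset of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Element (i, j) of op(M): M for 'N', M^T for 'T', M^H for 'C'. */
static double complex op_get(const double complex *M, int ld, char op,
                             int i, int j)
{
    if (op == 'N' || op == 'n')
        return M[IDX(i, j, ld)];
    if (op == 'T' || op == 't')
        return M[IDX(j, i, ld)];
    return conj(M[IDX(j, i, ld)]);
}

/* Hypothetical CPU reference (not the library code):
   C = alpha * op(A) * op(B) + beta * C, with op(A) m x k and op(B) k x n. */
void ref_zgemm(char transa, char transb, int m, int n, int k,
               double complex alpha, const double complex *A, int lda,
               const double complex *B, int ldb,
               double complex beta, double complex *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double complex sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += op_get(A, lda, transa, i, p) * op_get(B, ldb, transb, p, j);
            /* When beta is zero, C is not read, so it need not be valid input. */
            double complex old = (beta == 0.0) ? 0.0 : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * sum + old;
        }
    }
}
```

Note that 'T' and 'C' differ only in the conjugation, which is why they give different results for complex data but coincide for real data.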
Input
transa
specifies op(A). If transa == 'N' or 'n', op(A) = A.
If transa == 'T' or 't', op(A) = A^T.
If transa == 'C' or 'c', op(A) = A^H.
transb
specifies op(B). If transb == 'N' or 'n', op(B) = B.
If transb == 'T' or 't', op(B) = B^T.
If transb == 'C' or 'c', op(B) = B^H.
m
number of rows of matrix op(A) and rows of matrix C;
m must be at least zero.
n
number of columns of matrix op(B) and number of columns of C;
n must be at least zero.
k
number of columns of matrix op(A) and number of rows of op(B);
k must be at least zero.
alpha
double-precision complex scalar multiplier applied to op(A)*op(B).
A
double-precision complex array of dimension (lda, k) if transa ==
'N' or 'n', and of dimension (lda, m) otherwise.
lda
leading dimension of A. When transa == 'N' or 'n', it must be at
least max(1, m) and at least max(1, k) otherwise.
B
double-precision complex array of dimension (ldb, n) if transb ==
'N' or 'n', and of dimension (ldb, k) otherwise.
ldb
leading dimension of B. When transb == 'N' or 'n', it must be at
least max(1, k) and at least max(1, n) otherwise.
beta
double-precision complex scalar multiplier applied to C. If beta is
zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * op(A) * op(B) + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0, n < 0, or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/zgemm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasZhemm()
void
cublasZhemm (char side, char uplo, int m, int n,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
cuDoubleComplex beta, cuDoubleComplex *C,
int ldc)
performs one of the matrix-matrix operations
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are double-precision complex scalars, A is a
Hermitian matrix consisting of double-precision complex elements
and is stored in either lower or upper storage mode. B and C are m×n
matrices consisting of double-precision complex elements.
Input
side
specifies whether the Hermitian matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', C = alpha * A * B + beta * C.
If side == 'R' or 'r', C = alpha * B * A + beta * C.
uplo
specifies whether the Hermitian matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of Hermitian matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of Hermitian
matrix A when side == 'R' or 'r'; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to A * B or B * A.
A
double-precision complex array of dimensions (lda, ka), where ka is
m when side == 'L' or 'l' and is n otherwise. If side == 'L' or
'l', the leading m×m part of array A must contain the Hermitian
matrix such that when uplo == 'U' or 'u', the leading m×m part
stores the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced; and when uplo ==
'L' or 'l', the leading m×m part stores the lower triangular part of the
Hermitian matrix, and the strictly upper triangular part is not
referenced. If side == 'R' or 'r', the leading n×n part of array A
must contain the Hermitian matrix such that when uplo == 'U' or
'u', the leading n×n part stores the upper triangular part of the
Hermitian matrix, and the strictly lower triangular part of A is not
referenced; and when uplo == 'L' or 'l', the leading n×n part stores
the lower triangular part of the Hermitian matrix, and the strictly
upper triangular part is not referenced. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
B
double-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
double-precision complex scalar multiplier applied to C. If beta is
zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Output
C
updated according to C = alpha * A * B + beta * C or
C = alpha * B * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/zhemm.f
Error status for this function can be retrieved via cublasGetError().
Function cublasZherk()
void
cublasZherk (char uplo, char trans, int n, int k,
double alpha, const cuDoubleComplex *A,
int lda, double beta, cuDoubleComplex *C,
int ldc)
performs one of the Hermitian rank k operations
C = alpha * A * A^H + beta * C or C = alpha * A^H * A + beta * C,
where alpha and beta are double-precision scalars. C is an n×n
Hermitian matrix consisting of double-precision complex elements
and is stored in either lower or upper storage mode. A is a matrix
consisting of double-precision complex elements with dimensions of
n×k in the first case and k×n in the second case.
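A CPU sketch in plain C can illustrate how the Hermitian rank-k update differs from the symmetric one: the second factor is conjugated, alpha and beta are real, and the diagonal of C stays real. This is not the CUBLAS implementation: `ref_zherk` and `IDX` are hypothetical names, and C99 `double complex` stands in for cuDoubleComplex.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major offset of element (i, j) in an array with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference (not the library code):
   trans 'N' or 'n': C = alpha * A * A^H + beta * C, with A of size n x k;
   otherwise:        C = alpha * A^H * A + beta * C, with A of size k x n.
   Only the triangle of C selected by uplo is read and written. */
void ref_zherk(char uplo, char trans, int n, int k, double alpha,
               const double complex *A, int lda, double beta,
               double complex *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    int upper = (uplo == 'U' || uplo == 'u');
    int notrans = (trans == 'N' || trans == 'n');
    for (int j = 0; j < n; ++j) {
        int i_begin = upper ? 0 : j;
        int i_end = upper ? j + 1 : n;
        for (int i = i_begin; i < i_end; ++i) {
            double complex sum = 0.0;
            for (int p = 0; p < k; ++p) {
                /* Row i and row j of op(A), where op(A) is A or A^H. */
                double complex a_ip = notrans ? A[IDX(i, p, lda)]
                                              : conj(A[IDX(p, i, lda)]);
                double complex a_jp = notrans ? A[IDX(j, p, lda)]
                                              : conj(A[IDX(p, j, lda)]);
                sum += a_ip * conj(a_jp);
            }
            double complex val = alpha * sum;
            if (beta != 0.0) {
                double complex c_old = C[IDX(i, j, ldc)];
                if (i == j)
                    c_old = creal(c_old); /* imaginary part assumed zero */
                val += beta * c_old;
            }
            if (i == j)
                val = creal(val); /* diagonal imaginary part set to zero */
            C[IDX(i, j, ldc)] = val;
        }
    }
}
```

The explicit handling of the diagonal mirrors the statement in the parameter description that imaginary parts of diagonal elements are assumed to be zero on entry and are set to zero on exit.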
Input
uplo
specifies whether the Hermitian matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * A^H + beta * C. If trans == 'T', 't', 'C', or 'c',
C = alpha * A^H * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
double-precision scalar multiplier applied to A * A^H or A^H * A.
A
double-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
double-precision scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the Hermitian matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
Hermitian matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero, and on
exit they are set to zero.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to C = alpha * A * A^H + beta * C or
C = alpha * A^H * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/zherk.f
Error status for this function can be retrieved via cublasGetError().
Function cublasZher2k()
void
cublasZher2k (char uplo, char trans, int n, int k,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
const cuDoubleComplex *B, int ldb,
double beta, cuDoubleComplex *C, int ldc)
performs one of the Hermitian rank 2k operations
C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C or
C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C,
where alpha is a double-precision complex scalar and beta is a
double-precision real scalar. C is an n×n Hermitian matrix consisting of
double-precision complex elements and is stored in either lower or
upper storage mode. A and B are matrices consisting of
double-precision complex elements with dimensions of n×k in the first case
and k×n in the second case.
Input
uplo
specifies whether the Hermitian matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the Hermitian matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
Hermitian matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C. If trans == 'T',
't', 'C', or 'c', C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
double-precision complex multiplier.
A
double-precision complex array of dimensions (lda, ka), where ka is k
when trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A contains the matrix A; otherwise,
the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
double-precision complex array of dimensions (ldb, kb), where kb is k
when trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array B contains the matrix B; otherwise,
the leading k×n part of the array contains the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
double-precision real scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n). If uplo == 'U' or
'u', the leading n×n triangular part of the array C must contain the
upper triangular part of the Hermitian matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the Hermitian matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix. The imaginary parts of the diagonal elements
need not be set; they are assumed to be zero, and on exit they are set
to zero.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to
C = alpha * A * B^H + conj(alpha) * B * A^H + beta * C or
C = alpha * A^H * B + conj(alpha) * B^H * A + beta * C.
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that
does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Reference: http://www.netlib.org/blas/zher2k.f
Error status for this function can be retrieved via cublasGetError().
Function cublasZsymm()
void
cublasZsymm (char side, char uplo, int m, int n,
cuDoubleComplex alpha,
const cuDoubleComplex *A,
int lda, const cuDoubleComplex *B, int ldb,
cuDoubleComplex beta, cuDoubleComplex *C,
int ldc)
performs one of the matrix-matrix operations
C = alpha * A * B + beta * C or C = alpha * B * A + beta * C,
where alpha and beta are double-precision complex scalars, A is a
symmetric matrix consisting of double-precision complex elements
and is stored in either lower or upper storage mode. B and C are m×n
matrices consisting of double-precision complex elements.
Reference: http://www.netlib.org/blas/zsymm.f
Input
side
specifies whether the symmetric matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', $C = \mathrm{alpha} \cdot A \cdot B + \mathrm{beta} \cdot C$.
If side == 'R' or 'r', $C = \mathrm{alpha} \cdot B \cdot A + \mathrm{beta} \cdot C$.
uplo
specifies whether the symmetric matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of symmetric matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of symmetric
matrix A when side == 'R' or 'r'; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to A * B or B * A.
A
double-precision complex array of dimensions (lda, ka), where ka is
m when side == 'L' or 'l' and is n otherwise. If side == 'L' or
'l', the leading m×m part of array A must contain the symmetric
matrix such that when uplo == 'U' or 'u', the leading m×m part
stores the upper triangular part of the symmetric matrix, and the
strictly lower triangular part of A is not referenced; and when uplo ==
'L' or 'l', the leading m×m part stores the lower triangular part of the
symmetric matrix, and the strictly upper triangular part is not
referenced. If side == 'R' or 'r', the leading n×n part of array A
must contain the symmetric matrix such that when uplo == 'U' or
'u', the leading n×n part stores the upper triangular part of the
symmetric matrix, and the strictly lower triangular part of A is not
referenced; and when uplo == 'L' or 'l', the leading n×n part stores
the lower triangular part of the symmetric matrix, and the strictly
upper triangular part is not referenced.
lda
leading dimension of A. When side == 'L' or 'l', lda must be at least
max(1, m); otherwise lda must be at least max(1, n).
B
double-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
double-precision complex scalar multiplier applied to C. If beta is
zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
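As a rough illustration of the storage rules in the Input table, the following host-side C sketch evaluates the side == 'L', uplo == 'U' case on column-major arrays. It is not the CUBLAS implementation. The type zcomplex (a stand-in for cuDoubleComplex), the helpers zadd and zmul, the IDX macro, and the function name ref_zsymm_left_upper are all assumptions introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for cuDoubleComplex: x = real part, y = imaginary part. */
typedef struct { double x, y; } zcomplex;

static zcomplex zadd(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x + b.x, a.y + b.y };
    return r;
}

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Column-major element (i, j), 0-based, with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for side == 'L', uplo == 'U':
 * C = alpha * A * B + beta * C, with A an m x m symmetric matrix.
 * Only the upper triangle of A is read; A(i,p) with i > p is taken
 * from A(p,i) without conjugation (symmetric, not Hermitian). */
void ref_zsymm_left_upper(int m, int n, zcomplex alpha,
                          const zcomplex *A, int lda,
                          const zcomplex *B, int ldb,
                          zcomplex beta, zcomplex *C, int ldc)
{
    int min_ld = (m > 1) ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            zcomplex sum = { 0.0, 0.0 };
            for (int p = 0; p < m; ++p) {
                zcomplex a = (i <= p) ? A[IDX(i, p, lda)] : A[IDX(p, i, lda)];
                sum = zadd(sum, zmul(a, B[IDX(p, j, ldb)]));
            }
            zcomplex out = zmul(alpha, sum);
            /* When beta is zero, C is not read, so it need not be valid input. */
            if (beta.x != 0.0 || beta.y != 0.0)
                out = zadd(out, zmul(beta, C[IDX(i, j, ldc)]));
            C[IDX(i, j, ldc)] = out;
        }
    }
}
```

Because only the upper triangle is read, the element stored at A(2,1) in a 2×2 input can hold any value without affecting the result.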
Output
C
updated according to $C = \mathrm{alpha} \cdot A \cdot B + \mathrm{beta} \cdot C$ or
$C = \mathrm{alpha} \cdot B \cdot A + \mathrm{beta} \cdot C$.
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZsyrk()
void
cublasZsyrk (char uplo, char trans, int n, int k,
             cuDoubleComplex alpha,
             const cuDoubleComplex *A, int lda,
             cuDoubleComplex beta,
             cuDoubleComplex *C, int ldc)
performs one of the symmetric rank-k operations
$C = \mathrm{alpha} \cdot A \cdot A^{T} + \mathrm{beta} \cdot C$ or $C = \mathrm{alpha} \cdot A^{T} \cdot A + \mathrm{beta} \cdot C$,
where alpha and beta are double-precision complex scalars. C is an
n×n symmetric matrix consisting of double-precision complex
elements and is stored in either lower or upper storage mode. A is a
matrix consisting of double-precision complex elements with
dimensions of n×k in the first case and k×n in the second case.
Reference: http://www.netlib.org/blas/zsyrk.f
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
$C = \mathrm{alpha} \cdot A \cdot A^{T} + \mathrm{beta} \cdot C$. If trans == 'T', 't', 'C', or 'c',
$C = \mathrm{alpha} \cdot A^{T} \cdot A + \mathrm{beta} \cdot C$.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
double-precision complex scalar multiplier applied to $A \cdot A^{T}$ or
$A^{T} \cdot A$.
A
double-precision complex array of dimensions (lda, ka), where ka is
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading n×k part of array A contains the matrix A;
otherwise, the leading k×n part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
double-precision complex scalar multiplier applied to C.
If beta is zero, C is not read.
C
double-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading n×n triangular part of the array C must contain
the upper triangular part of the symmetric matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading n×n triangular
part of the array C must contain the lower triangular part of the
symmetric matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
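To make the triangular storage rule concrete, here is a minimal host-side C sketch of the trans == 'N', uplo == 'U' case. It is only an illustration under stated assumptions, not the library's code: zcomplex stands in for cuDoubleComplex, and zadd, zmul, IDX, and ref_zsyrk_upper_notrans are hypothetical names. The sketch writes only elements with i <= j, so the strictly lower triangle of C is left untouched, and it multiplies A(i,p) by A(j,p) without conjugation because the operation is symmetric rather than Hermitian.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for cuDoubleComplex: x = real part, y = imaginary part. */
typedef struct { double x, y; } zcomplex;

static zcomplex zadd(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x + b.x, a.y + b.y };
    return r;
}

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Column-major element (i, j), 0-based, with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for uplo == 'U', trans == 'N':
 * C = alpha * A * A^T + beta * C, with A of size n x k.
 * Only the upper triangle of C (i <= j) is read and written. */
void ref_zsyrk_upper_notrans(int n, int k, zcomplex alpha,
                             const zcomplex *A, int lda,
                             zcomplex beta, zcomplex *C, int ldc)
{
    int min_ld = (n > 1) ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i <= j; ++i) {
            zcomplex sum = { 0.0, 0.0 };
            for (int p = 0; p < k; ++p)
                sum = zadd(sum, zmul(A[IDX(i, p, lda)], A[IDX(j, p, lda)]));
            zcomplex out = zmul(alpha, sum);
            /* When beta is zero, C is not read. */
            if (beta.x != 0.0 || beta.y != 0.0)
                out = zadd(out, zmul(beta, C[IDX(i, j, ldc)]));
            C[IDX(i, j, ldc)] = out;
        }
    }
}
```

For the 2×1 matrix A = (i, 1)ᵀ with alpha = 1 and beta = 0, the diagonal entry C(1,1) becomes i·i = -1, which shows the difference from a Hermitian rank-k update, where the same entry would be +1.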
Output
C
updated according to $C = \mathrm{alpha} \cdot A \cdot A^{T} + \mathrm{beta} \cdot C$ or
$C = \mathrm{alpha} \cdot A^{T} \cdot A + \mathrm{beta} \cdot C$.
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZsyr2k()
void
cublasZsyr2k (char uplo, char trans, int n, int k,
              cuDoubleComplex alpha,
              const cuDoubleComplex *A, int lda,
              const cuDoubleComplex *B, int ldb,
              cuDoubleComplex beta,
              cuDoubleComplex *C, int ldc)
performs one of the symmetric rank-2k operations
$C = \mathrm{alpha} \cdot A \cdot B^{T} + \mathrm{alpha} \cdot B \cdot A^{T} + \mathrm{beta} \cdot C$ or
$C = \mathrm{alpha} \cdot A^{T} \cdot B + \mathrm{alpha} \cdot B^{T} \cdot A + \mathrm{beta} \cdot C$,
where alpha and beta are double-precision complex scalars. C is an
n×n symmetric matrix consisting of double-precision complex
elements and is stored in either lower or upper storage mode. A and B
are matrices consisting of double-precision complex elements with
dimension of n×k in the first case and k×n in the second case.
Input
uplo
specifies whether the symmetric matrix C is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
trans
specifies the operation to be performed. If trans == 'N' or 'n',
$C = \mathrm{alpha} \cdot A \cdot B^{T} + \mathrm{alpha} \cdot B \cdot A^{T} + \mathrm{beta} \cdot C$. If trans == 'T',
't', 'C', or 'c', $C = \mathrm{alpha} \cdot A^{T} \cdot B + \mathrm{alpha} \cdot B^{T} \cdot A + \mathrm{beta} \cdot C$.
n
specifies the number of rows and the number of columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrices A
and B. If trans == 'T', 't', 'C', or 'c', n specifies the number of
columns of matrices A and B; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrices A and B. If trans == 'T', 't', 'C', or 'c', k specifies the
number of rows of matrices A and B; k must be at least zero.
alpha
double-precision complex scalar multiplier.
A
double-precision complex array of dimensions (lda, ka), where ka is k
when trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading n×k part of array A must contain the matrix A;
otherwise the leading k×n part of the array must contain the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
B
double-precision complex array of dimensions (ldb, kb), where kb = k
when trans == 'N' or 'n', and kb = n otherwise. When trans == 'N' or
'n', the leading n×k part of array B must contain the matrix B;
otherwise the leading k×n part of the array must contain the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
double-precision complex scalar multiplier applied to C. If beta is
zero, C does not have to be a valid input.
C
double-precision complex array of dimensions (ldc, n). If uplo == 'U' or
'u', the leading n×n triangular part of the array C must contain the
upper triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading n×n triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Output
C
updated according to
$C = \mathrm{alpha} \cdot A \cdot B^{T} + \mathrm{alpha} \cdot B \cdot A^{T} + \mathrm{beta} \cdot C$ or
$C = \mathrm{alpha} \cdot A^{T} \cdot B + \mathrm{alpha} \cdot B^{T} \cdot A + \mathrm{beta} \cdot C$.
Reference: http://www.netlib.org/blas/zsyr2k.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if n < 0 or k < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZtrmm()
void
cublasZtrmm (char side, char uplo, char transa,
             char diag, int m, int n,
             cuDoubleComplex alpha,
             const cuDoubleComplex *A, int lda,
             cuDoubleComplex *B, int ldb)
performs one of the matrix-matrix operations
$B = \mathrm{alpha} \cdot \mathrm{op}(A) \cdot B$ or $B = \mathrm{alpha} \cdot B \cdot \mathrm{op}(A)$,
where $\mathrm{op}(A) = A$, $\mathrm{op}(A) = A^{T}$, or $\mathrm{op}(A) = A^{H}$;
alpha is a double-precision complex scalar; B is an m×n matrix
consisting of double-precision complex elements; and A is a unit or
non-unit, upper or lower triangular matrix consisting of
double-precision complex elements.
Matrices A and B are stored in column-major format, and lda and ldb
are the leading dimensions of the two-dimensional arrays that contain
A and B, respectively.
Input
side
specifies whether op(A) multiplies B from the left or right.
If side == 'L' or 'l', $B = \mathrm{alpha} \cdot \mathrm{op}(A) \cdot B$.
If side == 'R' or 'r', $B = \mathrm{alpha} \cdot B \cdot \mathrm{op}(A)$.
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
transa
specifies op(A). If transa == 'N' or 'n', $\mathrm{op}(A) = A$.
If transa == 'T' or 't', $\mathrm{op}(A) = A^{T}$.
If transa == 'C' or 'c', $\mathrm{op}(A) = A^{H}$.
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
m
the number of rows of matrix B; m must be at least zero.
n
the number of columns of matrix B; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to op(A)*B or
B*op(A), respectively. If alpha is zero, no accesses are made to
matrix A, and no read accesses are made to matrix B.
A
double-precision complex array of dimensions (lda, k). If side ==
'L' or 'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or
'u', the leading k×k upper triangular part of the array A must contain
the upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading k×k lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
and are assumed to be unity.
lda
leading dimension of A. When side == 'L' or 'l', lda must be at least
max(1, m); otherwise lda must be at least max(1, n).
B
double-precision complex array of dimensions (ldb, n). On entry, the
leading m×n part of the array contains the matrix B. It is overwritten
with the transformed matrix on exit.
ldb
leading dimension of B; ldb must be at least max(1, m).
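The in-place update described above can be sketched on the host for one parameter combination: side == 'L', uplo == 'U', transa == 'N'. This is an assumption-laden illustration rather than the CUBLAS kernel. zcomplex stands in for cuDoubleComplex, the unit_diag flag stands in for diag == 'U', and zadd, zmul, IDX, and ref_ztrmm_left_upper are hypothetical names. Rows are processed in ascending order so that each B(i,j) is overwritten only after every product that needs its old value has been formed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for cuDoubleComplex: x = real part, y = imaginary part. */
typedef struct { double x, y; } zcomplex;

static zcomplex zadd(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x + b.x, a.y + b.y };
    return r;
}

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Column-major element (i, j), 0-based, with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for side == 'L', uplo == 'U', transa == 'N':
 * B = alpha * A * B, with A an m x m upper triangular matrix.
 * unit_diag != 0 plays the role of diag == 'U' (diagonal of A not read). */
void ref_ztrmm_left_upper(int m, int n, zcomplex alpha,
                          const zcomplex *A, int lda,
                          zcomplex *B, int ldb, int unit_diag)
{
    int min_ld = (m > 1) ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            if (alpha.x == 0.0 && alpha.y == 0.0) {
                /* alpha == 0: neither A nor the old B is read. */
                B[IDX(i, j, ldb)].x = 0.0;
                B[IDX(i, j, ldb)].y = 0.0;
                continue;
            }
            /* Diagonal term: A(i,i) is taken as 1 when unit_diag is set. */
            zcomplex sum = unit_diag
                ? B[IDX(i, j, ldb)]
                : zmul(A[IDX(i, i, lda)], B[IDX(i, j, ldb)]);
            /* Strictly upper part of row i; rows p > i still hold old values. */
            for (int p = i + 1; p < m; ++p)
                sum = zadd(sum, zmul(A[IDX(i, p, lda)], B[IDX(p, j, ldb)]));
            B[IDX(i, j, ldb)] = zmul(alpha, sum);
        }
    }
}
```

With A holding 1, 2, 3 in its upper triangle, B = (1, 1)ᵀ, and alpha = 2, the non-unit case yields (6, 6)ᵀ and the unit-diagonal case yields (6, 2)ᵀ.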
Output
B
updated according to $B = \mathrm{alpha} \cdot \mathrm{op}(A) \cdot B$ or
$B = \mathrm{alpha} \cdot B \cdot \mathrm{op}(A)$.
Reference: http://www.netlib.org/blas/ztrmm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
Function cublasZtrsm()
void
cublasZtrsm (char side, char uplo, char transa,
             char diag, int m, int n,
             cuDoubleComplex alpha,
             const cuDoubleComplex *A, int lda,
             cuDoubleComplex *B, int ldb)
solves one of the matrix equations
$\mathrm{op}(A) \cdot X = \mathrm{alpha} \cdot B$ or $X \cdot \mathrm{op}(A) = \mathrm{alpha} \cdot B$,
where $\mathrm{op}(A) = A$, $\mathrm{op}(A) = A^{T}$, or $\mathrm{op}(A) = A^{H}$;
alpha is a double-precision complex scalar, and X and B are m×n
matrices that consist of double-precision complex elements. A is a unit
or non-unit, upper or lower, triangular matrix.
The result matrix X overwrites input matrix B; that is, on exit the result
is stored in B. Matrices A and B are stored in column-major format, and
lda and ldb are the leading dimensions of the two-dimensional arrays
that contain A and B, respectively.
Input
side
specifies whether op(A) appears on the left or right of X:
side == 'L' or 'l' indicates solve $\mathrm{op}(A) \cdot X = \mathrm{alpha} \cdot B$;
side == 'R' or 'r' indicates solve $X \cdot \mathrm{op}(A) = \mathrm{alpha} \cdot B$.
uplo
specifies whether the matrix A is an upper or lower triangular matrix:
uplo == 'U' or 'u' indicates A is an upper triangular matrix;
uplo == 'L' or 'l' indicates A is a lower triangular matrix.
transa
specifies op(A). If transa == 'N' or 'n', $\mathrm{op}(A) = A$.
If transa == 'T' or 't', $\mathrm{op}(A) = A^{T}$.
If transa == 'C' or 'c', $\mathrm{op}(A) = A^{H}$.
diag
specifies whether or not A is a unit triangular matrix.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
m
specifies the number of rows of B; m must be at least zero.
n
specifies the number of columns of B; n must be at least zero.
alpha
double-precision complex scalar multiplier applied to B. When alpha
is zero, A is not referenced and B does not have to be a valid input.
A
double-precision complex array of dimensions (lda, k), where k is m
when side == 'L' or 'l' and is n when side == 'R' or 'r'. If
uplo == 'U' or 'u', the leading k×k upper triangular part of the array
A must contain the upper triangular matrix, and the strictly lower
triangular part of A is not referenced. When uplo == 'L' or 'l',
the leading k×k lower triangular part of the array A must contain the
lower triangular matrix, and the strictly upper triangular part of A is
not referenced. Note that when diag == 'U' or 'u', the diagonal
elements of A are not referenced and are assumed to be unity.
lda
leading dimension of the two-dimensional array containing A.
When side == 'L' or 'l', lda must be at least max(1, m).
When side == 'R' or 'r', lda must be at least max(1, n).
B
double-precision complex array of dimensions (ldb, n); ldb must be
at least max(1, m). The leading m×n part of the array B must contain
the right-hand side matrix B. On exit, B is overwritten by the solution
matrix X.
ldb
leading dimension of the two-dimensional array containing B; ldb
must be at least max(1, m).
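For readers who want to see how the solution X ends up in B, the following host-side C sketch performs forward substitution for side == 'L', uplo == 'L', transa == 'N'. It is a simplified illustration, not the CUBLAS algorithm. zcomplex stands in for cuDoubleComplex, unit_diag stands in for diag == 'U', and zsub, zmul, zdiv, IDX, and ref_ztrsm_left_lower are hypothetical names. The sketch does not special-case alpha == 0 and assumes a nonzero diagonal in the non-unit case.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for cuDoubleComplex: x = real part, y = imaginary part. */
typedef struct { double x, y; } zcomplex;

static zcomplex zsub(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x - b.x, a.y - b.y };
    return r;
}

static zcomplex zmul(zcomplex a, zcomplex b)
{
    zcomplex r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* a / b computed as a * conj(b) / |b|^2; b must be nonzero. */
static zcomplex zdiv(zcomplex a, zcomplex b)
{
    double d = b.x * b.x + b.y * b.y;
    zcomplex r = { (a.x * b.x + a.y * b.y) / d, (a.y * b.x - a.x * b.y) / d };
    return r;
}

/* Column-major element (i, j), 0-based, with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical host reference for side == 'L', uplo == 'L', transa == 'N':
 * solves A * X = alpha * B by forward substitution and stores X in B. */
void ref_ztrsm_left_lower(int m, int n, zcomplex alpha,
                          const zcomplex *A, int lda,
                          zcomplex *B, int ldb, int unit_diag)
{
    int min_ld = (m > 1) ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            zcomplex rhs = zmul(alpha, B[IDX(i, j, ldb)]);
            /* Rows p < i of column j already hold solved values X(p,j). */
            for (int p = 0; p < i; ++p)
                rhs = zsub(rhs, zmul(A[IDX(i, p, lda)], B[IDX(p, j, ldb)]));
            if (!unit_diag)
                rhs = zdiv(rhs, A[IDX(i, i, lda)]);
            B[IDX(i, j, ldb)] = rhs;
        }
    }
}
```

With the lower triangle of A holding 2, 1, and 2i, the right-hand side B = (2, 1 + 4i)ᵀ, and alpha = 1, the sketch leaves X = (1, 2)ᵀ in B.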
Output
B
contains the solution matrix X satisfying $\mathrm{op}(A) \cdot X = \mathrm{alpha} \cdot B$ or
$X \cdot \mathrm{op}(A) = \mathrm{alpha} \cdot B$.
Reference: http://www.netlib.org/blas/ztrsm.f
Error status for this function can be retrieved via cublasGetError().
Error Status
CUBLAS_STATUS_NOT_INITIALIZED
if CUBLAS library was not initialized
CUBLAS_STATUS_INVALID_VALUE
if m < 0 or n < 0
CUBLAS_STATUS_ARCH_MISMATCH
if function invoked on device that does not support double precision
CUBLAS_STATUS_EXECUTION_FAILED
if function failed to launch on GPU
APPENDIX A
CUBLAS Fortran Bindings
CUBLAS is implemented using the C-based CUDA toolchain and thus
provides a C-style API. This makes interfacing to applications written
in C or C++ trivial. In addition, there are many applications
implemented in Fortran that would benefit from using CUBLAS.
CUBLAS uses 1-based indexing and Fortran-style column-major
storage for multidimensional data to simplify interfacing to Fortran
applications. Unfortunately, Fortran-to-C calling conventions are not
standardized and differ by platform and toolchain. In particular,
differences may exist in the following areas:
• symbol names (capitalization, name decoration)
• argument passing (by value or reference)
• passing of string arguments (length information)
• passing of pointer arguments (size of the pointer)
• returning floating-point or compound data types (for example,
  single-precision or complex data types)
To provide maximum flexibility in addressing those differences, the
CUBLAS Fortran interface is provided in the form of wrapper
functions, which are written in C and provided in two forms:
• the thunking wrapper interface in the file fortran_thunking.c
• the direct wrapper interface in the file fortran.c
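To give a feel for the conventions such a wrapper has to match, here is a small C sketch of a routine that Fortran code could call. It is not taken from fortran.c or fortran_thunking.c. The names host_sscal and demo_sscal_ are hypothetical, and the sketch assumes a toolchain that lower-cases external names, appends a single trailing underscore, and passes every argument by reference, which is a common but not universal convention.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C stand-in for the BLAS sscal operation: x[i * incx] *= alpha. */
static void host_sscal(int n, float alpha, float *x, int incx)
{
    assert(n >= 0 && incx > 0);
    for (int i = 0; i < n; ++i)
        x[(size_t)i * (size_t)incx] *= alpha;
}

/* Hypothetical Fortran-callable wrapper. Under the assumed convention,
 * the Fortran statement "call demo_sscal (n, alpha, x, incx)" resolves
 * to the symbol demo_sscal_ and passes the address of each argument. */
void demo_sscal_(const int *n, const float *alpha, float *x, const int *incx)
{
    host_sscal(*n, *alpha, x, *incx);
}
```

Supplying the wrappers as C source, as the text describes, lets a user adjust exactly these details (symbol spelling, by-reference arguments) for a toolchain that behaves differently.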
The code of one of those two files must be compiled into an application
for it to call the CUBLAS API functions. Providing source code allows
users to make any changes necessary for a particular platform and
toolchain.
The code in those two C files has been used to demonstrate
interoperability with the compilers g77 3.2.3 and g95 0.91 on 32-bit
Linux, g77 3.4.5 and g95 0.91 on 64-bit Linux, Intel Fortran 9.0 and Intel
Fortran 10.0 on 32-bit and 64-bit Microsoft Windows XP, and g77 3.4.0
and g95 0.92 on Mac OS X.
Note that for g77, use of the compiler flag -fno-second-underscore
is required to use these wrappers as provided. Also, the use of the
default calling conventions with regard to argument and return value
passing is expected. Using the flag -fno-f2c changes the default
calling convention with respect to these two items.
The thunking wrappers allow interfacing to existing Fortran
applications without any changes to the application. During each call,
the wrappers allocate GPU memory, copy source data from CPU
memory space to GPU memory space, call CUBLAS, and finally copy
back the results to CPU memory space and deallocate the GPU
memory. As this process causes very significant call overhead, these
wrappers are intended for light testing, not for production code. To
use the thunking wrappers, the application needs to be compiled with
the file fortran_thunking.c.
The direct wrappers, intended for production code, substitute device
pointers for vector and matrix arguments in all BLAS functions. To use
these interfaces, existing applications must be modified slightly to
allocate and deallocate data structures in GPU memory space (using
CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and
CPU memory space (with CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR,
CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX). The sample
wrappers provided in fortran.c map device pointers to the
operating-system-dependent type size_t, which is 32 bits wide on
32-bit platforms and 64 bits wide on 64-bit platforms.
One approach to dealing with index arithmetic on device pointers in
Fortran code is to use C-style macros and to use the C preprocessor to
expand these, as shown below in Example A.2. On Linux and Mac OS
X, preprocessing can be done using the '-E -x f77-cpp-input'
option with the g77 compiler, or simply using the '-cpp' option with
g95 or gfortran. On Windows platforms with Microsoft Visual C/C++,
using 'cl -EP' achieves similar results.
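The index arithmetic itself can be checked in isolation. The C sketch below reuses the IDX2F macro from Example A.2 and adds a hypothetical helper, device_element_address, that treats a device pointer as a size_t integer in the way the direct wrappers are described as doing. The helper name and the sample base address are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based (Fortran-style) column-major offset of element (i, j), in elements. */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Hypothetical helper: byte address of element (i, j) of a matrix whose
 * storage starts at dev_ptr, given the leading dimension ld and the size
 * in bytes of one element. The pointer is handled as a plain integer. */
size_t device_element_address(size_t dev_ptr, int i, int j, int ld,
                              size_t elem_size)
{
    assert(i >= 1 && j >= 1 && i <= ld);
    return dev_ptr + (size_t)IDX2F(i, j, ld) * elem_size;
}
```

For a matrix with leading dimension 6, element (2,3) lies 13 elements past the start, so with 4-byte reals and an illustrative base address of 1000 the helper returns 1052.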
When traditional fixed-form Fortran 77 code is ported to CUBLAS, line
length often increases when the BLAS calls are exchanged for CUBLAS
calls. Longer function names and possible macro expansion are
contributing factors. Inadvertently exceeding the maximum line
length can lead to runtime errors that are difficult to find, so care
should be taken not to exceed the 72-column limit if fixed form is
retained.
The following two examples show a small application implemented in
Fortran 77 on the host (Example A.1., "Fortran 77 Application
Executing on the Host") and show the same application
using the non-thunking wrappers after it has been ported to use
CUBLAS (Example A.2., "Fortran 77 Application Ported to Use
CUBLAS").
The second example should be compiled with ARCH_64 defined as 1 on
64-bit operating systems and as 0 on 32-bit operating systems. For
example, on g95 or gfortran it can be done directly on the command
line by using the option '-cpp -DARCH_64=1'.
Example A.1. Fortran 77 Application Executing on the Host
      subroutine modify (m, ldm, n, p, q, alpha, beta)
      implicit none
      integer ldm, n, p, q
      real*4 m(ldm,*), alpha, beta
      external sscal
c     scale row p, columns q through n (n-q+1 elements, stride ldm)
      call sscal (n-q+1, alpha, m(p,q), ldm)
c     scale column q, rows p through ldm (ldm-p+1 elements, stride 1)
      call sscal (ldm-p+1, beta, m(p,q), 1)
      return
      end

      program matrixmod
      implicit none
      integer M, N
      parameter (M=6, N=5)
      real*4 a(M,N)
      integer i, j
      do j = 1, N
        do i = 1, M
          a(i,j) = (i-1) * M + j
        enddo
      enddo
      call modify (a, M, N, 2, 3, 16.0, 12.0)
      do j = 1, N
        do i = 1, M
          write(*,"(F7.0$)") a(i,j)
        enddo
        write (*,*) ""
      enddo
      stop
      end
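The effect of the modify routine in Example A.1 can be reproduced in a few lines of host-side C, which may help when checking results from a ported version. This sketch is an illustration only. The names host_sscal and modify_host are hypothetical, and the row segment is scaled over n-q+1 elements (columns q through n), which is the count that keeps every access inside an array with n columns.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based (Fortran-style) column-major offset of element (i, j). */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Plain C stand-in for the BLAS sscal operation: x[i * incx] *= alpha. */
static void host_sscal(int n, float alpha, float *x, int incx)
{
    for (int i = 0; i < n; ++i)
        x[(size_t)i * (size_t)incx] *= alpha;
}

/* Hypothetical host version of modify: scales row p from column q to
 * column n by alpha, then column q from row p to row ldm by beta.
 * Element (p, q) is covered by both calls and is scaled by alpha * beta. */
void modify_host(float *m, int ldm, int n, int p, int q,
                 float alpha, float beta)
{
    assert(p >= 1 && p <= ldm && q >= 1 && q <= n);
    host_sscal(n - q + 1, alpha, &m[IDX2F(p, q, ldm)], ldm);
    host_sscal(ldm - p + 1, beta, &m[IDX2F(p, q, ldm)], 1);
}
```

With the 6×5 matrix from the example, a(i,j) = (i-1)*6 + j, and the call modify_host(a, 6, 5, 2, 3, 16.0f, 12.0f), element (2,3) changes from 9 to 1728, element (2,4) from 10 to 160, and element (3,3) from 15 to 180, while elements outside row 2 and column 3 are unchanged.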
Example A.2. Fortran 77 Application Ported to Use CUBLAS

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

      subroutine modify (devPtrM, ldm, n, p, q, alpha, beta)
      implicit none
      integer sizeof_real
      parameter (sizeof_real=4)
      integer ldm, n, p, q
#if ARCH_64
      integer*8 devPtrM
#else
      integer devPtrM
#endif
      real*4 alpha, beta
c     scale row p, columns q through n (n-q+1 elements, stride ldm)
      call cublas_sscal (n-q+1, alpha,
     1                   devPtrM+IDX2F(p,q,ldm)*sizeof_real,
     2                   ldm)
c     scale column q, rows p through ldm (ldm-p+1 elements, stride 1)
      call cublas_sscal (ldm-p+1, beta,
     1                   devPtrM+IDX2F(p,q,ldm)*sizeof_real,
     2                   1)
      return
      end

      program matrixmod
      implicit none
      integer M, N, sizeof_real
#if ARCH_64
      integer*8 devPtrA
#else
      integer devPtrA
#endif
      parameter (M=6, N=5, sizeof_real=4)
      real*4 a(M,N)
      integer i, j, stat
      external cublas_init, cublas_set_matrix, cublas_get_matrix
      external cublas_shutdown, cublas_alloc
      integer cublas_alloc, cublas_set_matrix, cublas_get_matrix
      do j = 1, N
        do i = 1, M
          a(i,j) = (i-1) * M + j
        enddo
      enddo

      call cublas_init
      stat = cublas_alloc(M*N, sizeof_real, devPtrA)
      if (stat .NE. 0) then
        write(*,*) "device memory allocation failed"
        call cublas_shutdown
        stop
      endif
      stat = cublas_set_matrix (M, N, sizeof_real, a, M, devPtrA, M)
      if (stat .NE. 0) then
        call cublas_free (devPtrA)
        write(*,*) "data download failed"
        call cublas_shutdown
        stop
      endif
      call modify (devPtrA, M, N, 2, 3, 16.0, 12.0)
      stat = cublas_get_matrix (M, N, sizeof_real, devPtrA, M, a, M)
      if (stat .NE. 0) then
        call cublas_free (devPtrA)
        write(*,*) "data upload failed"
        call cublas_shutdown
        stop
      endif
      call cublas_free (devPtrA)
      call cublas_shutdown
      do j = 1, N
        do i = 1, M
          write(*,"(F7.0$)") a(i,j)
        enddo
        write (*,*) ""
      enddo
      stop
      end