Cublas Library

CUDA
CUBLAS Li br ar y
PG- 05326- 032_V02
August , 2010
NVI DI A Corporat ion
CUBLAS Library PG-05326-032_V02
Publishedby
NVIDIACorporation
2701SanTomasExpressway
SantaClara,CA95050
Notice
ALLNVIDIADESIGNSPECIFICATIONS,REFERENCEBOARDS,FILES,DRAWINGS,DIAGNOSTICS,
LISTS,ANDOTHERDOCUMENTS(TOGETHERANDSEPARATELY,MATERIALS)AREBEING
PROVIDEDASIS.NVIDIAMAKESNOWARRANTIES,EXPRESSED,IMPLIED,STATUTORY,OR
OTHERWISEWITHRESPECTTOTHEMATERIALS,ANDEXPRESSLYDISCLAIMSALLIMPLIED
WARRANTIESOFNONINFRINGEMENT,MERCHANTABILITY,ANDFITNESSFORAPARTICULAR
PURPOSE.
Informationfurnishedisbelievedtobeaccurateandreliable.However,NVIDIACorporationassumesno
responsibilityfortheconsequencesofuseofsuchinformationorforanyinfringementofpatentsorother
rightsofthirdpartiesthatmayresultfromitsuse.Nolicenseisgrantedbyimplicationorotherwiseunder
anypatentorpatentrightsofNVIDIACorporation.Specificationsmentionedinthispublicationare
subjecttochangewithoutnotice.Thispublicationsupersedesandreplacesallinformationpreviously
supplied.NVIDIACorporationproductsarenotauthorizedforuseascriticalcomponentsinlifesupport
devicesorsystemswithoutexpresswrittenapprovalofNVIDIACorporation.
Trademarks
NVIDIA,CUDA,andtheNVIDIAlogoaretrademarksorregisteredtrademarksofNVIDIACorporation
intheUnitedStatesandothercountries.Othercompanyandproductnamesmaybetrademarksofthe
respectivecompanieswithwhichtheyareassociated.
Copyright
20052010byNVIDIACorporation.Allrightsreserved.
PortionsoftheSGEMM,DGEMM,CGEMM,andZGEMMlibraryroutineswerewrittenbyVasilyVolkov
andaresubjecttotheModifiedBerkeleySoftwareDistributionLicenseasfollows:
Copyright(c)20072009,RegentsoftheUniversityofCalifornia
Allrightsreserved.
Redistributionanduseinsourceandbinaryforms,withorwithoutmodification,arepermittedprovided
thatthefollowingconditionsaremet:
Redistributionsofsourcecodemustretaintheabovecopyrightnotice,thislistofconditionsandthe
followingdisclaimer.
Redistributionsinbinaryformmustreproducetheabovecopyrightnotice,thislistofconditionsandthe
followingdisclaimerinthedocumentationand/orothermaterialsprovidedwiththedistribution.
NeitherthenameoftheUniversityofCalifornia,Berkeleynorthenamesofitscontributorsmaybeused
toendorseorpromoteproductsderivedfromthissoftwarewithoutspecificpriorwrittenpermission.
THISSOFTWAREISPROVIDEDBYTHEAUTHORASISANDANYEXPRESSORIMPLIED
WARRANTIES,INCLUDING,BUTNOTLIMITEDTO,THEIMPLIEDWARRANTIESOF
MERCHANTABILITYANDFITNESSFORAPARTICULARPURPOSEAREDISCLAIMED.INNO
EVENTSHALLTHEAUTHORBELIABLEFORANYDIRECT,INDIRECT,INCIDENTAL,SPECIAL,
EXEMPLARY,ORCONSEQUENTIALDAMAGES(INCLUDING,BUTNOTLIMITEDTO,
PROCUREMENTOFSUBSTITUTEGOODSORSERVICES;LOSSOFUSE,DATA,ORPROFITS;OR
BUSINESSINTERRUPTION)HOWEVERCAUSEDANDONANYTHEORYOFLIABILITY,WHETHER
INCONTRACT,STRICTLIABILITY,ORTORT(INCLUDINGNEGLIGENCEOROTHERWISE)ARISING
INANYWAYOUTOFTHEUSEOFTHISSOFTWARE,EVENIFADVISEDOFTHEPOSSIBILITYOF
SUCHDAMAGE.
PortionsoftheSGEMM,DGEMMandZGEMMlibraryroutineswerewrittenbyDavideBarbieriandare
subjecttotheModifiedBerkeleySoftwareDistributionLicenseasfollows:
Copyright(c)20082009DavideBarbieri@UniversityofRomeTorVergata.
Allrightsreserved.
followingdisclaimerinthedocumentationand/orothermaterialsprovidedwiththedistribution.
Thenameoftheauthormaynotbeusedtoendorseorpromoteproductsderivedfromthissoftware
withoutspecificpriorwrittenpermission.
THISSOFTWAREISPROVIDEDBYTHEAUTHORASISANDANYEXPRESSORIMPLIED
WARRANTIES,INCLUDING,BUTNOTLIMITEDTO,THEIMPLIEDWARRANTIESOF
MERCHANTABILITYANDFITNESSFORAPARTICULARPURPOSEAREDISCLAIMED.INNO
EVENTSHALLTHEAUTHORBELIABLEFORANYDIRECT,INDIRECT,INCIDENTAL,SPECIAL,
EXEMPLARY,ORCONSEQUENTIALDAMAGES(INCLUDING,BUTNOTLIMITEDTO,
PROCUREMENTOFSUBSTITUTEGOODSORSERVICES;LOSSOFUSE,DATA,ORPROFITS;OR
BUSINESSINTERRUPTION)HOWEVERCAUSEDANDONANYTHEORYOFLIABILITY,WHETHER
INCONTRACT,STRICTLIABILITY,ORTORT(INCLUDINGNEGLIGENCEOROTHERWISE)ARISING
INANYWAYOUTOFTHEUSEOFTHISSOFTWARE,EVENIFADVISEDOFTHEPOSSIBILITYOF
SUCHDAMAGE.
PortionsoftheDGEMMandSGEMMlibraryroutinesoptimizedfortheFermiarchitecturewere
developedbytheUniversityofTennessee.Subsequently,severalotherroutinesthatareoptimizedforthe
FermiarchitecturehavebeenderivedfromtheseinitialDGEMMandSGEMMimplementations.Those
portionsofthesourcecodearethussubjecttotheModifiedBerkeleySoftwareDistributionLicenseas
follows:
Copyright(c)2010TheUniversityofTennessee.Allrightsreserved.
followingdisclaimerlistedinthislicenseinthedocumentationand/orothermaterialsprovidedwith
thedistribution.
Neitherthenameofthecopyrightholdersnorthenamesofitscontributorsmaybeusedtoendorseor
promoteproductsderivedfromthissoftwarewithoutspecificpriorwrittenpermission.
THISSOFTWAREISPROVIDEDBYTHECOPYRIGHTHOLDERSANDCONTRIBUTORSASISAND
ANYEXPRESSORIMPLIEDWARRANTIES,INCLUDING,BUTNOTLIMITEDTO,THEIMPLIED
WARRANTIESOFMERCHANTABILITYANDFITNESSFORAPARTICULARPURPOSEARE
DISCLAIMED.INNOEVENTSHALLTHECOPYRIGHTOWNERORCONTRIBUTORSBELIABLEFOR
ANYDIRECT,INDIRECT,INCIDENTAL,SPECIAL,EXEMPLARY,ORCONSEQUENTIALDAMAGES
(INCLUDING,BUTNOTLIMITEDTO,PROCUREMENTOFSUBSTITUTEGOODSORSERVICES;LOSS
OFUSE,DATA,ORPROFITS;ORBUSINESSINTERRUPTION)HOWEVERCAUSEDANDONANY
THEORYOFLIABILITY,WHETHERINCONTRACT,STRICTLIABILITY,ORTORT(INCLUDING
NEGLIGENCEOROTHERWISE)ARISINGINANYWAYOUTOFTHEUSEOFTHISSOFTWARE,
EVENIFADVISEDOFTHEPOSSIBILITYOFSUCHDAMAGE.
PG-05326-032_V02 v
NVI DI A
Tabl e of Cont ent s
1. The CUBLAS Li br ar y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11
CUBLAS Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Type cublasSt at us. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
CUBLAS Helper Funct ions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Funct ion cublasI nit () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Funct ion cublasShut down() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Funct ion cublasGet Error() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Funct ion cublasAlloc() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Funct ion cublasFree() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Funct ion cublasSet Vect or() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Funct ion cublasGet Vect or() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Funct ion cublasSet Mat rix() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Funct ion cublasGet Mat rix() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Funct ion cublasSet KernelSt ream() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Funct ion cublasSet Vect orAsync() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Funct ion cublasGet Vect orAsync() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Funct ion cublasSet Mat rixAsync() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Funct ion cublasGet Mat rixAsync() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2. BLAS1 Funct i ons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27
Single-Precision BLAS1 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Funct ion cublasI samax() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Funct ion cublasI samin() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Funct ion cublasSasum() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Funct ion cublasSaxpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Funct ion cublasScopy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Funct ion cublasSdot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Funct ion cublasSnrm2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Funct ion cublasSrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Funct ion cublasSrot g() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Funct ion cublasSrot m() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Funct ion cublasSrot mg() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Funct ion cublasSscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Funct ion cublasSswap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Single-Precision Complex BLAS1 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
vi PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCaxpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .42
Funct ion cublasCcopy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43
Funct ion cublasCdot c() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44
Funct ion cublasCdot u() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45
Funct ion cublasCrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46
Funct ion cublasCrot g() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47
Funct ion cublasCscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48
Funct ion cublasCsrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48
Funct ion cublasCsscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49
Funct ion cublasCswap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
Funct ion cublasI camax() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
Funct ion cublasI camin() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52
Funct ion cublasScasum() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52
Funct ion cublasScnrm2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53
Double-Precision BLAS1 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
Funct ion cublasI damax() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56
Funct ion cublasI damin() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56
Funct ion cublasDasum() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57
Funct ion cublasDaxpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58
Funct ion cublasDcopy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59
Funct ion cublasDdot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60
Funct ion cublasDnrm2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61
Funct ion cublasDrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62
Funct ion cublasDrot g() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .63
Funct ion cublasDrot m() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64
Funct ion cublasDrot mg() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .65
Funct ion cublasDscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .66
Funct ion cublasDswap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67
Double-Precision Complex BLAS1 funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .69
Funct ion cublasDzasum() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .70
Funct ion cublasDznrm2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Funct ion cublasI zamax() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71
Funct ion cublasI zamin() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72
Funct ion cublasZaxpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73
Funct ion cublasZcopy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .74
Funct ion cublasZdot c() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75
Funct ion cublasZdot u() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76
Funct ion cublasZdrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77
Funct ion cublasZdscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78
Funct ion cublasZrot () . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79
Funct ion cublasZrot g() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80
Funct ion cublasZscal() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80
Funct ion cublasZswap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81
PG-05326-032_V02 vi i
NVI DI A
CUDA CUBLAS Library
3. Si ngl e-Pr eci si on BLAS2 Funct i ons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Single-Precision BLAS2 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Funct ion cublasSgbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Funct ion cublasSgemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Funct ion cublasSger() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Funct ion cublasSsbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Funct ion cublasSspmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Funct ion cublasSspr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Funct ion cublasSspr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Funct ion cublasSsymv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Funct ion cublasSsyr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Funct ion cublasSsyr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Funct ion cublasSt bmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Funct ion cublasSt bsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Funct ion cublasSt pmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Funct ion cublasSt psv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Funct ion cublasSt rmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Funct ion cublasSt rsv(). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Single-Precision Complex BLAS2 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Funct ion cublasCgbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Funct ion cublasCgemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Funct ion cublasCgerc() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Funct ion cublasCgeru() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Funct ion cublasChbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Funct ion cublasChemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Funct ion cublasCher() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Funct ion cublasCher2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Funct ion cublasChpmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Funct ion cublasChpr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Funct ion cublasChpr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Funct ion cublasCt bmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Funct ion cublasCt bsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Funct ion cublasCt pmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Funct ion cublasCt psv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Funct ion cublasCt rmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Funct ion cublasCt rsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4. Doubl e-Pr eci si on BLAS2 Funct i ons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Double-Precision BLAS2 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Funct ion cublasDgbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Funct ion cublasDgemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Funct ion cublasDger() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Funct ion cublasDsbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Funct ion cublasDspmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
vi i i PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDspr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Funct ion cublasDspr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Funct ion cublasDsymv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Funct ion cublasDsyr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Funct ion cublasDsyr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Funct ion cublasDt bmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Funct ion cublasDt bsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Funct ion cublasDt pmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Funct ion cublasDt psv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Funct ion cublasDt rmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Funct ion cublasDt rsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Double-Precision Complex BLAS2 funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Funct ion cublasZgbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Funct ion cublasZgemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Funct ion cublasZgerc() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Funct ion cublasZgeru() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Funct ion cublasZhbmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Funct ion cublasZhemv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Funct ion cublasZher() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Funct ion cublasZher2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Funct ion cublasZhpmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Funct ion cublasZhpr() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Funct ion cublasZhpr2() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Funct ion cublasZt bmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Funct ion cublasZt bsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Funct ion cublasZt pmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Funct ion cublasZt psv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Funct ion cublasZt rmv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Funct ion cublasZt rsv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
5. BLAS3 Funct i ons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Single-Precision BLAS3 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Funct ion cublasSgemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Funct ion cublasSsymm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Funct ion cublasSsyrk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Funct ion cublasSsyr2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Funct ion cublasSt rmm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Funct ion cublasSt rsm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Single-Precision Complex BLAS3 Funct ions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Funct ion cublasCgemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Funct ion cublasChemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Funct ion cublasCherk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Funct ion cublasCher2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Funct ion cublasCsymm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
PG-05326-032_V02 i x
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCsyrk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Funct ion cublasCsyr2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Funct ion cublasCt rmm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Funct ion cublasCt rsm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Double-Precision BLAS3 Funct ions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Funct ion cublasDgemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Funct ion cublasDsymm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Funct ion cublasDsyrk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Funct ion cublasDsyr2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Funct ion cublasDt rmm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Funct ion cublasDt rsm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Double-Precision Complex BLAS3 Funct ions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Funct ion cublasZgemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Funct ion cublasZhemm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Funct ion cublasZherk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Funct ion cublasZher2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Funct ion cublasZsymm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Funct ion cublasZsyrk() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Funct ion cublasZsyr2k() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Funct ion cublasZt rmm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Funct ion cublasZt rsm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
A. CUBLAS For t r an Bi ndi ngs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
PG- 05326- 032_V02 11
NVI DI A
CH A P T E R
1
The CUBLAS Li br ar y
CUBLASisanimplementationofBLAS(BasicLinearAlgebra
Subprograms)ontopoftheNVIDIA
CUDA
runtime.Itallows
accesstothecomputationalresourcesofNVIDIAGPUs.Thelibraryis
selfcontainedattheAPIlevel,thatis,nodirectinteractionwiththe
CUDAdriverisnecessary.CUBLASattachestoasingleGPUanddoes
notautoparallelizeacrossmultipleGPUs.
ThebasicmodelbywhichapplicationsusetheCUBLASlibraryisto
creatematrixandvectorobjectsinGPUmemoryspace,fillthemwith
data,callasequenceofCUBLASfunctions,and,finally,uploadthe
resultsfromGPUmemoryspacebacktothehost.Toaccomplishthis,
CUBLASprovideshelperfunctionsforcreatinganddestroyingobjects
inGPUspace,andforwritingdatatoandretrievingdatafromthese
objects.
FormaximumcompatibilitywithexistingFortranenvironments,
CUBLASusescolumnmajorstorageand1basedindexing.SinceC
andC++userowmajorstorage,applicationscannotusethenative
arraysemanticsfortwodimensionalarrays.Instead,macrosorinline
functionsshouldbedefinedtoimplementmatricesontopofone
dimensionalarrays.ForFortrancodeportedtoCinmechanical
fashion,onemaychosetoretain1basedindexingtoavoidtheneedto
12 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
transformloops.Inthiscase,thearrayindexofamatrixelementin
rowiandcolumnjcanbecomputedviathefollowingmacro:
Here,ldreferstotheleadingdimensionofthematrixasallocated,
whichinthecaseofcolumnmajorstorageisthenumberofrows.For
nativelywrittenCandC++code,onewouldmostlikelychose0based
indexing,inwhichcasetheindexingmacrobecomes
Pleaserefertothecodeexamplesattheendofthissection,which
showatinyapplicationimplementedinFortranonthehost
(Example 1.Fortran77ApplicationExecutingontheHost)and
showversionsoftheapplicationwritteninCusingCUBLASforthe
indexingstylesdescribedabove(Example 2.ApplicationUsingCand
CUBLAS:1basedIndexingandExample 3.ApplicationUsingC
andCUBLAS:0basedIndexing).
BecausetheCUBLAScorefunctions(asopposedtothehelper
functions)donotreturnerrorstatusdirectly(forreasonsof
compatibilitywithexistingBLASlibraries),CUBLASprovidesa
separatefunctiontoaidindebuggingthatretrievesthelastrecorded
error.
TheinterfacetotheCUBLASlibraryistheheaderfilecublas.h.
ApplicationsusingCUBLASneedtolinkagainsttheDSOcublas.so
(Linux),theDLLcublas.dll(Windows),orthedynamiclibrary
cublas.dylib(MacOSX)whenbuildingforthedevice,andagainst
theDSOcublasemu.so(Linux),theDLLcublasemu.dll(Windows),
orthedynamiclibrarycublasemu.dylib(MacOSX)whenbuilding
fordeviceemulation.
Followingthesethreeexamples,theremainderofthischapter
discussesCUBLASTypesonpage 18andCUBLASHelper
Functionsonpage 19.
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
PG- 05326- 032_V02 13
NVI DI A
CHAPTER 1 The CUBLAS Library
Example 1. Fort ran 77 Applicat ion Execut ing on t he Host

subroutine modify (m, ldm, n, p, q, alpha, beta)
implicit none
integer ldm, n, p, q
real*4 m(ldm,*), alpha, beta
external sscal
call sscal (n-p+1, alpha, m(p,q), ldm)
call sscal (ldm-p+1, beta, m(p,q), 1)
return
end

program matrixmod
implicit none
integer M, N
parameter (M=6, N=5)
real*4 a(M,N)
integer i, j
do j = 1, N
do i = 1, M
a(i,j) = (i-1) * M + j
enddo
enddo
call modify (a, M, N, 2, 3, 16.0, 12.0)
do j = 1, N
do i = 1, M
write(*,"(F7.0$)") a(i,j)
enddo
write (*,*) ""
enddo
stop
end
14 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Example 2. Applicat ion Using C and CUBLAS: 1- based I ndexing

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"
void modify (float *m, int ldm, int n, int p, int q, float alpha,
float beta)
{
cublasSscal (n-p+1, alpha, &m[IDX2F(p,q,ldm)], ldm);
cublasSscal (ldm-p+1, beta, &m[IDX2F(p,q,ldm)], 1);
}
#define M 6
#define N 5
int main (void)
{
int i, j;
cublasStatus stat;
float* devPtrA;
float* a = 0;
a = (float *)malloc (M * N * sizeof (*a));
if (!a) {
printf ("host memory allocation failed");
return EXIT_FAILURE;
}
for (j = 1; j <= N; j++) {
for (i = 1; i <= M; i++) {
a[IDX2F(i,j,M)] = (i-1) * M + j;
}
}
cublasInit();
stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
PG- 05326- 032_V02 15
NVI DI A
if (stat != CUBLAS_STATUS_SUCCESS) {
printf ("device memory allocation failed");
cublasShutdown();
}
stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
printf ("data download failed");
cublasFree (devPtrA);
cublasShutdown();
}
modify (devPtrA, M, N, 2, 3, 16.0f, 12.0f);
stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
printf ("data upload failed");
cublasShutdown();
}
cublasShutdown();
for (j = 1; j <= N; j++) {
for (i = 1; i <= M; i++) {
printf ("%7.0f", a[IDX2F(i,j,M)]);
}
printf ("\n");
}
return EXIT_SUCCESS;
}
Example 2. Applicat ion Using C and CUBLAS: 1- based I ndexing ( cont inued)
16 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Example 3. Applicat ion Using C and CUBLAS: 0- based I ndexing
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "cublas.h"
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
void modify (float *m, int ldm, int n, int p, int q, float alpha,
float beta)
{
cublasSscal (n-p, alpha, &m[IDX2C(p,q,ldm)], ldm);
cublasSscal (ldm-p, beta, &m[IDX2C(p,q,ldm)], 1);
}
#define M 6
#define N 5
int main (void)
{
int i, j;
cublasStatus stat;
float* devPtrA;
float* a = 0;
a = (float *)malloc (M * N * sizeof (*a));
if (!a) {
printf ("host memory allocation failed");
}
for (j = 0; j < N; j++) {
for (i = 0; i < M; i++) {
a[IDX2C(i,j,M)] = i * M + j + 1;
}
}
cublasInit();
stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
PG- 05326- 032_V02 17
NVI DI A
printf ("device memory allocation failed");
cublasShutdown();
}
stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
printf ("data download failed");
cublasShutdown();
}
modify (devPtrA, M, N, 1, 2, 16.0f, 12.0f);
stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
printf ("data upload failed");
cublasShutdown();
}
cublasShutdown();
for (j = 0; j < N; j++) {
for (i = 0; i < M; i++) {
printf ("%7.0f", a[IDX2C(i,j,M)]);
}
printf ("\n");
}
return EXIT_SUCCESS;
}
Example 3. Applicat ion Using C and CUBLAS: 0- based I ndexing ( cont inued)
18 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
CUBLAS Types
TheonlyCUBLAStypeiscublasStatus.
Type cublasSt at us
ThetypecublasStatusisusedforfunctionstatusreturns.CUBLAS
helperfunctionsreturnstatusdirectly,whilethestatusofCUBLAS
corefunctionscanberetrievedviacublasGetError().Currently,the
followingvaluesaredefined:
cublasSt at us Values
CUBLAS_STATUS_SUCCESS
operation completed successfully
CUBLAS_STATUS_NOT_INITIALIZED
CUBLAS library not initialized
CUBLAS_STATUS_ALLOC_FAILED
resource allocation failed
CUBLAS_STATUS_INVALID_VALUE
unsupported numerical value was
passed to function
CUBLAS_STATUS_ARCH_MISMATCH
function requires an architectural
feature absent from the architecture of
the device
CUBLAS_STATUS_MAPPING_ERROR
access to GPU memory space failed
CUBLAS_STATUS_EXECUTION_FAILED
GPU program failed to execute
CUBLAS_STATUS_INTERNAL_ERROR
an internal CUBLAS operation failed
PG- 05326- 032_V02 19
NVI DI A
CUBLAS Helper Funct ions
ThefollowingaretheCUBLAShelperfunctions:
FunctioncublasInit()onpage 19
FunctioncublasShutdown()onpage 20
FunctioncublasGetError()onpage 20
FunctioncublasAlloc()onpage 20
FunctioncublasFree()onpage 21
FunctioncublasSetVector()onpage 21
FunctioncublasGetVector()onpage 22
FunctioncublasSetMatrix()onpage 23
FunctioncublasGetMatrix()onpage 23
FunctioncublasSetKernelStream()onpage 24
FunctioncublasSetVectorAsync()onpage 24
FunctioncublasGetVectorAsync()onpage 25
FunctioncublasSetMatrixAsync()onpage 25
FunctioncublasGetMatrixAsync()onpage 26
Funct ion cublasI nit ( )
cublasStatus
cublasInit (void)
initializestheCUBLASlibraryandmustbecalledbeforeanyother
CUBLASAPIfunctionisinvoked.Itallocateshardwareresources
necessaryforaccessingtheGPU.ItattachesCUBLAStowhatever
GPUiscurrentlyboundtothehostthreadfromwhichitwasinvoked.
Ret urn Values
if resources could not be allocated
if CUBLAS library initialized successfully
20 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasShut down( )
cublasStatus
cublasShutdown (void)
releasesCPUsideresourcesusedbytheCUBLASlibrary.Therelease
ofGPUsideresourcesmaybedeferreduntiltheapplicationshuts
down.
Funct ion cublasGet Error( )
cublasStatus
cublasGetError (void)
returnsthelasterrorthatoccurredoninvocationofanyofthe
CUBLAScorefunctions.WhiletheCUBLAShelperfunctionsreturn
statusdirectly,theCUBLAScorefunctionsdonot,improving
compatibilitywiththoseexistingenvironmentsthatdonotexpect
BLASfunctionstoreturnstatus.Readingtheerrorstatusvia
cublasGetError()resetstheinternalerrorstateto
CUBLAS_STATUS_SUCCESS.
Funct ion cublasAlloc( )
cublasStatus
cublasAlloc (int n, int elemSize, void **devicePtr)
createsanobjectinGPUmemoryspacecapableofholdinganarrayof
nelements,whereeachelementrequireselemSizebytesofstorage.If
thefunctioncallissuccessful,apointertotheobjectinGPUmemory
spaceisplacedindevicePtr.Notethatthisisadevicepointerthat
cannotbedereferencedinhostcode.FunctioncublasAlloc()isa
wrapperaroundcudaMalloc().Devicepointersreturnedby
Ret urn Values
if CUBLAS library was not initialized
CUBLAS library shut down successfully
PG- 05326- 032_V02 21
NVI DI A
cublasAlloc()canthereforebepassedtoanyCUDAdevicekernels,
notjustCUBLASfunctions.
Funct ion cublasFree( )
cublasStatus
cublasFree (const void *devicePtr)
destroystheobjectinGPUmemoryspacereferencedbydevicePtr.
Funct ion cublasSet Vect or( )
cublasStatus
cublasSetVector (int n, int elemSize, const void *x,
int incx, void *y, int incy)
copiesnelementsfromavectorxinCPUmemoryspacetoavectory
inGPUmemoryspace.Elementsinbothvectorsareassumedtohavea
sizeofelemSizebytes.Storagespacingbetweenconsecutiveelements
isincxforthesourcevectorxandincyforthedestinationvectory.In
general,ypointstoanobject,orpartofanobject,allocatedvia
cublasAlloc().Columnmajorformatfortwodimensionalmatrices
isassumedthroughoutCUBLAS.Ifthevectorispartofamatrix,a
vectorincrementequalto1accessesa(partial)columnofthematrix.
Ret urn Values
if n <= 0 or elemSize <= 0
if the object could not be allocated
due to lack of resources.
if storage was successfully allocated
Ret urn Values
CUBLAS_STATUS_INTERNAL_ERROR
if the object could not be deallocated
if object was deallocated successfully
22 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Similarly,usinganincrementequaltotheleadingdimensionofthe
matrixaccessesa(partial)row.
Funct ion cublasGet Vect or( )
cublasStatus
cublasGetVector (int n, int elemSize, const void *x,
int incx, void *y, int incy)
copiesnelementsfromavectorxinGPUmemoryspacetoavectory
inCPUmemoryspace.Elementsinbothvectorsareassumedtohavea
sizeofelemSizebytes.Storagespacingbetweenconsecutiveelements
isincxforthesourcevectorxandincyforthedestinationvectory.In
general,xpointstoanobject,orpartofanobject,allocatedvia
cublasAlloc().Columnmajorformatfortwodimensionalmatrices
isassumedthroughoutCUBLAS.Ifthevectorispartofamatrix,a
vectorincrementequalto1accessesa(partial)columnofthematrix.
Similarly,usinganincrementequaltotheleadingdimensionofthe
matrixaccessesa(partial)row.
Ret urn Values
if incx, incy, or elemSize <= 0
if error accessing GPU memory
if operation completed successfully
Ret urn Values
PG- 05326- 032_V02 23
NVI DI A
Funct ion cublasSet Mat rix( )
cublasStatus
cublasSetMatrix (int rows, int cols, int elemSize,
const void *A, int lda, void *B,
int ldb)
copiesatileofrowscolselementsfromamatrixAinCPUmemory
spacetoamatrixBinGPUmemoryspace.Eachelementrequires
storageofelemSizebytes.Bothmatricesareassumedtobestoredin
columnmajorformat,withtheleadingdimension(thatis,thenumber
ofrows)ofsourcematrixAprovidedinlda,andtheleading
dimensionofdestinationmatrixBprovidedinldb.Bisadevice
pointerthatpointstoanobject,orpartofanobject,thatwasallocated
inGPUmemoryspaceviacublasAlloc().
Funct ion cublasGet Mat rix( )
cublasStatus
cublasGetMatrix (int rows, int cols, int elemSize,
int ldb)
copiesatileofrowscolselementsfromamatrixAinGPUmemory
spacetoamatrixBinCPUmemoryspace.Eachelementrequires
storageofelemSizebytes.Bothmatricesareassumedtobestoredin
columnmajorformat,withtheleadingdimension(thatis,thenumber
ofrows)ofsourcematrixAprovidedinlda,andtheleading
dimensionofdestinationmatrixBprovidedinldb.Aisadevice
Ret urn Values
if rows or cols < 0; or elemSize,
lda, or ldb <= 0
24 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
pointerthatpointstoanobject,orpartofanobject,thatwasallocated
inGPUmemoryspaceviacublasAlloc().
Funct ion cublasSet KernelSt ream( )
cublasStatus
cublasSetKernelStream (cudaStream_t stream)
setstheCUBLASstreaminwhichallsubsequentCUBLASkernel
launcheswillrun.
Bydefault,iftheCUBLASstreamisnotset,allkernelsusetheNULL
stream.Thisroutinecanbeusedtochangethestreambetweenkernel
launchesandcanbeusedalsotosettheCUBLASstreambacktoNULL.
Funct ion cublasSet Vect orAsync( )
cublasStatus
cublasSetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy,
cudaStream_t stream);
hasthesamefunctionalityascublasSetVector(),butthetransferis
doneasynchronouslywithintheCUDAstreampassedinvia
parameter.
Ret urn Values
lda, or ldb <= 0
Ret urn Values
if stream set successfully
Ret urn Values
PG- 05326- 032_V02 25
NVI DI A
Funct ion cublasGet Vect orAsync( )
cublasStatus
cublasGetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy,
cudaStream_t stream)
hasthesamefunctionalityascublasGetVector(),butthetransferis
parameter.
Funct ion cublasSet Mat rixAsync( )
cublasStatus
cublasSetMatrixAsync (int rows, int cols, int elemSize,
int ldb, cudaStream_t stream)
hasthesamefunctionalityascublasSetMatrix(),butthetransferis
parameter.
Ret urn Values ( cont inued)
Ret urn Values
Ret urn Values
lda, or ldb <= 0
26 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasGet Mat rixAsync( )
cublasStatus
cublasGetMatrixAsync (int rows, int cols, int elemSize,
int ldb, cudaStream_t stream)
hasthesamefunctionalityascublasGetMatrix(),butthetransferis
parameter.
Ret urn Values
if rows, cols, elemSize, lda, or
ldb <= 0
PG- 05326- 032_V02 27
NVI DI A
CH A P T E R
2
BLAS1 Funct i ons
Level1BasicLinearAlgebraSubprograms(BLAS1)arefunctionsthat
performscalar,vector,andvectorvectoroperations.TheCUBLAS
BLAS1implementationisdescribedinthesesections:
SinglePrecisionBLAS1Functionsonpage 28
SinglePrecisionComplexBLAS1Functionsonpage 41
DoublePrecisionBLAS1Functionsonpage 55
DoublePrecisionComplexBLAS1functionsonpage 69
28 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Single- Precision BLAS1 Funct ions
ThesingleprecisionBLAS1functionsareasfollows:
FunctioncublasIsamax()onpage 29
FunctioncublasIsamin()onpage 29
FunctioncublasSasum()onpage 30
FunctioncublasSaxpy()onpage 31
FunctioncublasScopy()onpage 32
FunctioncublasSdot()onpage 33
FunctioncublasSnrm2()onpage 34
FunctioncublasSrot()onpage 34
FunctioncublasSrotg()onpage 35
FunctioncublasSrotm()onpage 36
FunctioncublasSrotmg()onpage 38
FunctioncublasSscal()onpage 39
FunctioncublasSswap()onpage 40
PG- 05326- 032_V02 29
NVI DI A
CHAPTER 2 BLAS1 Funct ions
Funct ion cublasI samax( )
int
cublasIsamax (int n, const float *x, int incx)
findsthesmallestindexofthemaximummagnitudeelementofsingle
precisionvectorx;thatis,theresultisthefirsti,i=0ton-1,that
maximizes .Theresultreflects1basedindexing
forcompatibilitywithFortran.
Reference:http://www.netlib.org/blas/isamax.f
ErrorstatusforthisfunctioncanberetrievedviacublasGetError().
Funct ion cublasI samin( )
int
cublasIsamin (int n, const float *x, int incx)
findsthesmallestindexoftheminimummagnitudeelementofsingle
precisionvectorx;thatis,theresultisthefirsti,i=0ton-1,that
minimizes .Theresultreflects1basedindexing
forcompatibilitywithFortran.
I nput
n
number of elements in input vector
x
single-precision vector with n elements
incx
storage spacing between elements of x
Out put
returns the smallest index (returns zero if n <= 0 or incx <= 0)
Error St at us
if function could not allocate
reduction buffer
if function failed to launch on GPU
abs x 1 i incx * + | | ( )
I nput
n
x
incx
abs x 1 i incx * + | | ( )
30 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/scilib/blass.f
Funct ion cublasSasum( )
float
cublasSasum (int n, const float *x, int incx)
computesthesumoftheabsolutevaluesoftheelementsofsingle
precisionvectorx;thatis,theresultisthesumfromi=0ton-1of
.
Reference:http://www.netlib.org/blas/sasum.f
Out put
Error St at us
reduction buffer
I nput
n
x
incx
Out put
returns the single-precision sum of absolute values
(returns zero if n <= 0 or incx <= 0, or if an error occurred)
Error St at us
reduction buffer
abs x 1 i incx * + | | ( )
PG- 05326- 032_V02 31
NVI DI A
Funct ion cublasSaxpy( )
void
cublasSaxpy (int n, float alpha, const float *x,
int incx, float *y, int incy)
multipliessingleprecisionvectorxbysingleprecisionscalaralpha
andaddstheresulttosingleprecisionvectory;thatis,itoverwrites
singleprecisionywithsingleprecision .
Fori=0ton-1,itreplaces
where
lyisdefinedinasimilarwayusingincy.
Reference:http://www.netlib.org/blas/saxpy.f
with ,
lx = 0 ifincx>=0,else
;
I nput
n
number of elements in input vectors
alpha
single-precision scalar multiplier
x
incx
y
incy
storage spacing between elements of y
Out put
y
single-precision result (unchanged if n <= 0)
Error St at us
alpha x y + *
y ly i incy * + | | alpha x lx i incx * + | | y ly i incy * + | | + *
lx 1 1 n ( ) + incx * =
32 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasScopy( )
void
cublasScopy (int n, const float *x, int incx, float *y,
int incy)
copiesthesingleprecisionvectorxtothesingleprecisionvectory.For
i=0ton-1,itcopies
where
Reference:http://www.netlib.org/blas/scopy.f
to ,
lx = 1ifincx>=0,else
;
I nput
n
x
incx
y
incy
Out put
y
contains single-precision vector x
Error St at us
x lx i incx * + | | y ly i incy * + | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 33
NVI DI A
Funct ion cublasSdot ( )
float
cublasSdot (int n, const float *x, int incx,
const float *y, int incy)
computesthedotproductoftwosingleprecisionvectors.Itreturns
thedotproductofthesingleprecisionvectorsxandyifsuccessful,
and0.0fotherwise.Itcomputesthesumfori=0ton-1of
where
Reference:http://www.netlib.org/blas/sdot.f
,
;
I nput
n
x
incx
y
incy
Out put
returns single-precision dot product (returns zero if n <= 0)
Error St at us
reduction buffer
if function failed to execute on GPU
x lx i incx * + | | y ly i + incy * | | *
lx 1 1 n ( ) incx * + =
34 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSnrm2( )
float
cublasSnrm2 (int n, const float *x, int incx)
computestheEuclideannormofthesingleprecisionnvectorx(with
storageincrementincx).Thiscodeusesamultiphasemodelof
accumulationtoavoidintermediateunderflowandoverflow.
Reference:http://www.netlib.org/blas/snrm2.f
Reference:http://www.netlib.org/slatec/lin/snrm2.f
Funct ion cublasSrot ( )
void
cublasSrot (int n, float *x, int incx, float *y, int incy,
float sc, float ss)
multipliesa22matrix withthe2nmatrix .
Theelementsofxarein ,i=0ton-1,where
I nput
n
x
incx
Out put
returns the Euclidian norm
(returns zero if n <= 0, incx <= 0, or if an error occurred)
Error St at us
reduction buffer
;
sc ss
ss sc
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 35
NVI DI A
yistreatedsimilarlyusinglyandincy.
Reference:http://www.netlib.org/blas/srot.f
Funct ion cublasSrot g( )
void
cublasSrotg (float *host_sa, float *host_sb,
float *host_sc, float *host_ss)
constructstheGivenstransformation
whichzerosthesecondentryofthe2vector .
I nput
n
x
incx
y
incy
sc
element of rotation matrix
ss
Out put
x
rotated vector x (unchanged if n <= 0)
y
rotated vector y (unchanged if n <= 0)
Error St at us
G
sc ss
ss sc
, sc
2
ss
2
+ 1 = =
sa sb
T
36 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Thequantity overwritessainstorage.Thevalueof
sbisoverwrittenbyavaluezwhichallowsscandsstoberecovered
bythefollowingalgorithm:
ThefunctioncublasSrot(n,x,incx,y,incy,sc,ss)normallyis
callednexttoapplythetransformationtoa2nmatrix.Notethatthis
functionisprovidedforcompletenessandisrunexclusivelyonthe
host.
Reference:http://www.netlib.org/blas/srotg.f
Thisfunctiondoesnotsetanyerrorstatus.
Funct ion cublasSrot m( )
void cublasSrotm (int n, float *x, int incx, float *y,
int incy, const float *sparam)
appliesthemodifiedGivenstransformation,h,tothe2nmatrix
if set and .
if
set and .
if
set and .
I nput
sa
single-precision scalar
sb
single-precision scalar
Out put
sa
single-precision r
sb
single-precision z
sc
single-precision result
ss
single-precision result
r sa
2
sb
2
+ =
z 1 = sc 0.0 = ss 1.0 =
abs z ( ) 1 <
sc 1 z
2
= ss z =
abs z ( ) 1 >
sc 1 z = ss 1 sc
2
=
;
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 37
NVI DI A
Withsparam[0]=sflag,hhasoneofthefollowingforms:
Reference:http://www.netlib.org/blas/srotm.f
I nput
n
number of elements in input vectors.
x
single-precision vector with n elements.
incx
storage spacing between elements of x.
y
single-precision vector with n elements.
incy
storage spacing between elements of y.
sparam
5-element vector. sparam[0] is sflag described above. sparam[1]
through sparam[4] contain the 22 rotation matrix h: sparam[1]
contains sh00, sparam[2] contains sh10, sparam[3] contains
sh01, and sparam[4] contains sh11.
Out put
x
y
Error St at us
sflag -1.0f =
h
sh00 sh01
sh10 sh11
=
sflag 1.0f =
h
sh00 1.0f
-1.0f sh11
=
sflag 0.0f =
h
1.0f sh01
sh10 1.0f
=
sflag -2.0f =
h
1.0f 0.0f
0.0f 1.0f
=
38 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSrot mg( )
void
cublasSrotmg (float *host_sd1, float *host_sd2,
float *host_sx1, const float *host_sy1,
float *host_sparam)
constructsthemodifiedGivenstransformationmatrixhwhichzeros
thesecondcomponentofthe2vector .
Withsparam[0]=sflag,hhasoneofthefollowingforms:
sparam[1]throughsparam[4]containsh00,sh10,sh01,andsh11,
respectively.Valuesof1.0f,-1.0f,or0.0fimpliedbythevalueof
sflagarenotstoredinsparam.Notethatthisfunctionisprovidedfor
completenessandisrunexclusivelyonthehost.
I nput
sd1
single-precision scalar.
sd2
sx1
sy1
Out put
sd1
changed to represent the effect of the transformation.
sd2
sx1
sparam
5-element vector. sparam[0] is sflag described above. sparam[1]
through sparam[4] contain the 22 rotation matrix h: sparam[1]
contains sh00, sparam[2] contains sh10, sparam[3] contains
sh01, and sparam[4] contains sh11.
sd1*sx1 sd2*sy1 , ( )
T
sflag -1.0f =
h
sh00 sh01
sh10 sh11
=
sflag 1.0f =
h
sh00 1.0f
-1.0f sh11
=
sflag 0.0f =
h
1.0f sh01
sh10 1.0f
=
sflag -2.0f =
h
1.0f 0.0f
0.0f 1.0f
=
PG- 05326- 032_V02 39
NVI DI A
Reference:http://www.netlib.org/blas/srotmg.f
Funct ion cublasSscal( )
void
cublasSscal (int n, float alpha, float *x, int incx)
replacessingleprecisionvectorxwithsingleprecisionalpha*x.For
i =0ton-1,itreplaces
where
Reference:http://www.netlib.org/blas/sscal.f
with ,
.
I nput
n
alpha
x
incx
Out put
x
single-precision result (unchanged if n <= 0 or incx <= 0)
Error St at us
x lx i incx * + | | alpha x lx i incx * + | | *
lx 1 1 n ( ) incx * + =
40 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSswap( )
void
cublasSswap (int n, float *x, int incx, float *y,
int incy)
interchangessingleprecisionvectorxwithsingleprecisionvectory.
Fori=0ton-1,itinterchanges
where
lyisdefinedinasimilarmannerusingincy.
Reference:http://www.netlib.org/blas/sswap.f
with ,
;
I nput
n
x
incx
y
incy
Out put
x
single-precision vector y (unchanged from input if n <= 0)
y
single-precision vector x (unchanged from input if n <= 0)
Error St at us
x lx i incx * + | | y ly i + incy * | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 41
NVI DI A
Single- Precision Complex BLAS1 Funct ions
ThesingleprecisioncomplexBLAS1functionsareasfollows:
FunctioncublasCaxpy()onpage 42
FunctioncublasCcopy()onpage 43
FunctioncublasCdotc()onpage 44
FunctioncublasCdotu()onpage 45
FunctioncublasCrot()onpage 46
FunctioncublasCrotg()onpage 47
FunctioncublasCscal()onpage 48
FunctioncublasCsrot()onpage 48
FunctioncublasCsscal()onpage 49
FunctioncublasCswap()onpage 50
FunctioncublasIcamax()onpage 51
FunctioncublasIcamin()onpage 52
FunctioncublasScasum()onpage 52
FunctioncublasScnrm2()onpage 53
42 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCaxpy( )
void
cublasCaxpy (int n, cuComplex alpha, const cuComplex *x,
int incx, cuComplex *y, int incy)
multipliessingleprecisioncomplexvectorxbysingleprecision
complexscalaralphaandaddstheresulttosingleprecisioncomplex
vectory;thatis,itoverwritessingleprecisioncomplexywithsingle
precisioncomplex .
where
Reference:http://www.netlib.org/blas/caxpy.f
with ,
;
I nput
n
alpha
single-precision complex scalar multiplier
x
single-precision complex vector with n elements
incx
y
incy
Out put
y
single-precision complex result (unchanged if n <= 0)
Error St at us
alpha x y + *
y ly i + incy * | | alpha x lx i incx * + | | y ly i incy * + | | + *
lx 1 1 n ( ) + incx * =
PG- 05326- 032_V02 43
NVI DI A
Funct ion cublasCcopy( )
void
cublasCcopy (int n, const cuComplex *x, int incx,
cuComplex *y, int incy)
copiesthesingleprecisioncomplexvectorxtothesingleprecision
complexvectory.
Fori=0ton-1,itcopies
where
Reference:http://www.netlib.org/blas/ccopy.f
to ,
;
I nput
n
x
incx
y
incy
Out put
y
contains single-precision complex vector x
Error St at us
lx 1 1 n ( ) incx * + =
44 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCdot c( )
cuComplex
cublasCdotc (int n, const cuComplex *x, int incx,
const cuComplex *y, int incy)
computesthedotproductoftwosingleprecisioncomplexvectors,the
firstofwhichisconjugated.Itreturnsthedotproductofthecomplex
conjugateofsingleprecisioncomplexvectorxandthesingle
precisioncomplexvectoryifsuccessful,andcomplexzerootherwise.
Fori=0ton-1,itsumstheproducts
where
Reference:http://www.netlib.org/blas/cdotc.f
* ,
;
I nput
n
x
incx
y
incy
Out put
returns single-precision complex dot product (zero if n <= 0)
Error St at us
reduction buffer
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 45
NVI DI A
Funct ion cublasCdot u( )
cuComplex
cublasCdotu (int n, const cuComplex *x, int incx,
const cuComplex *y, int incy)
computesthedotproductoftwosingleprecisioncomplexvectors.It
returnsthedotproductofthesingleprecisioncomplexvectorsxandy
ifsuccessful,andcomplexzerootherwise.Fori=0ton-1,itsumsthe
products
where
Reference:http://www.netlib.org/blas/cdotu.f
* ,
;
I nput
n
x
incx
y
incy
Out put
returns single-precision complex dot product (returns zero if n <= 0)
Error St at us
reduction buffer
lx 1 1 n ( ) incx * + =
46 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCrot ( )
void
cublasCrot (int n, cuComplex *x, int incx, cuComplex *y,
int incy, float sc, cuComplex cs)
Reference:http://netlib.org/lapack/explorehtml/crot.f.html
;
I nput
n
x
incx
y
incy
sc
single-precision cosine component of rotation matrix
cs
single-precision complex sine component of rotation matrix
Out put
x
y
Error St at us
sc cs
cs sc
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 47
NVI DI A
Funct ion cublasCrot g( )
void
cublasCrotg (cuComplex *host_ca, cuComplex cb,
float *host_sc, float *host_cs)
constructsthecomplexGivenstransformation
whichzerosthesecondentryofthecomplex2vector .
Thequantity overwritescainstorage.Inthiscase,
,where
.
ThefunctioncublasCrot(n,x,incx,y,incy,sc,cs)normallyis
host.
Reference:http://www.netlib.org/blas/crotg.f
I nput
ca
single-precision complex scalar
cb
single-precision complex scalar
Out put
ca
single-precision complex
sc
cs
single-precision complex sine component of rotation matrix
G
sc cs
cs sc
, sc
-
sc cs
-
cs + 1 = =
ca cb
T
ca ca
-
ca, cb
ca, cb scale* ca scale
2
cb scale
2
+ =
scale ca cb + =
ca ca
-
ca, cb
48 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCscal( )
void
cublasCscal (int n, cuComplex alpha, cuComplex *x,
int incx)
replacessingleprecisioncomplexvectorxwithsingleprecision
complexalpha*x.
Fori = 0ton-1,itreplaces
where
Reference:http://www.netlib.org/blas/cscal.f
Funct ion cublasCsrot ( )
void
cublasCsrot (int n, cuComplex *x, int incx, cuComplex *y,
int incy, float sc, float ss)
with ,
.
I nput
n
alpha
single-precision complex scalar multiplier
x
incx
Out put
x
single-precision complex result (unchanged if n <= 0 or incx <= 0)
Error St at us
x lx i + incx * | | alpha x lx i incx * + | | *
lx 1 1 n ( ) incx * + =
sc ss
ss sc
x
T
y
T
PG- 05326- 032_V02 49
NVI DI A
Reference:http://www.netlib.org/blas/csrot.f
Funct ion cublasCsscal( )
void
cublasCsscal (int n, float alpha, cuComplex *x, int incx)
replacessingleprecisioncomplexvectorxwithsingleprecision
complexalpha*x.Fori = 0ton-1,itreplaces
where
;
I nput
n
x
incx
y
incy
sc
ss
single-precision sine component of rotation matrix
Out put
x
y
Error St at us
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
with ,
.
lx 1 1 n ( ) incx * + =
50 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/csscal.f
Funct ion cublasCswap( )
void
cublasCswap (int n, const cuComplex *x, int incx,
cuComplex *y, int incy)
interchangesthesingleprecisioncomplexvectorxwiththesingle
precisioncomplexvectory.Fori=0ton-1,itinterchanges
where
I nput
n
alpha
x
incx
Out put
x
single-precision complex result (unchanged if n <= 0 or incx <= 0)
Error St at us
with ,
;
I nput
n
x
incx
y
incy
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 51
NVI DI A
Reference:http://www.netlib.org/blas/cswap.f
Funct ion cublasI camax( )
int
cublasIcamax (int n, const cuComplex *x, int incx)
findsthesmallestindexofthemaximummagnitudeelementofsingle
precisioncomplexvectorx;thatis,theresultisthefirsti,i=0ton-1,
thatmaximizes .Theresultreflects1based
indexingforcompatibilitywithFortran.
Reference:http://www.netlib.org/blas/icamax.f
Out put
x
single-precision complex vector y (unchanged from input if n <= 0)
y
single-precision complex vector x (unchanged from input if n <= 0)
Error St at us
I nput
n
x
incx
Out put
Error St at us
reduction buffer
abs x 1 i incx * + | | ( )
52 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasI camin( )
int
cublasIcamin (int n, const cuComplex *x, int incx)
findsthesmallestindexoftheminimummagnitudeelementofsingle
precisioncomplexvectorx;thatis,theresultisthefirsti,i=0ton-1,
thatminimizes .Theresultreflects1based
Reference:Analogoustohttp://www.netlib.org/blas/icamax.f
Funct ion cublasScasum( )
float
cublasScasum (int n, const cuDouble *x, int incx)
takesthesumoftheabsolutevaluesofacomplexvectorandreturnsa
singleprecisionresult.NotethatthisisnottheL1normofthevector.
Theresultisthesumfrom0ton-1of
where
I nput
n
x
incx
Out put
Error St at us
reduction buffer
abs x 1 i incx * + | | ( )
,
lx = 1ifincx<=0,else
.
abs real x lx i incx * + | | ( ) ( ) abs imag x lx i incx * + | | ( ) ( ) +
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 53
NVI DI A
Reference:http://www.netlib.org/blas/scasum.f
Funct ion cublasScnrm2( )
float
cublasScnrm2 (int n, const cuComplex *x, int incx)
computestheEuclideannormofsingleprecisioncomplexnvectorx.
Thisimplementationusessimplescalingtoavoidintermediate
underflowandoverflow.
Reference:http://www.netlib.org/blas/scnrm2.f
I nput
n
x
incx
Out put
returns the single-precision sum of absolute values of real and imaginary parts
Error St at us
reduction buffer
I nput
n
x
incx
Out put
54 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Error St at us
reduction buffer
PG- 05326- 032_V02 55
NVI DI A
Double- Precision BLAS1 Funct ions
Note: DoubleprecisionfunctionsareonlysupportedonGPUswithdouble
precisionhardware.
ThedoubleprecisionBLAS1functionsareasfollows:
FunctioncublasIdamax()onpage 56
FunctioncublasIdamin()onpage 56
FunctioncublasDasum()onpage 57
FunctioncublasDaxpy()onpage 58
FunctioncublasDcopy()onpage 59
FunctioncublasDdot()onpage 60
FunctioncublasDnrm2()onpage 61
FunctioncublasDrot()onpage 62
FunctioncublasDrotg()onpage 63
FunctioncublasDrotm()onpage 64
FunctioncublasDrotmg()onpage 65
FunctioncublasDscal()onpage 66
FunctioncublasDswap()onpage 67
56 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasI damax( )
int
cublasIdamax (int n, const double *x, int incx)
findsthesmallestindexofthemaximummagnitudeelementof
doubleprecisionvectorx;thatis,theresultisthefirsti,i=0ton-1,
thatmaximizes .Theresultreflects1based
Reference:http://www.netlib.org/blas/idamax.f
Funct ion cublasI damin( )
int
cublasIdamin (int n, const double *x, int incx)
findsthesmallestindexoftheminimummagnitudeelementof
doubleprecisionvectorx;thatis,theresultisthefirsti,i=0ton-1,
thatminimizes .Theresultreflects1based
I nput
n
x
double-precision vector with n elements
incx
Out put
Error St at us
reduction buffer
if function invoked on device that
does not support double precision
abs x 1 i incx * + | | ( )
abs x 1 i incx * + | | ( )
PG- 05326- 032_V02 57
NVI DI A
Analogoustohttp://www.netlib.org/blas/idamax.f
Funct ion cublasDasum( )
double
cublasDasum (int n, const double *x, int incx)
computesthesumoftheabsolutevaluesoftheelementsofdouble
precisionvectorx;thatis,theresultisthesumfromi=0ton-1of
.
Reference:http://www.netlib.org/blas/dasum.f
I nput
n
x
incx
Out put
Error St at us
reduction buffer
I nput
n
x
incx
Out put
returns the double-precision sum of absolute values
(returns zero if n <= 0 or incx <= 0, or if an error occurred)
abs x 1 i incx * + | | ( )
58 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDaxpy( )
void
cublasDaxpy (int n, double alpha, const double *x,
int incx, double *y, int incy)
multipliesdoubleprecisionvectorxbydoubleprecisionscalaralpha
andaddstheresulttodoubleprecisionvectory;thatis,itoverwrites
doubleprecisionywithdoubleprecision .
where
Error St at us
reduction buffer
with ,
;
I nput
n
alpha
double-precision scalar multiplier
x
incx
y
incy
Out put
y
double-precision result (unchanged if n <= 0)
alpha x y + *
lx 1 1 n ( ) + incx * =
PG- 05326- 032_V02 59
NVI DI A
Reference:http://www.netlib.org/blas/daxpy.f
Funct ion cublasDcopy( )
void
cublasDcopy (int n, const double *x, int incx, double *y,
int incy)
copiesthedoubleprecisionvectorxtothedoubleprecisionvectory.
Fori=0ton-1,itcopies
where
Reference:http://www.netlib.org/blas/dcopy.f
Error St at us
to ,
;
I nput
n
x
incx
y
incy
Out put
y
contains double-precision vector x
lx 1 1 n ( ) incx * + =
60 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDdot ( )
double
cublasDdot (int n, const double *x, int incx,
const double *y, int incy)
computesthedotproductoftwodoubleprecisionvectors.Itreturns
thedotproductofthedoubleprecisionvectorsxandyifsuccessful,
and0.0otherwise.Itcomputesthesumfori=0ton-1of
where
Reference:http://www.netlib.org/blas/ddot.f
Error St at us
,
;
I nput
n
x
incx
y
incy
Out put
returns double-precision dot product (returns zero if n <= 0)
x lx i incx * + | | y ly i + incy * | | *
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 61
NVI DI A
Funct ion cublasDnrm2( )
double
cublasDnrm2 (int n, const double *x, int incx)
computestheEuclideannormofthedoubleprecisionnvectorx(with
storageincrementincx).Thiscodeusesamultiphasemodelof
accumulationtoavoidintermediateunderflowandoverflow.
Reference:http://www.netlib.org/blas/dnrm2.f
Reference:http://www.netlib.org/slatec/lin/dnrm2.f
Error St at us
reduction buffer
I nput
n
x
incx
Out put
Error St at us
reduction buffer
62 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDrot ( )
void
cublasDrot (int n, double *x, int incx, double *y,
int incy, double dc, double ds)
Reference:http://www.netlib.org/blas/drot.f
;
I nput
n
x
incx
y
incy
dc
ds
Out put
x
y
Error St at us
dc ds
ds dc
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 63
NVI DI A
Funct ion cublasDrot g( )
void
cublasDrotg (double *host_da, double *host_db,
double *host_dc, double *host_ds)
constructstheGivenstransformation
whichzerosthesecondentryofthe2vector .
Thequantity overwritesdainstorage.Thevalueof
dbisoverwrittenbyavaluezwhichallowsdcanddstoberecovered
bythefollowingalgorithm:
ThefunctioncublasDrot(n,x,incx,y,incy,dc,ds)normallyis
host.
Reference:http://www.netlib.org/blas/drotg.f
if set and .
if
set and .
if
set and .
I nput
da
double-precision scalar
db
Out put
da
double-precision r
db
double-precision z
dc
double-precision result
ds
double-precision result
G
dc ds
ds dc
, dc
2
ds
2
+ 1 = =
da db
T
r da
2
db
2
+ =
z 1 = dc 0.0 = ds 1.0 =
abs z ( ) 1 <
dc 1 z
2
= ds z =
abs z ( ) 1 >
dc 1 z = ds 1 dc
2
=
64 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDrot m( )
void
cublasDrotm (int n, double *x, int incx, double *y,
int incy, const double *dparam)
appliesthemodifiedGivenstransformation,h,tothe2nmatrix
Withdparam[0]=dflag,hhasoneofthefollowingforms:
;
I nput
n
number of elements in input vectors.
x
double-precision vector with n elements.
incx
storage spacing between elements of x.
y
double-precision vector with n elements.
incy
storage spacing between elements of y.
dparam
5-element vector. dparam[0] is dflag described above. dparam[1]
through dparam[4] contain the 22 rotation matrix h: dparam[1]
contains dh00, dparam[2] contains dh10, dparam[3] contains
dh01, and dparam[4] contains dh11.
Out put
x
y
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
dflag -1.0 =
h
dh00 dh01
dh10 dh11
=
dflag 1.0 =
h
dh00 1.0
-1.0 dh11
=
dflag 0.0 =
h
1.0 dh01
dh10 1.0
=
dflag -2.0 =
h
1.0 0.0
0.0 1.0
=
PG- 05326- 032_V02 65
NVI DI A
Reference:http://www.netlib.org/blas/drotm.f
Funct ion cublasDrot mg( )
void
cublasDrotmg (double *host_dd1, double *host_dd2,
double *host_dx1, const double *host_dy1,
double *host_dparam)
constructsthemodifiedGivenstransformationmatrixhwhichzeros
thesecondcomponentofthe2vector .
Withdparam[0]=dflag,hhasoneofthefollowingforms:
dparam[1]throughdparam[4]containdh00,dh10,dh01,anddh11,
respectively.Valuesof1.0,-1.0,or0.0impliedbythevalueofdflag
arenotstoredindparam.Notethatthisfunctionisprovidedfor
completenessandisrunexclusivelyonthehost.
Error St at us
I nput
dd1
dd2
dx1
dy1
dd1*dx1 dd2*dy1 , ( )
T
dflag -1.0 =
h
dh00 dh01
dh10 dh11
=
dflag 1.0 =
h
dh00 1.0
-1.0 dh11
=
dflag 0.0 =
h
1.0 dh01
dh10 1.0
=
dflag -2.0 =
h
1.0 0.0
0.0 1.0
=
66 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/drotmg.f
Funct ion cublasDscal( )
void
cublasDscal (int n, double alpha, double *x, int incx)
replacesdoubleprecisionvectorxwithdoubleprecisionalpha*x.
Fori =0ton-1,itreplaces
where
Reference:http://www.netlib.org/blas/dscal.f
Out put
dd1
changed to represent the effect of the transformation
dd2
dx1
dparam
5-element vector. dparam[0] is dflag described above. dparam[1]
through dparam[4] contain the 22 rotation matrix h: dparam[1]
contains dh00, dparam[2] contains dh10, dparam[3] contains
dh01, and dparam[4] contains dh11.
with ,
.
I nput
n
alpha
x
incx
Out put
x
double-precision result (unchanged if n <= 0 or incx <= 0)
x lx i incx * + | | alpha x lx i incx * + | | *
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 67
NVI DI A
Funct ion cublasDswap( )
void
cublasDswap (int n, double *x, int incx, double *y,
int incy)
interchangesdoubleprecisionvectorxwithdoubleprecisionvectory.
Fori=0ton-1,itinterchanges
where
Reference:http://www.netlib.org/blas/dswap.f
Error St at us
with ,
;
I nput
n
x
incx
y
incy
Out put
x
double-precision vector y (unchanged from input if n <= 0)
y
double-precision vector x (unchanged from input if n <= 0)
lx 1 1 n ( ) incx * + =
68 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Error St at us
PG- 05326- 032_V02 69
NVI DI A
Double- Precision Complex BLAS1 funct ions
precisionhardware.
ThedoubleprecisioncomplexBLAS1functionsarelistedbelow:
FunctioncublasDzasum()onpage 70
FunctioncublasDznrm2()onpage 71
FunctioncublasIzamax()onpage 71
FunctioncublasIzamin()onpage 72
FunctioncublasZaxpy()onpage 73
FunctioncublasZcopy()onpage 74
FunctioncublasZdotc()onpage 75
FunctioncublasZdotu()onpage 76
FunctioncublasZdrot()onpage 77
FunctioncublasZdscal()onpage 78
FunctioncublasZrot()onpage 79
FunctioncublasZrotg()onpage 80
FunctioncublasZscal()onpage 80
FunctioncublasZswap()onpage 81
70 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDzasum( )
double
cublasDzasum (int n, const cuDoubleComplex *x, int incx)
takesthesumoftheabsolutevaluesofacomplexvectorandreturnsa
doubleprecisionresult.NotethatthisisnottheL1normofthevector.
Theresultisthesumfrom0ton-1of
where
Referencehttp://www.netlib.org/blas/dzasum.f
,
lx = 1ifincx<=0,else
.
I nput
n
x
double-precision complex vector with n elements
incx
Out put
returns the double-precision sum of absolute values of real and imaginary parts
Error St at us
abs real x lx i incx * + | | ( ) ( ) abs imag x lx i incx * + | | ( ) ( ) +
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 71
NVI DI A
Funct ion cublasDznrm2( )
double
cublasDznrm2 (int n, const cuDoubleComplex *x, int incx)
computestheEuclideannormofdoubleprecisioncomplexnvectorx.
Thisimplementationusessimplescalingtoavoidintermediate
underflowandoverflow.
Reference:http://www.netlib.org/blas/dznrm2.f
Funct ion cublasI zamax( )
int
cublasIzamax (int n, const cuDoubleComplex *x, int incx)
findsthesmallestindexofthemaximummagnitudeelementof
doubleprecisioncomplexvectorx;thatis,theresultisthefirsti,i=0
ton-1,thatmaximizes
.
I nput
n
x
incx
Out put
Error St at us
I nput
n
x
incx
abs real x 1 i incx * + | | ( ) ( ) abs imag x 1 i incx * + | | ( ) ( ) +
72 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/izamax.f
Funct ion cublasI zamin( )
int
cublasIzamin (int n, const cuDoubleComplex *x, int incx)
findsthesmallestindexoftheminimummagnitudeelementof
doubleprecisioncomplexvectorx;thatis,theresultisthefirsti,i=0
ton-1,thatminimizes
.
Reference:analogoustoFunctioncublasIzamax()onpage 71.
Out put
Error St at us
I nput
n
x
incx
Out put
Error St at us
abs real x 1 i incx * + | | ( ) ( ) abs imag x 1 i incx * + | | ( ) ( ) +
PG- 05326- 032_V02 73
NVI DI A
Funct ion cublasZaxpy( )
void
cublasZaxpy (int n, cuDoubleComplex alpha,
const cuDoubleComplex *x, int incx,
cuDoubleComplex *y, int incy)
multipliesdoubleprecisioncomplexvectorxbydoubleprecision
complexscalaralphaandaddstheresulttodoubleprecisioncomplex
vectory;thatis,itoverwritesdoubleprecisioncomplexywithdouble
precisioncomplex .
where
Reference:http://www.netlib.org/blas/zaxpy.f
with ,
;
I nput
n
alpha
double-precision complex scalar multiplier
x
incx
y
incy
Out put
y
double-precision complex result (unchanged if n <= 0)
Error St at us
alpha x y + *
lx 1 1 n ( ) + incx * =
74 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZcopy( )
void
cublasZcopy (int n, const cuDoubleComplex *x, int incx,
copiesthedoubleprecisioncomplexvectorxtothedoubleprecision
complexvectory.Fori=0ton-1,itcopies
where
Reference:http://www.netlib.org/blas/zcopy.f
to ,
;
I nput
n
x
incx
y
incy
Out put
y
contains double-precision complex vector x
Error St at us
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 75
NVI DI A
Funct ion cublasZdot c( )
cuDoubleComplex
cublasZdotc (int n, const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy)
computesthedotproductoftwodoubleprecisioncomplexvectors.It
returnsthedotproductofthedoubleprecisioncomplexvectorsxand
yifsuccessful,andcomplexzerootherwise.Fori=0ton-1,itsums
theproducts
where
Reference:http://www.netlib.org/blas/zdotc.f
* ,
;
I nput
n
x
incx
y
incy
Out put
returns double-precision complex dot product (zero if n <= 0)
Error St at us
reduction buffer
lx 1 1 n ( ) incx * + =
76 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZdot u( )
cuDoubleComplex
cublasZdotu (int n, const cuDoubleComplex *x, int incx,
const cuDoubleComplex *y, int incy)
computesthedotproductoftwodoubleprecisioncomplexvectors.It
returnsthedotproductofthedoubleprecisioncomplexvectorsxand
yifsuccessful,andcomplexzerootherwise.Fori=0ton-1,itsums
theproducts
where
Reference:http://www.netlib.org/blas/zdotu.f
* ,
;
I nput
n
x
incx
y
incy
Out put
returns double-precision complex dot product (returns zero if n <= 0)
Error St at us
reduction buffer
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 77
NVI DI A
Funct ion cublasZdrot ( )
void
cublasZdrot (int n, cuDoubleComplex *x, int incx,
cuDoubleComplex *y, int incy,
double c, double s)
Reference:http://www.netlib.org/blas/zdrot.f
;
I nput
n
x
incx
y
incy
c
double-precision cosine component of rotation matrix
s
double-precision sine component of rotation matrix
Out put
x
y
Error St at us
c s
s c
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
78 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZdscal( )
void
cublasZdscal (int n, double alpha, cuDoubleComplex *x,
int incx)
replacesdoubleprecisioncomplexvectorxwithdoubleprecision
complexalpha*x.
where
Reference:http://www.netlib.org/blas/zdscal.f
with ,
.
I nput
n
alpha
x
incx
Out put
x
double-precision complex result (unchanged if n <= 0 or incx <= 0)
Error St at us
lx 1 1 n ( ) incx * + =
PG- 05326- 032_V02 79
NVI DI A
Funct ion cublasZrot ( )
void
cublasZrot (int n, cuDoubleComplex *x, int incx,
cuDoubleComplex *y, int incy, double sc,
cuDoubleComplex cs)
Reference:http://netlib.org/lapack/explorehtml/zrot.f.html
;
I nput
n
x
incx
y
incy
sc
cs
double-precision complex sine component of rotation matrix
Out put
x
rotated double-precision complex vector x (unchanged if n <= 0)
y
rotated double-precision complex vector y (unchanged if n <= 0)
Error St at us
sc cs
cs sc
x
T
y
T
x lx i incx * + | |
lx 1 1 n ( ) incx * + =
80 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZrot g( )
void
cublasZrotg (cuDoubleComplex *host_ca,
cuDoubleComplex *host_cb, double *host_sc,
double *host_cs)
constructsthecomplexGivenstransformation
whichzerosthesecondentryofthecomplex2vector .
Thequantity overwritescainstorage.Thefunction
cublasCrot(n,x,incx,y,incy,sc,cs)normallyiscallednextto
applythetransformationtoa2nmatrix.Notethatthisfunctionis
providedforcompletenessandisrunexclusivelyonthehost.
Reference:http://www.netlib.org/blas/zrotg.f
Funct ion cublasZscal( )
void
cublasZscal (int n, cuDoubleComplex alpha,
cuDoubleComplex *x, int incx)
replacesdoubleprecisioncomplexvectorxwithdoubleprecision
complexalpha*x.
I nput
ca
double-precision complex scalar
cb
double-precision complex scalar
Out put
ca
double-precision complex
sc
cs
double-precision complex sine component of rotation matrix
G
sc cs
cs sc
, sc
2
cs
2
+ 1 = =
ca cb
T
ca ca
-
ca, cb
ca ca
-
ca, cb
PG- 05326- 032_V02 81
NVI DI A
where
Reference:http://www.netlib.org/blas/zscal.f
Funct ion cublasZswap( )
void
cublasZswap (int n, cuDoubleComplex *x, int incx,
interchangesdoubleprecisioncomplexvectorxwithdoubleprecision
complexvectory.Fori=0ton-1,itinterchanges
where
with ,
.
I nput
n
alpha
double-precision complex scalar multiplier
x
incx
Out put
x
double-precision complex result (unchanged if n <= 0 or incx <= 0)
Error St at us
lx 1 1 n ( ) incx * + =
with ,
;
lx 1 1 n ( ) incx * + =
82 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zswap.f
I nput
n
x
incx
y
incy
Out put
x
double-precision complex vector y (unchanged from input if n <= 0)
y
double-precision complex vector x (unchanged from input if n <= 0)
Error St at us
PG- 05326- 032_V02 83
NVI DI A
CH A P T E R
3
Si ngl e- Pr eci si on BLAS2
Funct i ons
TheLevel2BasicLinearAlgebraSubprograms(BLAS2)arefunctions
thatperformmatrixvectoroperations.TheCUBLASimplementations
ofsingleprecisionBLAS2functionsaredescribedinthesesections:
84 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
ThesingleprecisionBLAS2functionsareasfollows:
FunctioncublasSgbmv()onpage 85
FunctioncublasSgemv()onpage 86
FunctioncublasSger()onpage 87
FunctioncublasSsbmv()onpage 88
FunctioncublasSspmv()onpage 90
FunctioncublasSspr()onpage 91
FunctioncublasSspr2()onpage 92
FunctioncublasSsymv()onpage 94
FunctioncublasSsyr()onpage 95
FunctioncublasSsyr2()onpage 96
FunctioncublasStbmv()onpage 98
FunctioncublasStbsv()onpage 99
FunctioncublasStpmv()onpage 101
FunctioncublasStpsv()onpage 102
FunctioncublasStrmv()onpage 104
FunctioncublasStrsv()onpage 105
PG- 05326- 032_V02 85
NVI DI A
CHAPTER 3 Single- Precision BLAS2 Funct ions
Funct ion cublasSgbmv( )
void
cublasSgbmv (char trans, int m, int n, int kl, int ku,
float alpha, const float *A, int lda,
const float *x, int incx, float beta,
float *y, int incy);
performsoneofthematrixvectoroperations
alphaandbetaaresingleprecisionscalars,andxandyaresingle
precisionvectors.Aisanmnbandmatrixconsistingofsingle
precisionelementswithklsubdiagonalsandkusuperdiagonals.
,
where or ,
I nput
trans
specifies op(A). If trans == 'N' or 'n', .
If trans == 'T', 't', 'C', or 'c', .
m
the number of rows of matrix A; m must be at least zero.
n
the number of columns of matrix A; n must be at least zero.
kl
the number of subdiagonals of matrix A; kl must be at least zero.
ku
the number of superdiagonals of matrix A; ku must be at least zero.
alpha
single-precision scalar multiplier applied to op(A).
A
single-precision array of dimensions (lda, n). The leading
part of array A must contain the band matrix A,
supplied column by column, with the leading diagonal of the matrix in
row ku+1 of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row ku+2, and so
on. Elements in the array A that do not correspond to elements in the
band matrix (such as the top left kuku triangle) are not referenced.
lda
leading dimension of A; lda must be at least .
x
single-precision array of length at least
when trans == 'N' or 'n', and at least
otherwise.
incx
storage spacing between elements of x; incx must not be zero.
beta
single-precision scalar multiplier applied to vector y. If beta is zero, y
is not read.
y alpha op A ( ) x beta y * + * * =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
kl ku 1 + + ( ) n
kl ku 1 + +
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
86 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/sgbmv.f
Funct ion cublasSgemv( )
void
cublasSgemv (char trans, int m, int n, float alpha,
const float *A, int lda, const float *x,
int incx, float beta, float *y, int incy)
alphaandbetaaresingleprecisionscalars,andxandyaresingle
precisionvectors.Aisanmnmatrixconsistingofsingleprecision
elements.MatrixAisstoredincolumnmajorformat,andldaisthe
leadingdimensionofthetwodimensionalarrayinwhichAisstored.
y
single-precision array of length at least
when trans == 'N' or 'n' and at least
otherwise. If beta is zero, y is not read.
incy
storage spacing between elements of y; incy must not be zero.
Out put
y
updated according to .
Error St at us
if m < 0, n < 0, kl < 0, ku < 0,
incx == 0, or incy == 0
I nput ( cont inued)
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
,
where or ,
I nput
trans
If trans == 'T', 't', 'C', or 'c', .
m
specifies the number of rows of matrix A; m must be at least zero.
n
specifies the number of columns of matrix A; n must be at least zero.
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
PG- 05326- 032_V02 87
NVI DI A
Reference:http://www.netlib.org/blas/sgemv.f
Funct ion cublasSger( )
void
cublasSger (int m, int n, float alpha, const float *x,
int incx, const float *y, int incy, float *A,
int lda)
performsthesymmetricrank1operation
alpha
single-precision scalar multiplier applied to op(A).
A
single-precision array of dimensions (lda, n) if trans == 'N' or
'n', of dimensions (lda, m) otherwise; lda must be at least
max(1, m) if trans == 'N' or 'n' and at least max(1, n) otherwise.
lda
leading dimension of two-dimensional array used to store matrix A.
x
single-precision array of length at least if
trans == 'N' or 'n', else at least .
incx
specifies the storage spacing for elements of x; incx must not be zero.
beta
is not read.
y
single-precision array of length at least if
incy
the storage spacing between elements of y; incy must not be zero.
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
, A = alpha * x * y
T
+ A
88 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
wherealphaisasingleprecisionscalar,xisanmelementsingle
precisionvector,yisannelementsingleprecisionvector,andAisan
mnmatrixconsistingofsingleprecisionelements.MatrixAisstored
incolumnmajorformat,andldaistheleadingdimensionofthetwo
dimensionalarrayusedtostoreA.
Reference:http://www.netlib.org/blas/sger.f
Funct ion cublasSsbmv( )
void
cublasSsbmv (char uplo, int n, int k, float alpha,
performsthematrixvectoroperation
I nput
m
specifies the number of rows of the matrix A; m must be at least zero.
n
alpha
single-precision scalar multiplier applied to .
x
single-precision array of length at least .
incx
the storage spacing between elements of x; incx must not be zero.
y
incy
A
single-precision array of dimensions (lda, n).
lda
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
x * y
T
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
T
A + * * =
, y alpha A x beta y * + * * =
PG- 05326- 032_V02 89
NVI DI A
wherealphaandbetaaresingleprecisionscalars,andxandyare
nelementsingleprecisionvectors.Aisannnsymmetricbandmatrix
consistingofsingleprecisionelements,withksuperdiagonalsandthe
samenumberofsubdiagonals.
I nput
uplo
specifies whether the upper or lower triangular part of the symmetric
band matrix A is being supplied. If uplo == 'U' or 'u', the upper
triangular part is being supplied. If uplo == 'L' or 'l', the lower
triangular part is being supplied.
n
specifies the number of rows and the number of columns of the
symmetric matrix A; n must be at least zero.
k
specifies the number of superdiagonals of matrix A. Since the matrix is
symmetric, this is also the number of subdiagonals; k must be at least
zero.
alpha
single-precision scalar multiplier applied to A * x.
A
single-precision array of dimensions (lda, n). When uplo == 'U' or
'u', the leading (k+1)n part of array A must contain the upper
triangular band of the symmetric matrix, supplied column by column,
with the leading diagonal of the matrix in row k+1 of the array, the
first superdiagonal starting at position 2 in row k, and so on. The top
left kk triangle of the array A is not referenced. When uplo == 'L'
or 'l', the leading (k+1)n part of the array A must contain the
lower triangular band part of the symmetric matrix, supplied column
by column, with the leading diagonal of the matrix in row 1 of the
array, the first subdiagonal starting at position 1 in row 2, and so on.
The bottom right kk triangle of the array A is not referenced.
lda
leading dimension of A; lda must be at least k+1.
x
incx
beta
is not read.
y
If beta is zero, y is not read.
incy
Out put
y
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
y alpha A x beta y * + * * =
90 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ssbmv.f
Funct ion cublasSspmv( )
void
cublasSspmv (char uplo, int n, float alpha,
const float *AP, const float *x, int incx,
float beta, float *y, int incy)
nelementsingleprecisionvectors.Aisasymmetricnnmatrixthat
consistsofsingleprecisionelementsandissuppliedinpackedform.
Error St at us
if k < 0, n < 0, incx == 0, or
incy == 0
,
I nput
uplo
specifies whether the matrix data is stored in the upper or the lower
triangular part of array AP. If uplo == 'U' or 'u', the upper triangular
part of A is supplied in AP. If uplo == 'L' or 'l', the lower triangular
part of A is supplied in AP.
n
the number of rows and columns of matrix A; n must be at least zero.
alpha
AP
single-precision array with at least elements. If
uplo == 'U' or 'u', array AP contains the upper triangular part of the
symmetric matrix A, packed sequentially, column by column; that is, if
i <= j, A[i,j] is stored in . If uplo == 'L'
or 'l', the array AP contains the lower triangular part of the
i >= j, A[i,j] is stored in .
x
incx
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
PG- 05326- 032_V02 91
NVI DI A
Reference:http://www.netlib.org/blas/sspmv.f
Funct ion cublasSspr( )
void
cublasSspr (char uplo, int n, float alpha,
const float *x, int incx, float *AP)
wherealphaisasingleprecisionscalar,andxisannelementsingle
precisionvector.Aisasymmetricnnmatrixthatconsistsofsingle
precisionelementsandissuppliedinpackedform.
beta
is not read.
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
1 n 1 ( ) abs incy ( ) * + ( )
,
I nput
uplo
n
alpha
x
A alpha x x
T
A + * * =
x x
T
*
1 n 1 ( ) abs incx ( ) * + ( )
92 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/sspr.f
Funct ion cublasSspr2( )
void
cublasSspr2 (char uplo, int n, float alpha,
const float *x, int incx, const float *y,
int incy, float *AP)
wherealphaisasingleprecisionscalar,andxandyarenelement
singleprecisionvectors.Aisasymmetricnnmatrixthatconsistsof
singleprecisionelementsandissuppliedinpackedform.
incx
AP
Out put
A
Error St at us
if n < 0 or incx == 0
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x x
T
A + * * =
, A alpha x y
T
alpha y x
T
A + * * + * * =
PG- 05326- 032_V02 93
NVI DI A
Reference:http://www.netlib.org/blas/sspr2.f
I nput
uplo
triangular part of array A. If uplo == 'U' or 'u', only the upper
triangular part of A may be referenced and the lower triangular part of
A is inferred. If uplo == 'L' or 'l', only the lower triangular part of
A may be referenced and the upper triangular part of A is inferred.
n
alpha
x
incx
y
incy
AP
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
x y
T
alpha y x
T
* * + *
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x y
T
alpha y x
T
A + * * + * * =
94 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSsymv( )
void
cublasSsymv (char uplo, int n, float alpha,
nelementsingleprecisionvectors.Aisasymmetricnnmatrixthat
consistsofsingleprecisionelementsandisstoredineitherupperor
lowerstoragemode.
,
I nput
uplo
specifies whether the upper or lower triangular part of the array A is
referenced. If uplo == 'U' or 'u', the symmetric matrix A is stored in
upper storage mode; that is, only the upper triangular part of A is
referenced while the lower triangular part of A is inferred. If uplo ==
'L' or 'l', the symmetric matrix A is stored in lower storage mode;
that is, only the lower triangular part of A is referenced while the upper
triangular part of A is inferred.
n
alpha
A
single-precision array of dimensions (lda, n). If uplo == 'U' or 'u',
the leading nn upper triangular part of the array A must contain the
upper triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced. If uplo == 'L' or 'l', the
leading nn lower triangular part of the array A must contain the lower
triangular part of the symmetric matrix, and the strictly upper
triangular part of A is not referenced.
lda
leading dimension of A; lda must be at least max(1, n).
x
incx
beta
single-precision scalar multiplier applied to vector y.
1 n 1 ( ) abs incx ( ) * + ( )
PG- 05326- 032_V02 95
NVI DI A
Reference:http://www.netlib.org/blas/ssymv.f
Funct ion cublasSsyr( )
void
cublasSsyr (char uplo, int n, float alpha,
const float *x, int incx, float *A, int lda)
wherealphaisasingleprecisionscalar,xisannelementsingle
precisionvector,andAisannnsymmetricmatrixconsistingofsingle
precisionelements.Aisstoredincolumnmajorformat,andldaisthe
leadingdimensionofthetwodimensionalarraycontainingA.
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
1 n 1 ( ) abs incy ( ) * + ( )
,
I nput
uplo
triangular part of A is referenced. If uplo == 'L' or 'l', only the
lower triangular part of A is referenced.
n
alpha
x
A alpha x x
T
A + * * =
x * x
T
1 n 1 ( ) abs incx ( ) * + ( )
96 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ssyr.f
Funct ion cublasSsyr2( )
void
cublasSsyr2 (char uplo, int n, float alpha,
const float *x, int incx, const float *y,
int incy, float *A, int lda)
wherealphaisasingleprecisionscalar,xandyarenelementsingle
precisionvectors,andAisannnsymmetricmatrixconsistingof
singleprecisionelements.
incx
A
A contains the upper triangular part of the symmetric matrix, and the
strictly lower triangular part is not referenced. If uplo == 'L' or 'l',
A contains the lower triangular part of the symmetric matrix, and the
strictly upper triangular part is not referenced.
lda
leading dimension of the two-dimensional array containing A;
lda must be at least max(1, n).
Out put
A
Error St at us
A alpha x x
T
A + * * =
, A alpha x y
T
alpha y x
T
A + * * + * * =
PG- 05326- 032_V02 97
NVI DI A
Reference:http://www.netlib.org/blas/ssyr2.f
I nput
uplo
triangular part of A is referenced and the lower triangular part of A is
inferred. If uplo == 'L' or 'l', only the lower triangular part of A is
referenced and the upper triangular part of A is inferred.
n
alpha
x
incx
y
incy
A
lda
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
x y
T
y x
T
* + *
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
T
alpha y x
T
A + * * + * * =
98 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSt bmv( )
void
cublasStbmv (char uplo, char trans, char diag, int n,
int k, const float *A, int lda, float *x,
int incx)
xisannelementsingleprecisionvector,andAisannn,unitornon
unit,upperorlower,triangularbandmatrixconsistingofsingle
precisionelements.
,
where or ,
I nput
uplo
specifies whether the matrix A is an upper or lower triangular band
matrix. If uplo == 'U' or 'u', A is an upper triangular band matrix. If
uplo == 'L' or 'l', A is a lower triangular band matrix.
trans specifies op(A). If trans == 'N' or 'n', .
If trans == 'T', 't', 'C', or 'c', .
diag
specifies whether or not matrix A is unit triangular. If diag == 'U' or
'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
not assumed to be unit triangular.
n
specifies the number of rows and columns of the matrix A; n must be
at least zero.
k
specifies the number of superdiagonals or subdiagonals. If uplo ==
'U' or 'u', k specifies the number of superdiagonals. If uplo == 'L'
or 'l' k specifies the number of subdiagonals; k must at least be zero.
A
single-precision array of dimension (lda, n). If uplo == 'U' or 'u',
the leading (k+1)n part of the array A must contain the upper
triangular band matrix, supplied column by column, with the leading
diagonal of the matrix in row k+1 of the array, the first superdiagonal
starting at position 2 in row k, and so on. The top left kk triangle of
the array A is not referenced. If uplo == 'L' or 'l', the leading
(k+1)n part of the array A must contain the lower triangular band
matrix, supplied column by column, with the leading diagonal of the
matrix in row 1 of the array, the first subdiagonal starting at position 1
in row 2, and so on. The bottom right kk triangle of the array is not
referenced.
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
PG- 05326- 032_V02 99
NVI DI A
Reference:http://www.netlib.org/blas/stbmv.f
Funct ion cublasSt bsv( )
void
cublasStbsv (char uplo, char trans, char diag, int n,
int k, const float *A, int lda, float *X,
int incx)
solvesoneofthesystemsofequations
bandxarenelementvectors,andAisannn,unitornonunit,upper
orlower,triangularbandmatrixwithk+1diagonals.
Notestforsingularityornearsingularityisincludedinthisfunction.
Suchtestsmustbeperformedbeforecallingthisfunction.
lda
is the leading dimension of A; lda must be at least k+1.
x
On entry, x contains the source vector. On exit, x is overwritten with
the result vector.
incx
Out put
x
Error St at us
if n < 0, k<0, or incx == 0
if function cannot allocate enough
internal scratch vector memory
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
,
where or ,
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
100 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/stbsv.f
I nput
uplo
specifies whether the matrix is an upper or lower triangular band
matrix: If uplo == 'U' or 'u', A is an upper triangular band matrix. If
trans
If trans == 'T', 't', 'C', or 'c', .
diag
specifies whether A is unit triangular. If diag == 'U' or 'u', A is
assumed to be unit triangular; that is, diagonal elements are not read
and are assumed to be unity. If diag == 'N' or 'n', A is not assumed
to be unit triangular.
n
k
specifies the number of superdiagonals or subdiagonals.
If uplo == 'U' or 'u', k specifies the number of superdiagonals. If
uplo == 'L' or 'l', k specifies the number of subdiagonals; k must
be at least zero.
A
single-precision array of dimension (lda, n). If uplo == 'U' or 'u',
the leading (k+1)n part of the array A must contain the upper
matrix in row 1 of the array, the first sub-diagonal starting at position
1 in row 2, and so on. The bottom right kk triangle of the array is not
referenced.
x
On entry, x contains the n-element right-hand side vector b. On exit, it
is overwritten with the solution vector x.
incx
Out put
x
updated to contain the solution vector x that solves .
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
PG- 05326- 032_V02 101
NVI DI A
Funct ion cublasSt pmv( )
void
cublasStpmv (char uplo, char trans, char diag, int n,
const float *AP, float *x, int incx)
unit,upperorlower,triangularmatrixconsistingofsingleprecision
elements.
Error St at us
if incx == 0 or n< 0
,
where or ,
I nput
uplo
specifies whether the matrix A is an upper or lower triangular matrix.
If uplo == 'U' or 'u', A is an upper triangular matrix.
If uplo == 'L' or 'l', A is a lower triangular matrix.
trans
If trans == 'T', 't', 'C', or 'c', .
diag
specifies whether or not matrix A is unit triangular.
If diag == 'U' or 'u', A is assumed to be unit triangular.
If diag == 'N' or 'n', A is not assumed to be unit triangular.
n
at least zero.
AP
uplo == 'U' or 'u', the array AP contains the upper triangular part of
the symmetric matrix A, packed sequentially, column by column; that
is, if i <= j, A[i,j] is stored in . If uplo ==
'L' or 'l', array AP contains the lower triangular part of the
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
102 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/stpmv.f
Funct ion cublasSt psv( )
void
cublasStpsv (char uplo, char trans, char diag, int n,
const float *AP, float *X, int incx)
bandxarenelementsingleprecisionvectors,andAisannn,unitor
nonunit,upperorlower,triangularmatrix.
x
the result vector.
incx
Out put
x
Error St at us
if incx == 0 or n < 0
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
,
where or ,
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 103
NVI DI A
Reference:http://www.netlib.org/blas/stpsv.f
I nput
uplo
specifies whether the matrix is an upper or lower triangular matrix. If
uplo == 'U' or 'u', A is an upper triangular matrix. If uplo == 'L'
or 'l', A is a lower triangular matrix.
trans
If trans == 'T', 't', 'C', or 'c', .
diag
n
at least zero.
AP
uplo == 'U' or 'u', array AP contains the upper triangular matrix A,
packed sequentially, column by column; that is, if i <= j, A[i,j] is
stored in . If uplo == 'L' or 'l', array AP
contains the lower triangular matrix A, packed sequentially, column by
column; that is, if i >= j, A[i,j] is stored in
. When diag == 'U' or 'u', the
diagonal elements of A are not referenced and are assumed to be unity.
x
On entry, x contains the n-element right-hand side vector b. On exit, it
is overwritten with the solution vector x.
incx
Out put
x
Error St at us
op A ( ) A =
op A ( ) A
T
=
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
104 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasSt rmv( )
void
cublasStrmv (char uplo, char trans, char diag, int n,
const float *A, int lda, float *x, int incx)
unit,upperorlower,triangularmatrixconsistingofsingleprecision
elements.
,
where or ,
I nput
uplo
If uplo == 'L' or 'l', A is an lower triangular matrix.
trans
If trans == 'T', 't', 'C', or 'c', .
diag
specifies whether or not A is a unit triangular matrix. If diag == 'U'
or 'u', A is assumed to be unit triangular. If diag == 'N' or 'n', A is
n
at least zero.
A
the leading nn upper triangular part of the array A must contain the
upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading nn lower
triangular part of the array A must contain the lower triangular matrix,
and the strictly upper triangular part of A is not referenced. When
diag == 'U' or 'u', the diagonal elements of A are not referenced
either, but are assumed to be unity.
lda
x
the result vector.
incx
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
PG- 05326- 032_V02 105
NVI DI A
Reference:http://www.netlib.org/blas/strmv.f
Funct ion cublasSt rsv( )
void
cublasStrsv (char uplo, char trans, char diag, int n,
const float *A, int lda, float *x, int incx)
solvesasystemofequations
bandxarenelementsingleprecisionvectors,andAisannn,unitor
nonunit,upperorlower,triangularmatrixconsistingofsingle
precisionelements.MatrixAisstoredincolumnmajorformat,and
ldaistheleadingdimensionofthetwodimensionalarray
containingA.
Out put
x
Error St at us
x op A ( ) x * =
,
where or ,
I nput
uplo
triangular part of A may be referenced. If uplo == 'L' or 'l', only
the lower triangular part of A may be referenced.
trans
If trans == 'T', 't', 'C', or 'c', .
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
106 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/strsv.f
diag
specifies whether or not A is a unit triangular matrix.
n
at least zero.
A
lda
x
On entry, x contains the n-element, right-hand-side vector b. On exit,
it is overwritten with the solution vector x.
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
PG- 05326- 032_V02 107
NVI DI A
ThetwosingleprecisioncomplexBLAS2functionsareasfollows:
FunctioncublasCgbmv()onpage 108
FunctioncublasCgemv()onpage 109
FunctioncublasCgerc()onpage 111
FunctioncublasCgeru()onpage 112
FunctioncublasChbmv()onpage 113
FunctioncublasChemv()onpage 115
FunctioncublasCher()onpage 116
FunctioncublasCher2()onpage 117
FunctioncublasChpmv()onpage 119
FunctioncublasChpr()onpage 120
FunctioncublasChpr2()onpage 121
FunctioncublasCtbmv()onpage 123
FunctioncublasCtbsv()onpage 125
FunctioncublasCtpmv()onpage 126
FunctioncublasCtpsv()onpage 128
FunctioncublasCtrmv()onpage 129
FunctioncublasCtrsv()onpage 131
108 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCgbmv( )
void
cublasCgbmv (char trans, int m, int n, int kl, int ku,
cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *x, int incx,
cuComplex beta, cuComplex *y, int incy)
alphaandbetaaresingleprecisioncomplexscalars,andxandyare
singleprecisioncomplexvectors.Aisan mnbandmatrixconsisting
ofsingleprecisioncomplexelementswithklsubdiagonalsandku
superdiagonals.
,where
, ,or ;
I nput
trans
If trans == 'T' or 't', .
If trans == 'C', or 'c', .
m
n
kl
specifies the number of subdiagonals of matrix A; kl must be at least
zero.
ku
specifies the number of superdiagonals of matrix A; ku must be at
least zero.
alpha
single-precision complex scalar multiplier applied to op(A).
A
single-precision complex array of dimensions (lda, n). The leading
(kl+ku+1)n part of the array A must contain the band matrix A,
row (ku+1) of the array, the first superdiagonal starting at position 2 in
row ku, the first subdiagonal starting at position 1 in row (ku+2), and
so on. Elements in the array A that do not correspond to elements in
the band matrix (such as the top left kuku triangle) are not
referenced.
lda
leading dimension A; lda must be at least (kl+ku+1).
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
PG- 05326- 032_V02 109
NVI DI A
Reference:http://www.netlib.org/blas/cgbmv.f
Funct ion cublasCgemv( )
void
cublasCgemv (char trans, int m, int n,
int lda, const cuComplex *x, int incx,
x
single-precision complex array of length at least
if trans == 'N' or 'n', else at least
.
incx
specifies the increment for the elements of x; incx must not be zero.
beta
single-precision complex scalar multiplier applied to vector y. If beta
is zero, y is not read.
y
. If beta is zero, y is not read.
incy
on entry, incy specifies the increment for the elements of y; incy
must not be zero.
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
,
where , ,or ;
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
110 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
alphaandbetaaresingleprecisioncomplexscalars;andxandyare
singleprecisioncomplexvectors.Aisanmnmatrixconsistingof
singleprecisioncomplexelements.MatrixAisstoredincolumnmajor
format,andldaistheleadingdimensionofthetwodimensionalarray
inwhichAisstored.
Reference:http://www.netlib.org/blas/cgemv.f
I nput
trans
If trans == 'C' or 'c', .
m
n
alpha
single-precision complex scalar multiplier applied to op(A).
A
single-precision complex array of dimensions (lda, n) if trans ==
'N' or 'n', of dimensions (lda, m) otherwise; lda must be at least
lda
x
.
incx
beta
single-precision complex scalar multiplier applied to vector y. If beta
y
.
incy
Out put
y
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 111
NVI DI A
Funct ion cublasCgerc( )
void
cublasCgerc (int m, int n, cuComplex alpha,
const cuComplex *x, int incx,
const cuComplex *y, int incy,
cuComplex *A, int lda)
wherealphaisasingleprecisioncomplexscalar,xisanmelement
singleprecisioncomplexvector,yisannelementsingleprecision
complexvector,andAisanmnmatrixconsistingofsingleprecision
complexelements.MatrixAisstoredincolumnmajorformat,andlda
istheleadingdimensionofthetwodimensionalarrayusedtostoreA.
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
,
I nput
m
n
alpha
single-precision complex scalar multiplier applied to .
x
.
incx
y
.
incy
A
single-precision complex array of dimensions (lda, n).
lda
A = alpha * x * y
H
+ A
x * y
H
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
112 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/cgerc.f
Funct ion cublasCgeru( )
void
cublasCgeru (int m, int n, cuComplex alpha,
wherealphaisasingleprecisioncomplexscalar,xisanmelement
singleprecisioncomplexvector,yisannelementsingleprecision
complexvector,andAisanmnmatrixconsistingofsingleprecision
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
A alpha x y
H
A + * * =
,
I nput
m
n
alpha
x
.
incx
y
.
A = alpha * x * y
T
+ A
x * y
T
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 113
NVI DI A
Reference:http://www.netlib.org/blas/cgeru.f
Funct ion cublasChbmv( )
void
cublasChbmv (char uplo, int n, int k, cuComplex alpha,
const cuComplex *A, int lda,
wherealphaandbetaaresingleprecisioncomplexscalars,andxand
yarenelementsingleprecisioncomplexvectors.AisaHermitiannn
bandmatrixthatconsistsofsingleprecisioncomplexelements,withk
superdiagonalsandthesamenumberofsubdiagonals.
incy
A
single-precision complex array of dimensions (lda, n).
lda
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
A alpha x y
T
A + * * =
,
I nput
uplo
specifies whether the upper or lower triangular part of the Hermitian
n
114 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/chbmv.f
k
Hermitian, this is also the number of subdiagonals; k must be at least
zero.
alpha
single-precision complex scalar multiplier applied to A * x.
A
single-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading (k + 1)n part of array A must contain the
upper triangular band of the Hermitian matrix, supplied column by
column, with the leading diagonal of the matrix in row k + 1 of the
array, the first superdiagonal starting at position 2 in row k, and so on.
The top left kk triangle of array A is not referenced. When uplo ==
'L' or 'l', the leading (k + 1)n part of array A must contain the
lower triangular band part of the Hermitian matrix, supplied column
The bottom right kk triangle of array A is not referenced. The
imaginary parts of the diagonal elements need not be set; they are
assumed to be zero.
lda
leading dimension of A; lda must be at least k + 1.
x
.
incx
beta
single-precision complex scalar multiplier applied to vector y.
y
incy
Out put
y
Error St at us
if k < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 115
NVI DI A
Funct ion cublasChemv( )
void
cublasChemv (char uplo, int n, cuComplex alpha,
matrixthatconsistsofsingleprecisioncomplexelementsandisstored
ineitherupperorlowerstoragemode.
,
I nput
uplo
referenced. If uplo == 'U' or 'u', the Hermitian matrix A is stored in
'L' or 'l', the Hermitian matrix A is stored in lower storage mode;
n
alpha
A
'U' or 'u', the leading nn upper triangular part of the array A must
contain the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced. If uplo == 'L' or
'l', the leading nn lower triangular part of the array A must contain
the lower triangular part of the Hermitian matrix, and the strictly
upper triangular part of A is not referenced. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero.
lda
x
.
incx
1 n 1 ( ) abs incx ( ) * + ( )
116 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/chemv.f
Funct ion cublasCher( )
void
cublasCher (char uplo, int n, float alpha,
const cuComplex *x, int incx, cuComplex *A,
int lda)
performstheHermitianrank1operation
precisioncomplexvector,andAisannnHermitianmatrixconsisting
ofsingleprecisioncomplexelements.Aisstoredincolumnmajor
containingA.
beta
single-precision complex scalar multiplier applied to vector y.
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
1 n 1 ( ) abs incy ( ) * + ( )
,
I nput
uplo
n
A alpha x x
H
A + * * =
PG- 05326- 032_V02 117
NVI DI A
Reference:http://www.netlib.org/blas/cher.f
Funct ion cublasCher2( )
void
cublasCher2 (char uplo, int n, cuComplex alpha,
alpha
single-precision scalar multiplier applied to
.
x
.
incx
A
'U' or 'u', A contains the upper triangular part of the Hermitian
matrix, and the strictly lower triangular part is not referenced. If uplo
== 'L' or 'l', A contains the lower triangular part of the Hermitian
matrix, and the strictly upper triangular part is not referenced. The
imaginary parts of the diagonal elements need not be set, they are
assumed to be zero, and on exit they are set to zero.
lda
Out put
A
updated according to
.
Error St at us
x * x
H
1 n 1 ( ) abs incx ( ) * + ( )
A alpha x x
H
A + * * =
, A alpha x y
H
alpha y x
H
A + * * + * * =
118 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
wherealphaisasingleprecisioncomplexscalar,xandyaren
elementsingleprecisioncomplexvectors,andAisannnHermitian
matrixconsistingofsingleprecisioncomplexelements.
Reference:http://www.netlib.org/blas/cher2.f
I nput
uplo
n
alpha
single-precision complex scalar multiplier applied to
and whose conjugate is applied to .
x
incx
y
incy
A
lda
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
x y
H
* y x
H
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
H
alpha y x
H
A + * * + * * =
PG- 05326- 032_V02 119
NVI DI A
Funct ion cublasChpmv( )
void
cublasChpmv (char uplo, int n, cuComplex alpha,
const cuComplex *AP, const cuComplex *x,
int incx, cuComplex beta, cuComplex *y,
int incy)
matrixthatconsistsofsingleprecisioncomplexelementsandis
suppliedinpackedform.
,
I nput
uplo
n
alpha
AP
single-precision complex array with at least elements.
If uplo == 'U' or 'u', array AP contains the upper triangular part of
the Hermitian matrix A, packed sequentially, column by column; that
'L' or 'l', the array AP contains the lower triangular part of the
Hermitian matrix A, packed sequentially, column by column; that is, if
i >= j, A[i,j] is stored in . The
assumed to be zero.
x
.
incx
beta
single-precision scalar multiplier applied to vector y.
y
incy
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
120 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/chpmv.f
Funct ion cublasChpr( )
void
cublasChpr (char uplo, int n, float alpha,
const cuComplex *x, int incx, cuComplex *AP)
ofsingleprecisioncomplexelementsthatissuppliedinpackedform.
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
,
I nput
uplo
n
alpha
x
.
A alpha x x
H
A + * * =
x * x
H
1 n 1 ( ) abs incx ( ) * + ( )
PG- 05326- 032_V02 121
NVI DI A
Reference:http://www.netlib.org/blas/chpr.f
Funct ion cublasChpr2( )
void
cublasChpr2 (char uplo, int n, cuComplex alpha,
const cuComplex *y, int incy, cuComplex *AP)
wherealphaisasingleprecisioncomplexscalar,xandyaren
elementsingleprecisioncomplexvectors,andAisannnHermitian
incx
AP
Out put
A
Error St at us
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x x
H
A + * * =
, A alpha x y
H
alpha y x
H
A + * * + * * =
122 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
matrixconsistingofsingleprecisioncomplexelementsthatissupplied
inpackedform.
Reference:http://www.netlib.org/blas/chpr2.f
I nput
uplo
n
alpha
single-precision complex scalar multiplier applied to
x
.
incx
y
.
incy
AP
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
x y
H
* y x
H
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x y
H
alpha y x
H
A + * * + * * =
PG- 05326- 032_V02 123
NVI DI A
Funct ion cublasCt bmv( )
void
cublasCtbmv (char uplo, char trans, char diag, int n,
int k, const cuComplex *A, int lda,
cuComplex *x, int incx)
xisannelementsingleprecisioncomplexvector,andAisannn,unit
ornonunit,upperorlower,triangularbandmatrixconsistingof
singleprecisioncomplexelements.
,
where , ,or ;
I nput
uplo
diag
n
at least zero.
k
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
124 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ctbmv.f
A
single-precision complex array of dimension (lda, n). If uplo ==
'U' or 'u', the leading (k+1)n part of the array A must contain the
upper triangular band matrix, supplied column by column, with the
leading diagonal of the matrix in row k+1 of the array, the first
superdiagonal starting at position 2 in row k, and so on. The top left
kk triangle of the array A is not referenced. If uplo == 'L' or 'l',
the leading (k+1)n part of the array A must contain the lower
diagonal of the matrix in row 1 of the array, the first subdiagonal
starting at position 1 in row 2, and so on. The bottom right kk
triangle of the array is not referenced.
lda
x
.
the result vector.
incx
Out put
x
Error St at us
if incx == 0, k<0, or n < 0
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
PG- 05326- 032_V02 125
NVI DI A
Funct ion cublasCt bsv( )
void
cublasCtbsv (char uplo, char trans, char diag, int n,
int k, const cuComplex *A, int lda,
cuComplex *X, int incx)
,
where , ,or ;
I nput
uplo
trans
diag
n
k
be at least zero.
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
126 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ctbsv.f
Funct ion cublasCt pmv( )
void
cublasCtpmv (char uplo, char trans, char diag, int n,
const cuComplex *AP, cuComplex *x, int incx)
A
single-precision complex array of dimension (lda, n). If uplo ==
diagonal of the matrix in row 1 of the array, the first sub-diagonal
x
.
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
,
where , ,or ;
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
PG- 05326- 032_V02 127
NVI DI A
xisannelementsingleprecisioncomplexvector,andAisannn,unit
ornonunit,upperorlower,triangularmatrixconsistingofsingle
precisioncomplexelements.
Reference:http://www.netlib.org/blas/ctpmv.f
I nput
uplo
trans
diag
n
at least zero.
AP
If uplo == 'U' or 'u', the array AP contains the upper triangular part
of the symmetric matrix A, packed sequentially, column by column;
that is, if i <= j, A[i,j] is stored in . If
uplo == 'L' or 'l', array AP contains the lower triangular part of the
x
. On entry, x contains the source vector.
On exit, x is overwritten with the result vector.
incx
Out put
x
Error St at us
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
128 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCt psv( )
void
cublasCtpsv (char uplo, char trans, char diag, int n,
const cuComplex *AP, cuComplex *X, int incx)
bandxarenelementcomplexvectors,andAisannn,unitornon
unit,upperorlower,triangularmatrix.
Error St at us ( cont inued)
,
where , ,or ;
I nput
uplo
trans
diag
n
at least zero.
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
PG- 05326- 032_V02 129
NVI DI A
Reference:http://www.netlib.org/blas/ctpsv.f
Funct ion cublasCt rmv( )
void
cublasCtrmv (char uplo, char trans, char diag, int n,
const cuComplex *A, int lda, cuComplex *x,
int incx)
xisannelementsingleprecisioncomplexvector;andAisannn,unit
ornonunit,upperorlower,triangularmatrixconsistingofsingle
AP
If uplo == 'U' or 'u', array AP contains the upper triangular matrix
A, packed sequentially, column by column; that is, if i <= j, A[i,j] is
x
.
incx
Out put
x
Error St at us
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
,
where , ,or ;
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
130 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ctrmv.f
I nput
uplo
trans
diag
n
at least zero.
A
contain the upper triangular matrix, and the strictly lower triangular
part of A is not referenced. If uplo == 'L' or 'l', the leading nn
lower triangular part of the array A must contain the lower triangular
matrix, and the strictly upper triangular part of A is not referenced.
When diag == 'U' or 'u', the diagonal elements of A are not
referenced either, but are assumed to be unity.
lda
x
incx
Out put
x
Error St at us
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
PG- 05326- 032_V02 131
NVI DI A
Funct ion cublasCt rsv( )
void
cublasCtrsv (char uplo, char trans, char diag, int n,
const cuComplex *A, int lda, cuComplex *x,
int incx)
bandxarenelementsingleprecisioncomplexvectors,andAisan
nn,unitornonunit,upperorlower,triangularmatrixconsistingof
singleprecisionelements.MatrixAisstoredincolumnmajorformat,
andldaistheleadingdimensionofthetwodimensionalarray
containingA.
,
where , ,or ;
I nput
uplo
trans
diag
n
at least zero.
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
132 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ctrsv.f
A
'U' or 'u', A contains the upper triangular part of the symmetric
== 'L' or 'l', A contains the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced.
lda
x
. On entry, x contains the n-element, right-
hand-side vector b. On exit, it is overwritten with solution vector x.
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
PG- 05326- 032_V02 133
NVI DI A
CH A P T E R
4
Doubl e- Pr eci si on BLAS2
Funct i ons
TheLevel2BasicLinearAlgebraSubprograms(BLAS2)arefunctions
thatperformmatrixvectoroperations.TheCUBLASimplementations
ofdoubleprecisionBLAS2functionsaredescribedinthesesections:
DoublePrecisionComplexBLAS2functionsonpage 158
134 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
precisionhardware.
ThedoubleprecisionBLAS2functionsareasfollows:
FunctioncublasDgbmv()onpage 135
FunctioncublasDgemv()onpage 136
FunctioncublasDger()onpage 138
FunctioncublasDsbmv()onpage 139
FunctioncublasDspmv()onpage 141
FunctioncublasDspr()onpage 142
FunctioncublasDspr2()onpage 143
FunctioncublasDsymv()onpage 144
FunctioncublasDsyr()onpage 146
FunctioncublasDsyr2()onpage 147
FunctioncublasDtbmv()onpage 148
FunctioncublasDtbsv()onpage 150
FunctioncublasDtpmv()onpage 152
FunctioncublasDtpsv()onpage 153
FunctioncublasDtrmv()onpage 154
FunctioncublasDtrsv()onpage 156
PG- 05326- 032_V02 135
NVI DI A
CHAPTER 4 Double- Precision BLAS2 Funct ions
Funct ion cublasDgbmv( )
void
cublasDgbmv (char trans, int m, int n, int kl, int ku,
double alpha, const double *A, int lda,
const double *x, int incx, double beta,
double *y, int incy)
alphaandbetaaredoubleprecisionscalars,andxandyaredouble
precisionvectors.Aisan mnbandmatrixconsistingofdouble
precisionelementswithklsubdiagonalsandkusuperdiagonals.
,
where or ,
I nput
trans
If trans == 'T', 't', 'C', or 'c', .
m
n
kl
zero.
ku
least zero.
alpha
double-precision scalar multiplier applied to op(A).
A
double-precision array of dimensions (lda, n). The leading
referenced.
lda
x
double-precision array of length at least if
incx
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
136 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dgbmv.f
Funct ion cublasDgemv( )
void
cublasDgemv (char trans, int m, int n, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)
alphaandbetaaredoubleprecisionscalars,andxandyaredouble
precisionvectors.Aisan mnmatrixconsistingofdoubleprecision
beta
double-precision scalar multiplier applied to vector y. If beta is zero,
y is not read.
y
trans == 'N' or 'n', else at least . If
beta is zero, y is not read.
incy
must not be zero.
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
,
where or ,
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 137
NVI DI A
elements.MatrixAisstoredincolumnmajorformat,andldaisthe
leadingdimensionofthetwodimensionalarrayinwhichAisstored.
Reference:http://www.netlib.org/blas/dgemv.f
I nput
trans
If trans == 'T', 't', 'C', or 'c', .
m
n
alpha
double-precision scalar multiplier applied to op(A).
A
double-precision array of dimensions (lda, n) if trans == 'N' or
'n', of dimensions (lda, m) otherwise; lda must be at least
lda
x
incx
beta
double-precision scalar multiplier applied to vector y. If beta is zero,
y is not read.
y
incy
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
138 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDger( )
void
cublasDger (int m, int n, double alpha, const double *x,
int incx, const double *y, int incy,
double *A, int lda)
wherealphaisadoubleprecisionscalar,xisanmelementdouble
precisionvector,yisannelementdoubleprecisionvector,andAisan
mnmatrixconsistingofdoubleprecisionelements.MatrixAisstored
incolumnmajorformat,andldaistheleadingdimensionofthetwo
dimensionalarrayusedtostoreA.
Reference:http://www.netlib.org/blas/dger.f
,
I nput
m
n
alpha
double-precision scalar multiplier applied to .
x
double-precision array of length at least .
incx
y
incy
A
double-precision array of dimensions (lda, n).
lda
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
A = alpha * x * y
T
+ A
x * y
T
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
T
A + * * =
PG- 05326- 032_V02 139
NVI DI A
Funct ion cublasDsbmv( )
void
cublasDsbmv (char uplo, int n, int k, double alpha,
wherealphaandbetaaredoubleprecisionscalars,andxandyare
nelementdoubleprecisionvectors.Aisannnsymmetricbandmatrix
consistingofdoubleprecisionelements,withksuperdiagonalsand
thesamenumberofsubdiagonals.
,
I nput
uplo
specifies whether the upper or lower triangular part of the symmetric
n
k
symmetric, this is also the number of subdiagonals; k must be at least
zero.
alpha
double-precision scalar multiplier applied to A * x.
140 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dsbmv.f
A
double-precision array of dimensions (lda, n). When uplo == 'U'
or 'u', the leading (k+1)n part of array A must contain the upper
triangular band of the symmetric matrix, supplied column by column,
with the leading diagonal of the matrix in row k+1 of the array, the
first superdiagonal starting at position 2 in row k, and so on. The top
left kk triangle of the array A is not referenced. When uplo == 'L'
or 'l', the leading (k+1)n part of the array A must contain the
lower triangular band part of the symmetric matrix, supplied column
The bottom right kk triangle of the array A is not referenced.
lda
leading dimension of A; lda must be at least k+1.
x
incx
beta
double-precision scalar multiplier applied to vector y.
y
incy
Out put
y
Error St at us
if k < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 141
NVI DI A
Funct ion cublasDspmv( )
void
cublasDspmv (char uplo, int n, double alpha,
const double *AP, const double *x, int incx,
double beta, double *y, int incy)
nelementdoubleprecisionvectors.Aisasymmetricnnmatrixthat
consistsofdoubleprecisionelementsandissuppliedinpackedform.
Reference:http://www.netlib.org/blas/dspmv.f
,
I nput
uplo
n
alpha
AP
double-precision array with at least elements. If
x
incx
beta
y
incy
Out put
y
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
142 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDspr( )
void
cublasDspr (char uplo, int n, double alpha,
const double *x, int incx, double *AP)
wherealphaisadoubleprecisionscalar,andxisannelement
doubleprecisionvector.Aisasymmetricnnmatrixthatconsistsof
doubleprecisionelementsandissuppliedinpackedform.
Error St at us
if n < 0, incx == 0, or incy == 0
,
I nput
uplo
n
alpha
x
incx
AP
A alpha x x
T
A + * * =
x x
T
*
1 n 1 ( ) abs incx ( ) * + ( )
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
PG- 05326- 032_V02 143
NVI DI A
Reference:http://www.netlib.org/blas/dspr.f
Funct ion cublasDspr2( )
void
cublasDspr2 (char uplo, int n, double alpha,
const double *x, int incx, const double *y,
int incy, double *AP)
wherealphaisadoubleprecisionscalar,andxandyarenelement
doubleprecisionvectors.Aisasymmetricnnmatrixthatconsistsof
doubleprecisionelementsandissuppliedinpackedform.
Out put
A
Error St at us
A alpha x x
T
A + * * =
,
I nput
uplo
n
alpha
double-precision scalar multiplier applied to and .
x
incx
y
A alpha x y
T
alpha y x
T
A + * * + * * =
x y
T
* y x
T
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
144 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dspr2.f
Funct ion cublasDsymv( )
void
cublasDsymv (char uplo, int n, double alpha,
nelementdoubleprecisionvectors.Aisasymmetricnnmatrixthat
incy
AP
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x y
T
alpha y x
T
A + * * + * * =
, y alpha A x beta y * + * * =
PG- 05326- 032_V02 145
NVI DI A
consistsofdoubleprecisionelementsandisstoredineitherupperor
lowerstoragemode.
Reference:http://www.netlib.org/blas/dsymv.f
I nput
uplo
referenced. If uplo == 'U' or 'u', the symmetric matrix A is stored in
'L' or 'l', the symmetric matrix A is stored in lower storage mode;
n
alpha
A
double-precision array of dimensions (lda, n). If uplo == 'U' or
'u', the leading nn upper triangular part of the array A must contain
the upper triangular part of the symmetric matrix, and the strictly
lower triangular part of A is not referenced. If uplo == 'L' or 'l',
the leading nn lower triangular part of the array A must contain the
lower triangular part of the symmetric matrix, and the strictly upper
triangular part of A is not referenced.
lda
x
incx
beta
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
146 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDsyr( )
void
cublasDsyr (char uplo, int n, double alpha,
const double *x, int incx, double *A,
int lda)
wherealphaisadoubleprecisionscalar,xisannelementdouble
precisionvector,andAisannnsymmetricmatrixconsistingof
doubleprecisionelements.Aisstoredincolumnmajorformat,and
containingA.
,
I nput
uplo
n
alpha
x
incx
A
'u', A contains the upper triangular part of the symmetric matrix, and
the strictly lower triangular part is not referenced. If uplo == 'L' or
'l', A contains the lower triangular part of the symmetric matrix, and
the strictly upper triangular part is not referenced.
lda
A alpha x x
T
A + * * =
x * x
T
1 n 1 ( ) abs incx ( ) * + ( )
PG- 05326- 032_V02 147
NVI DI A
Reference:http://www.netlib.org/blas/dsyr.f
Funct ion cublasDsyr2( )
void
cublasDsyr2 (char uplo, int n, double alpha,
const double *x, int incx, const double *y,
int incy, double *A, int lda)
wherealphaisadoubleprecisionscalar,xandyarenelement
doubleprecisionvectors,andAisannnsymmetricmatrixconsisting
ofdoubleprecisionelements.
Out put
A
Error St at us
A alpha x x
T
A + * * =
,
I nput
uplo
triangular part of A is referenced and the lower triangular part of A is
inferred. If uplo == 'L' or 'l', only the lower triangular part of A is
referenced and the upper triangular part of A is inferred.
n
alpha
double-precision scalar multiplier applied to and .
x
incx
y
A alpha x y
T
alpha y x
T
A + * * + * * =
x y
T
* y x
T
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
148 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dsyr2.f
Funct ion cublasDt bmv( )
void
cublasDtbmv (char uplo, char trans, char diag, int n,
int k, const double *A, int lda, double *x,
int incx)
xisannelementdoubleprecisionvector,andAisannn,unitornon
unit,upperorlower,triangularbandmatrixconsistingofdouble
precisionelements.
incy
A
lda
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
A alpha x y
T
alpha y x
T
A + * * + * * =
,
where or ,
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 149
NVI DI A
Reference:http://www.netlib.org/blas/dtbmv.f
I nput
uplo
If trans == 'T', 't', 'C', or 'c', .
diag
n
at least zero.
k
A
double-precision array of dimension (lda, n). If uplo == 'U' or
'u', the leading (k+1)n part of the array A must contain the upper
matrix in row 1 of the array, the first subdiagonal starting at position 1
in row 2, and so on. The bottom right kk triangle of the array is not
referenced.
lda
x
the result vector.
incx
Out put
x
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
150 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDt bsv( )
void
cublasDtbsv (char uplo, char trans, char diag, int n,
int k, const double *A, int lda, double *X,
int incx)
Error St at us
if incx == 0, k<0, or n < 0
,
where or ,
I nput
uplo
trans
If trans == 'T', 't', 'C', or 'c', .
diag
n
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
PG- 05326- 032_V02 151
NVI DI A
Reference:http://www.netlib.org/blas/dtbsv.f
k
be at least zero.
A
double-precision array of dimension (lda, n). If uplo == 'U' or
'u', the leading (k+1)n part of the array A must contain the upper
matrix in row 1 of the array, the first sub-diagonal starting at position
1 in row 2, and so on. The bottom right kk triangle of the array is not
referenced.
x
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
152 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDt pmv( )
void
cublasDtpmv (char uplo, char trans, char diag, int n,
const double *AP, double *x, int incx)
unit,upperorlower,triangularmatrixconsistingofdoubleprecision
elements.
,
where or ,
I nput
uplo
trans
If trans == 'T', 't', 'C', or 'c', .
diag
n
at least zero.
AP
uplo == 'U' or 'u', the array AP contains the upper triangular part of
the symmetric matrix A, packed sequentially, column by column; that
'L' or 'l', array AP contains the lower triangular part of the
x
the result vector.
incx
Out put
x
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
PG- 05326- 032_V02 153
NVI DI A
Reference:http://www.netlib.org/blas/dtpmv.f
Funct ion cublasDt psv( )
void
cublasDtpsv (char uplo, char trans, char diag, int n,
const double *AP, double *X, int incx)
orlower,triangularmatrix.
Error St at us
,
where or ,
I nput
uplo
trans
If trans == 'T', 't', 'C', or 'c', .
diag
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
154 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dtpsv.f
Funct ion cublasDt rmv( )
void
cublasDtrmv (char uplo, char trans, char diag, int n,
const double *A, int lda, double *x,
int incx)
n
at least zero.
AP
uplo == 'U' or 'u', array AP contains the upper triangular matrix A,
packed sequentially, column by column; that is, if i <= j, A[i,j] is
x
incx
Out put
x
Error St at us
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
,
where or ,
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 155
NVI DI A
unit,upperorlower,triangularmatrixconsistingofdoubleprecision
elements.
Reference:http://www.netlib.org/blas/dtrmv.f
I nput
uplo
trans
If trans == 'T', 't', 'C', or 'c', .
diag
n
at least zero.
A
'u', the leading nn upper triangular part of the array A must contain
the upper triangular matrix, and the strictly lower triangular part of A is
not referenced. If uplo == 'L' or 'l', the leading nn lower
either, but are assumed to be unity.
lda
x
the result vector.
incx
Out put
x
Error St at us
op A ( ) A =
op A ( ) A
T
=
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
156 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasDt rsv( )
void
cublasDtrsv (char uplo, char trans, char diag, int n,
const double *A, int lda, double *x,
int incx)
bandxarenelementdoubleprecisionvectors,andAisannn,unitor
nonunit,upperorlower,triangularmatrixconsistingofdouble
precisionelements.MatrixAisstoredincolumnmajorformat,and
containingA.
,
where or ,
I nput
uplo
trans
If trans == 'T', 't', 'C', or 'c', .
diag
n
at least zero.
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
=
op A ( ) A =
op A ( ) A
T
=
PG- 05326- 032_V02 157
NVI DI A
Reference:http://www.netlib.org/blas/dtrsv.f
A
lda
x
On entry, x contains the n-element, right-hand-side vector b. On exit,
it is overwritten with the solution vector x.
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
158 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Double- Precision Complex BLAS2 funct ions
precisionhardware.
TwodoubleprecisioncomplexBLAS2functionsareimplemented:
FunctioncublasZgbmv()onpage 159
FunctioncublasZgemv()onpage 161
FunctioncublasZgerc()onpage 162
FunctioncublasZgeru()onpage 163
FunctioncublasZhbmv()onpage 165
FunctioncublasZhemv()onpage 167
FunctioncublasZher()onpage 168
FunctioncublasZher2()onpage 170
FunctioncublasZhpmv()onpage 171
FunctioncublasZhpr()onpage 173
FunctioncublasZhpr2()onpage 174
FunctioncublasZtbmv()onpage 175
FunctioncublasZtbsv()onpage 177
FunctioncublasZtpmv()onpage 179
FunctioncublasZtpsv()onpage 180
FunctioncublasZtrmv()onpage 182
FunctioncublasZtrsv()onpage 183
PG- 05326- 032_V02 159
NVI DI A
Funct ion cublasZgbmv( )
void
cublasZgbmv (char trans, int m, int n, int kl, int ku,
cuDoubleComplex alpha,
const cuDoubleComplex *A, int lda,
cuDoubleComplex beta, cuDoubleComplex *y,
int incy)
alphaandbetaaredoubleprecisioncomplexscalars,andxandyare
doubleprecisioncomplexvectors.Aisan mnbandmatrixconsisting
ofdoubleprecisioncomplexelementswithklsubdiagonalsandku
superdiagonals.
,where
, ,or ;
I nput
trans
m
n
kl
zero.
ku
least zero.
alpha
double-precision complex scalar multiplier applied to op(A).
A
double-precision complex array of dimensions (lda, n). The leading
referenced.
lda
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
160 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zgbmv.f
x
double-precision complex array of length at least
.
incx
beta
double-precision complex scalar multiplier applied to vector y. If beta
y
incy
must not be zero.
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 161
NVI DI A
Funct ion cublasZgemv( )
void
cublasZgemv (char trans, int m, int n,
int incy)
alphaandbetaaredoubleprecisioncomplexscalars;andxandyare
doubleprecisioncomplexvectors.Aisanmnmatrixconsistingof
doubleprecisioncomplexelements.MatrixAisstoredincolumn
majorformat,andldaistheleadingdimensionofthetwo
dimensionalarrayinwhichAisstored.
,
where , ,or ;
I nput
trans
m
n
alpha
double-precision complex scalar multiplier applied to op(A).
A
double-precision complex array of dimensions (lda, n) if trans ==
'N' or 'n', of dimensions (lda, m) otherwise; lda must be at least
lda
x
.
incx
beta
double-precision complex scalar multiplier applied to vector y. If beta
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
1 n 1 ( ) abs incx ( ) * + ( )
1 m 1 ( ) abs incx ( ) * + ( )
162 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zgemv.f
Funct ion cublasZgerc( )
void
cublasZgerc (int m, int n, cuDoubleComplex alpha,
const cuDoubleComplex *y, int incy,
cuDoubleComplex *A, int lda)
wherealphaisadoubleprecisioncomplexscalar,xisanmelement
doubleprecisioncomplexvector,yisannelementdoubleprecision
complexvector,andAisanmnmatrixconsistingofdoubleprecision
y
.
incy
Out put
y
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
1 m 1 ( ) abs incy ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
, A = alpha * x * y
H
+ A
PG- 05326- 032_V02 163
NVI DI A
Reference:http://www.netlib.org/blas/zgerc.f
Funct ion cublasZgeru( )
void
cublasZgeru (int m, int n, cuDoubleComplex alpha,
I nput
m
n
alpha
double-precision complex scalar multiplier applied to .
x
.
incx
y
.
incy
A
double-precision complex array of dimensions (lda, n).
lda
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
x * y
H
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
H
A + * * =
, A = alpha * x * y
T
+ A
164 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
wherealphaisadoubleprecisioncomplexscalar,xisanmelement
doubleprecisioncomplexvector,yisannelementdoubleprecision
complexvector,andAisanmnmatrixconsistingofdoubleprecision
Reference:http://www.netlib.org/blas/zgeru.f
I nput
m
n
alpha
x
.
incx
y
.
incy
A
double-precision complex array of dimensions (lda, n).
lda
Out put
A
Error St at us
if m < 0, n < 0, incx == 0, or
incy == 0
x * y
T
1 m 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
A alpha x y
T
A + * * =
PG- 05326- 032_V02 165
NVI DI A
Funct ion cublasZhbmv( )
void
cublasZhbmv (char uplo, int n, int k,
int incy)
wherealphaandbetaaredoubleprecisioncomplexscalars,andx
andyarenelementdoubleprecisioncomplexvectors.Aisa
Hermitiannnbandmatrixthatconsistsofdoubleprecisioncomplex
elements,withksuperdiagonalsandthesamenumberof
subdiagonals.
,
I nput
uplo
specifies whether the upper or lower triangular part of the Hermitian
n
k
Hermitian, this is also the number of subdiagonals; k must be at least
zero.
alpha
double-precision complex scalar multiplier applied to A * x.
166 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zhbmv.f
A
double-precision complex array of dimensions (lda, n). If uplo ==
'U' or 'u', the leading (k + 1)n part of array A must contain the
upper triangular band of the Hermitian matrix, supplied column by
column, with the leading diagonal of the matrix in row k + 1 of the
array, the first superdiagonal starting at position 2 in row k, and so on.
The top left kk triangle of array A is not referenced. When uplo ==
'L' or 'l', the leading (k + 1)n part of array A must contain the
lower triangular band part of the Hermitian matrix, supplied column
The bottom right kk triangle of array A is not referenced. The
assumed to be zero.
lda
leading dimension of A; lda must be at least k + 1.
x
.
incx
beta
double-precision complex scalar multiplier applied to vector y.
y
incy
Out put
y
Error St at us
if k < 0, n < 0, incx == 0, or
incy == 0
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 167
NVI DI A
Funct ion cublasZhemv( )
void
cublasZhemv (char uplo, int n, cuDoubleComplex alpha,
int incy)
Hermitiannnmatrixthatconsistsofdoubleprecisioncomplex
elementsandisstoredineitherupperorlowerstoragemode.
,
I nput
uplo
referenced. If uplo == 'U' or 'u', the Hermitian matrix A is stored in
'L' or 'l', the Hermitian matrix A is stored in lower storage mode;
n
alpha
A
contain the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced. If uplo == 'L' or
'l', the leading nn lower triangular part of the array A must contain
upper triangular part of A is not referenced. The imaginary parts of the
lda
x
.
incx
1 n 1 ( ) abs incx ( ) * + ( )
168 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zhemv.f
Funct ion cublasZher( )
void
cublasZher (char uplo, int n, double alpha,
ofdoubleprecisioncomplexelements.Aisstoredincolumnmajor
containingA.
beta
double-precision complex scalar multiplier applied to vector y.
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
1 n 1 ( ) abs incy ( ) * + ( )
, A alpha x x
H
A + * * =
PG- 05326- 032_V02 169
NVI DI A
Reference:http://www.netlib.org/blas/zher.f
I nput
uplo
n
alpha
x
.
incx
A
lda
Out put
A
Error St at us
x * x
H
1 n 1 ( ) abs incx ( ) * + ( )
A alpha x x
H
A + * * =
170 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZher2( )
void
cublasZher2 (char uplo, int n, cuDoubleComplex alpha,
wherealphaisadoubleprecisioncomplexscalar,xandyaren
elementdoubleprecisioncomplexvectors,andAisannnHermitian
matrixconsistingofdoubleprecisioncomplexelements.
,
I nput
uplo
n
alpha
double-precision complex scalar multiplier applied to
x
incx
y
incy
A
lda
A alpha x y
H
alpha y x
H
A + * * + * * =
x y
H
* y x
H
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 171
NVI DI A
Reference:http://www.netlib.org/blas/zher2.f
Funct ion cublasZhpmv( )
void
cublasZhpmv (char uplo, int n, cuDoubleComplex alpha,
const cuDoubleComplex *AP,
cuDoubleComplex beta,
Hermitiannnmatrixthatconsistsofdoubleprecisioncomplex
elementsandissuppliedinpackedform.
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
A alpha x y
H
alpha y x
H
A + * * + * * =
,
I nput
uplo
n
alpha
172 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zhpmv.f
AP
double-precision complex array with at least elements.
assumed to be zero.
x
.
incx
beta
y
incy
Out put
y
Error St at us
if n < 0, incx == 0, or incy == 0
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 173
NVI DI A
Funct ion cublasZhpr( )
void
cublasZhpr (char uplo, int n, double alpha,
cuDoubleComplex *AP)
ofdoubleprecisioncomplexelementsthatissuppliedinpackedform.
Reference:http://www.netlib.org/blas/zhpr.f
,
I nput
uplo
n
alpha
x
.
incx
AP
Out put
A
A alpha x x
H
A + * * =
x * x
H
1 n 1 ( ) abs incx ( ) * + ( )
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x x
H
A + * * =
174 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZhpr2( )
void
cublasZhpr2 (char uplo, int n, cuDoubleComplex alpha,
cuDoubleComplex *AP)
wherealphaisadoubleprecisioncomplexscalar,xandyaren
elementdoubleprecisioncomplexvectors,andAisannnHermitian
matrixconsistingofdoubleprecisioncomplexelementsthatis
suppliedinpackedform.
Error St at us
,
I nput
uplo
n
alpha
double-precision complex scalar multiplier applied to
x
.
incx
y
.
A alpha x y
H
alpha y x
H
A + * * + * * =
x y
H
* y x
H
*
1 n 1 ( ) abs incx ( ) * + ( )
1 n 1 ( ) abs incy ( ) * + ( )
PG- 05326- 032_V02 175
NVI DI A
Reference:http://www.netlib.org/blas/zhpr2.f
Funct ion cublasZt bmv( )
void
cublasZtbmv (char uplo, char trans, char diag, int n,
int k, const cuDoubleComplex *A, int lda,
incy
AP
Out put
A
Error St at us
if n < 0, incx == 0, or incy == 0
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
A alpha x y
H
alpha y x
H
A + * * + * * =
,
where , ,or ;
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
176 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
xisannelementdoubleprecisioncomplexvector,andAisannn,
unitornonunit,upperorlower,triangularbandmatrixconsistingof
doubleprecisioncomplexelements.
I nput
uplo
diag
n
at least zero.
k
A
double-precision complex array of dimension (lda, n). If uplo ==
diagonal of the matrix in row 1 of the array, the first subdiagonal
lda
x
.
the result vector.
incx
Out put
x
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
PG- 05326- 032_V02 177
NVI DI A
Reference:http://www.netlib.org/blas/ztbmv.f
Funct ion cublasZt bsv( )
void
cublasZtbsv (char uplo, char trans, char diag, int n,
int k, const cuDoubleComplex *A, int lda,
cuDoubleComplex *X, int incx)
Error St at us
if incx == 0, k<0, or n < 0
,
where , ,or ;
I nput
uplo
trans
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
178 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ztbsv.f
diag
n
k
be at least zero.
A
double-precision complex array of dimension (lda, n). If uplo ==
diagonal of the matrix in row 1 of the array, the first sub-diagonal
x
.
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
PG- 05326- 032_V02 179
NVI DI A
Funct ion cublasZt pmv( )
void
cublasZtpmv (char uplo, char trans, char diag, int n,
xisannelementdoubleprecisioncomplexvector,andAisannn,
unitornonunit,upperorlower,triangularmatrixconsistingof
,
where , ,or ;
I nput
uplo
trans
diag
n
at least zero.
AP
If uplo == 'U' or 'u', the array AP contains the upper triangular part
of the symmetric matrix A, packed sequentially, column by column;
that is, if i <= j, A[i,j] is stored in . If
uplo == 'L' or 'l', array AP contains the lower triangular part of the
x
incx
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
180 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ztpmv.f
Funct ion cublasZt psv( )
void
cublasZtpsv (char uplo, char trans, char diag, int n,
cuDoubleComplex *X, int incx)
bandxarenelementcomplexvectors,andAisannn,unitornon
unit,upperorlower,triangularmatrix.
Out put
x
Error St at us
x op A ( ) x * =
,
where , ,or ;
I nput
uplo
trans
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
PG- 05326- 032_V02 181
NVI DI A
Reference:http://www.netlib.org/blas/ztpsv.f
diag
n
at least zero.
AP
If uplo == 'U' or 'u', array AP contains the upper triangular matrix
A, packed sequentially, column by column; that is, if i <= j, A[i,j] is
x
.
incx
Out put
x
Error St at us
n n 1 + ( ) * ( ) 2
AP i j j 1 + ( ) 2 * ( ) + | |
AP i 2 n j 1 + * ( ) j * ( ) 2 + | |
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
182 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZt rmv( )
void
cublasZtrmv (char uplo, char trans, char diag, int n,
xisannelementdoubleprecisioncomplexvector;andAisannn,
unitornonunit,upperorlower,triangularmatrixconsistingof
,
where , ,or ;
I nput
uplo
trans
diag
n
at least zero.
A
part of A is not referenced. If uplo == 'L' or 'l', the leading nn
lower triangular part of the array A must contain the lower triangular
matrix, and the strictly upper triangular part of A is not referenced.
When diag == 'U' or 'u', the diagonal elements of A are not
referenced either, but are assumed to be unity.
lda
x op A ( ) x * =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
PG- 05326- 032_V02 183
NVI DI A
Reference:http://www.netlib.org/blas/ztrmv.f
Funct ion cublasZt rsv( )
void
cublasZtrsv (char uplo, char trans, char diag, int n,
bandxarenelementdoubleprecisioncomplexvectors,andAisan
nn,unitornonunit,upperorlower,triangularmatrixconsistingof
singleprecisionelements.MatrixAisstoredincolumnmajorformat,
andldaistheleadingdimensionofthetwodimensionalarray
containingA.
x
incx
Out put
x
Error St at us
1 n 1 ( ) abs incx ( ) * + ( )
x op A ( ) x * =
,
where , ,or ;
op A ( ) x * b =
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
184 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ztrsv.f
I nput
uplo
trans
diag
n
at least zero.
A
'U' or 'u', A contains the upper triangular part of the symmetric
== 'L' or 'l', A contains the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced.
lda
x
. On entry, x contains the n-element, right-
hand-side vector b. On exit, it is overwritten with solution vector x.
incx
Out put
x
Error St at us
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
1 n 1 ( ) abs incx ( ) * + ( )
op A ( ) x * b =
PG- 05326- 032_V02 185
NVI DI A
CH A P T E R
5
BLAS3 Funct i ons
Level3BasicLinearAlgebraSubprograms(BLAS3)performmatrix
matrixoperations.TheCUBLASimplementationsaredescribedinthe
followingsections:
DoublePrecisionComplexBLAS3Functionsonpage 231
186 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
ThesingleprecisionBLAS3functionsarelistedbelow:
FunctioncublasSgemm()onpage 187
FunctioncublasSsymm()onpage 188
FunctioncublasSsyrk()onpage 190
FunctioncublasSsyr2k()onpage 192
FunctioncublasStrmm()onpage 194
FunctioncublasStrsm()onpage 196
PG- 05326- 032_V02 187
NVI DI A
Funct ion cublasSgemm( )
void
cublasSgemm (char transa, char transb, int m, int n,
int k, float alpha, const float *A, int lda,
const float *B, int ldb, float beta,
float *C, int ldc)
computestheproductofmatrixAandmatrixB,multipliestheresult
byscalaralpha,andaddsthesumtotheproductofmatrixCand
scalarbeta.Itperformsoneofthematrixmatrixoperations:
andalphaandbetaaresingleprecisionscalars.A,B,andCare
matricesconsistingofsingleprecisionelements,withop(A)anmk
matrix,op(B)aknmatrix,andCanmnmatrix.MatricesA,B,andC
arestoredincolumnmajorformat,andlda,ldb,andldcarethe
leadingdimensionsofthetwodimensionalarrayscontainingA,B,
and C.
,
where or ,
I nput
transa
specifies op(A). If transa == 'N' or 'n', .
If transa == 'T', 't', 'C', or 'c', .
transb
specifies op(B). If transb == 'N' or 'n', .
If transb == 'T', 't', 'C', or 'c', .
m
number of rows of matrix op(A) and rows of matrix C; m must be at
least zero.
n
number of columns of matrix op(B) and number of columns of C;
n must be at least zero.
k
number of columns of matrix op(A) and number of rows of op(B);
k must be at least zero.
alpha
A
single-precision array of dimensions (lda, k) if transa == 'N' or
'n', and of dimensions (lda, m) otherwise. If transa == 'N' or
'n', lda must be at least max(1, m); otherwise, lda must be at least
max(1, k).
lda
C alpha * op(A) * op(B) beta * C + =
op X ( ) X = op X ( ) X
T
=
op A ( ) A =
op A ( ) A
T
=
op B ( ) B =
op B ( ) B
T
=
op A ( ) op B ( ) *
188 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/sgemm.f
Funct ion cublasSsymm( )
void
cublasSsymm (char side, char uplo, int m, int n,
float *C, int ldc)
performsoneofthematrixmatrixoperations
wherealphaandbetaaresingleprecisionscalars,Aisasymmetric
matrixconsistingofsingleprecisionelementsandisstoredineither
B
single-precision array of dimensions (ldb, n) if transb == 'N' or
'n', and of dimensions (ldb, k) otherwise. If transb == 'N' or
'n', ldb must be at least max(1, k); otherwise, ldb must be at least
max(1, n).
ldb
leading dimension of two-dimensional array used to store matrix B.
beta
single-precision scalar multiplier applied to C. If zero, C does not have
to be a valid input.
C
single-precision array of dimensions (ldc, n); ldc must be at least
max (1, m).
ldc
leading dimension of two-dimensional array used to store matrix C.
Out put
C
updated based on .
Error St at us
if m < 0, n < 0, or k < 0
or , C alpha A B beta C * + * * = C alpha B A beta C * + * * =
PG- 05326- 032_V02 189
NVI DI A
lowerorupperstoragemode.BandCaremnmatricesconsistingof
singleprecisionelements.
I nput
side
specifies whether the symmetric matrix A appears on the left-hand side
or right-hand side of matrix B.
If side == 'L' or 'l', .
If side == 'R' or 'r', .
uplo
specifies whether the symmetric matrix A is stored in upper or lower
storage mode. If uplo == 'U' or 'u', only the upper triangular part
of the symmetric matrix is referenced, and the elements of the strictly
lower triangular part are inferred from those in the upper triangular
part. If uplo == 'L' or 'l', only the lower triangular part of the
symmetric matrix is referenced, and the elements of the strictly upper
triangular part are inferred from those in the lower triangular part.
m
specifies the number of rows of matrix C, and the number of rows of
matrix B. It also specifies the dimensions of symmetric matrix A when
side == 'L' or 'l'; m must be at least zero.
n
specifies the number of columns of matrix C, and the number of
columns of matrix B. It also specifies the dimensions of symmetric
matrix A when side == 'R' or 'r''; n must be at least zero.
alpha
single-precision scalar multiplier applied to A * B or B * A.
A
single-precision array of dimensions (lda, ka), where ka is m when
side == 'L' or 'l' and is n otherwise. If side == 'L' or 'l', the
leading mm part of array A must contain the symmetric matrix such
that when uplo == 'U' or 'u', the leading mm part stores the upper
triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced; and when uplo == 'L' or 'l',
the leading mm part stores the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced. If
side == 'R' or 'r', the leading nn part of array A must contain the
symmetric matrix such that when uplo == 'U' or 'u', the leading
nn part stores the upper triangular part of the symmetric matrix, and
the strictly lower triangular part of A is not referenced; and when
uplo == 'L' or 'l', the leading nn part stores the lower triangular
part of the symmetric matrix, and the strictly upper triangular part is
not referenced.
lda
leading dimension of A. When side == 'L' or 'l', it must be at least
max(1, m) and at least max(1, n) otherwise.
C alpha A B beta C * + * * =
C alpha B A beta C * + * * =
190 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ssymm.f
Funct ion cublasSsyrk( )
void
cublasSsyrk (char uplo, char trans, int n, int k,
float beta, float *C, int ldc)
performsoneofthesymmetricrankkoperations
wherealphaandbetaaresingleprecisionscalars.Cisannn
symmetricmatrixconsistingofsingleprecisionelementsandisstored
ineitherlowerorupperstoragemode.Aisamatrixconsistingof
singleprecisionelementswithdimensionsofnkinthefirstcaseand
kninthesecondcase.
B
single-precision array of dimensions (ldb, n). On entry, the leading
mn part of the array contains the matrix B.
ldb
leading dimension of B; ldb must be at least max(1, m).
beta
single-precision scalar multiplier applied to C. If beta is zero, C does
not have to be a valid input.
C
single-precision array of dimensions (ldc, n).
ldc
leading dimension of C; ldc must be at least max(1, m).
Out put
C
updated according to or
.
Error St at us
if m < 0 or n < 0
or , C alpha A A
T
beta C * + * * = C alpha A
T
A * beta C * + * =
PG- 05326- 032_V02 191
NVI DI A
I nput
uplo
specifies whether the symmetric matrix C is stored in upper or lower
trans
specifies the operation to be performed. If trans == 'N' or 'n',
. If trans == 'T', 't', 'C', or 'c',
.
n
specifies the number of rows and the number columns of matrix C. If
trans == 'N' or 'n', n specifies the number of rows of matrix A. If
trans == 'T', 't', 'C', or 'c', n specifies the number of columns
of matrix A; n must be at least zero.
k
If trans == 'N' or 'n', k specifies the number of columns of
matrix A. If trans == 'T', 't', 'C', or 'c', k specifies the number
of rows of matrix A; k must be at least zero.
alpha
single-precision scalar multiplier applied to or .
A
single-precision array of dimensions (lda, ka), where ka is k when
trans == 'N' or 'n' and is n otherwise. When trans == 'N' or
'n', the leading nk part of array A contains the matrix A; otherwise,
the leading kn part of the array contains the matrix A.
lda
leading dimension of A. When trans == 'N' or 'n', lda must be at
least max(1, n). Otherwise lda must be at least max(1, k).
beta
single-precision scalar multiplier applied to C.
If beta is zero, C is not read.
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
A A
T
* A
T
A *
192 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ssyrk.f
Funct ion cublasSsyr2k( )
void
cublasSsyr2k (char uplo, char trans, int n, int k,
float *C, int ldc)
performsoneofthesymmetricrank2koperations
wherealphaandbetaaresingleprecisionscalars.Cisannn
symmetricmatrixconsistingofsingleprecisionelementsandisstored
ineitherlowerorupperstoragemode.AandBarematricesconsisting
C
single-precision array of dimensions (ldc, n). If uplo == 'U' or 'u',
the leading nn triangular part of the array C must contain the upper
triangular part of the symmetric matrix C, and the strictly lower
triangular part of C is not referenced. On exit, the upper triangular part
of C is overwritten by the upper triangular part of the updated matrix.
If uplo == 'L' or 'l', the leading nn triangular part of the array C
must contain the lower triangular part of the symmetric matrix C, and
the strictly upper triangular part of C is not referenced. On exit, the
lower triangular part of C is overwritten by the lower triangular part of
the updated matrix.
ldc
leading dimension of C; ldc must be at least max(1, n).
Out put
C
.
Error St at us
if n < 0 or k < 0
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
or
,
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
PG- 05326- 032_V02 193
NVI DI A
ofsingleprecisionelementswithdimensionofnkinthefirstcaseand
kninthesecondcase.
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
n
trans == 'N' or 'n'', n specifies the number of rows of matrix A. If
k
If trans == 'N' or 'n', k specifies the number of columns of matrix
A. If trans == 'T', 't', 'C', or 'c', k specifies the number of rows
of matrix A; k must be at least zero.
alpha
single-precision scalar multiplier.
A
single-precision array of dimensions (lda, ka), where ka is k when
'n', the leading nk part of array A must contain the matrix A,
otherwise the leading kn part of the array must contain the matrix A.
lda
B
single-precision array of dimensions (ldb, kb), where kb = k when
trans == 'N' or 'n', and k = n otherwise. When trans == 'N' or
'n', the leading nk part of array B must contain the matrix B,
otherwise the leading kn part of the array must contain the matrix B.
ldb
leading dimension of B. When trans == 'N' or 'n', ldb must be at
least max(1, n). Otherwise ldb must be at least max(1, k).
beta
single-precision scalar multiplier applied to C. If beta is zero, C does
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
194 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ssyr2k.f
Funct ion cublasSt rmm( )
void
cublasStrmm (char side, char uplo, char transa,
char diag, int m, int n, float alpha,
const float *A, int lda, const float *B,
int ldb)
C
single-precision array of dimensions (ldc, n). If uplo == 'U' or 'u',
the leading nn triangular part of the array C must contain the upper
triangular part of the symmetric matrix C, and the strictly lower
the updated matrix.
ldc
leading dimension of C; idc must be at least max(1, n).
Out put
C
or
.
Error St at us
if n < 0 or k < 0
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
or ,
where or ,
B alpha op A ( ) B * * = B alpha B op * A ( ) * =
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 195
NVI DI A
alphaisasingleprecisionscalar,Bisanmnmatrixconsistingof
singleprecisionelements,andAisaunitornonunit,upperorlower
triangularmatrixconsistingofsingleprecisionelements.
MatricesAandBarestoredincolumnmajorformat,andldaandldb
aretheleadingdimensionsofthetwodimensionalarraysthatcontain
AandB,respectively.
I nput
side
specifies whether op(A) multiplies B from the left or right.
uplo
transa
specifies the form of op(A) to be used in the matrix multiplication.
If transa == 'N' or 'n', .
If transa == 'T', 't', 'C', or 'c', .
diag
m
the number of rows of matrix B; m must be at least zero.
n
the number of columns of matrix B; n must be at least zero.
alpha
single-precision scalar multiplier applied to or ,
respectively. If alpha is zero, no accesses are made to matrix A, and
no read accesses are made to matrix B.
A
single-precision array of dimensions (lda, k). If side == 'L' or 'l',
k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or 'u', the
leading kk upper triangular part of the array A must contain the
not referenced. If uplo == 'L' or 'l', the leading kk lower
and are assumed to be unity.
lda
B alpha op A ( ) B * * =
B alpha B op A ( ) * * =
op A ( ) A =
op A ( ) A
T
=
op(A)*B B*op(A)
196 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/strmm.f
Funct ion cublasSt rsm( )
void
cublasStrsm (char side, char uplo, char transa,
char diag, int m, int n, float alpha,
const float *A, int lda, float *B, int ldb)
solvesoneofthematrixequations
alphaisasingleprecisionscalar,andXandBaremnmatricesthat
consistofsingleprecisionelements.Aisaunitornonunit,upperor
lower,triangularmatrix.
TheresultmatrixXoverwritesinputmatrixB;thatis,onexittheresult
isstoredinB.MatricesAandBarestoredincolumnmajorformat,and
ldaandldbaretheleadingdimensionsofthetwodimensionalarrays
thatcontainAandB,respectively.
B
single-precision array of dimensions (ldb, n). On entry, the leading
mn part of the array contains the matrix B. It is overwritten with the
transformed matrix on exit.
ldb
Out put
B
.
Error St at us
if m < 0 or n < 0
or ,
where or ,
op A ( ) X alpha B * = * X op A ( ) alpha B * = *
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 197
NVI DI A
I nput
side
specifies whether op(A) appears on the left or right of X:
side == 'L' or 'l' indicates solve ;
side == 'R' or 'r' indicates solve .
uplo
specifies whether the matrix A is an upper or lower triangular matrix:
uplo == 'U' or 'u' indicates A is an upper triangular matrix;
uplo == 'L' or 'l' indicates A is a lower triangular matrix.
transa
specifies the form of op(A) to be used in matrix multiplication.
If transa == 'T', 't', 'C', or 'c', .
diag
m
specifies the number of rows of B; m must be at least zero.
n
specifies the number of columns of B; n must be at least zero.
alpha
single-precision scalar multiplier applied to B. When alpha is zero, A is
not referenced and B does not have to be a valid input.
A
single-precision array of dimensions (lda, k), where k is m when
side == 'L' or 'l' and is n when side == 'R' or 'r'. If uplo ==
'U' or 'u', the leading kk upper triangular part of the array A must
matrix of A is not referenced. When uplo == 'L' or 'l', the leading
kk lower triangular part of the array A must contain the lower
triangular matrix, and the strictly upper triangular part of A is not
referenced. Note that when diag == 'U' or 'u', the diagonal
elements of A are not referenced and are assumed to be unity.
lda
leading dimension of the two-dimensional array containing A.
When side == 'L' or 'l', lda must be at least max(1, m).
When side == 'R' or 'r', lda must be at least max(1, n).
B
single-precision array of dimensions (ldb, n); ldb must be at least
max(1, m). The leading mn part of the array B must contain the right-
hand side matrix B. On exit B is overwritten by the solution matrix X.
ldb
leading dimension of the two-dimensional array containing B; ldb
must be at least max(1, m).
op A ( ) X alpha B * = *
X op A ( ) alpha B * = *
op A ( ) A =
op A ( ) A
T
=
198 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/strsm.f
Out put
B
contains the solution matrix X satisfying or
.
Error St at us
if m < 0 or n < 0
PG- 05326- 032_V02 199
NVI DI A
ThesearethesingleprecisioncomplexBLAS3functions:
FunctioncublasCgemm()onpage 200
FunctioncublasChemm()onpage 201
FunctioncublasCherk()onpage 203
FunctioncublasCher2k()onpage 205
FunctioncublasCsymm()onpage 207
FunctioncublasCsyrk()onpage 209
FunctioncublasCsyr2k()onpage 211
FunctioncublasCtrmm()onpage 213
FunctioncublasCtrsm()onpage 215
200 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasCgemm( )
void
cublasCgemm (char transa, char transb, int m, int n,
int k, cuComplex alpha, const cuComplex *A,
int lda, const cuComplex *B, int ldb,
cuComplex beta, cuComplex *C, int ldc)
andalphaandbetaaresingleprecisioncomplexscalars.A,B,andC
arematricesconsistingofsingleprecisioncomplexelements,with
op(A)anmkmatrix,op(B)aknmatrix,andCanmnmatrix.
,
where , ,or ;
I nput
transa
If transa == 'T' or 't', .
If transa == 'C' or 'c', .
transb
If transb == 'T' or 't', .
If transb == 'C' or 'c', .
m
number of rows of matrix op(A) and rows of matrix C;
m must be at least zero.
n
k
alpha
A
single-precision complex array of dimension (lda, k) if transa ==
'N' or 'n', and of dimension (lda, m) otherwise.
lda
leading dimension of A. When transa == 'N' or 'n', it must be at
least max(1, m) and at least max(1, k) otherwise.
B
single-precision complex array of dimension (ldb, n) if transb ==
'N' or 'n', and of dimension (ldb, k) otherwise.
ldb
leading dimension of B. When transb == 'N' or 'n', it must be at
least max(1, k) and at least max(1, n) otherwise.
C alpha op A ( ) op B ( ) beta C * + * * =
op X ( ) X = op X ( ) X
T
= op X ( ) X
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
op B ( ) B =
op B ( ) B
T
=
op B ( ) B
H
=
op(A)*op(B)
PG- 05326- 032_V02 201
NVI DI A
Reference:http://www.netlib.org/blas/cgemm.f
Funct ion cublasChemm( )
void
cublasChemm (char side, char uplo, int m, int n,
wherealphaandbetaaresingleprecisioncomplexscalars,Aisa
Hermitianmatrixconsistingofsingleprecisioncomplexelementsand
isstoredineitherlowerorupperstoragemode.BandCaremn
matricesconsistingofsingleprecisioncomplexelements.
beta
single-precision complex scalar multiplier applied to C. If beta is zero,
C does not have to be a valid input.
C
single-precision array of dimensions (ldc, n).
ldc
leading dimension of C; idc must be at least max(1, m).
Out put
C
Error St at us
if m < 0, n < 0, or k < 0
202 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
I nput
side
specifies whether the Hermitian matrix A appears on the left-hand side
uplo
specifies whether the Hermitian matrix A is stored in upper or lower
of the Hermitian matrix is referenced, and the elements of the strictly
Hermitian matrix is referenced, and the elements of the strictly upper
m
matrix B. It also specifies the dimensions of Hermitian matrix A when
n
columns of matrix B. It also specifies the dimensions of Hermitian
alpha
single-precision complex scalar multiplier applied to A * B or B * A.
A
single-precision complex array of dimensions (lda, ka), where ka is
m when side == 'L' or 'l' and is n otherwise. If side == 'L' or
'l', the leading mm part of array A must contain the Hermitian
matrix such that when uplo == 'U' or 'u', the leading mm part
stores the upper triangular part of the Hermitian matrix, and the
strictly lower triangular part of A is not referenced; and when uplo ==
'L' or 'l', the leading mm part stores the lower triangular part of the
Hermitian matrix, and the strictly upper triangular part is not
referenced. If side == 'R' or 'r', the leading nn part of array A
must contain the Hermitian matrix such that when uplo == 'U' or
'u', the leading nn part stores the upper triangular part of the
Hermitian matrix, and the strictly lower triangular part of A is not
referenced; and when uplo == 'L' or 'l', the leading nn part stores
upper triangular part is not referenced. The imaginary parts of the
lda
PG- 05326- 032_V02 203
NVI DI A
Reference:http://www.netlib.org/blas/chemm.f
Funct ion cublasCherk( )
void
cublasCherk (char uplo, char trans, int n, int k,
float alpha, const cuComplex *A,
int lda, float beta, cuComplex *C,
int ldc)
performsoneoftheHermitianrankkoperations
wherealphaandbetaaresingleprecisionrealscalars.Cisannn
Hermitianmatrixconsistingofsingleprecisioncomplexelementsand
isstoredineitherlowerorupperstoragemode.Aisamatrixconsisting
ofsingleprecisioncomplexelementswithdimensionsofnkinthe
firstcaseandkninthesecondcase.
B
single-precision complex array of dimensions (ldb, n). On entry, the
leading mn part of the array contains the matrix B.
ldb
beta
C
single-precision complex array of dimensions (ldc, n).
ldc
Out put
C
.
Error St at us
if m < 0 or n < 0
or , C alpha A A
H
H
A * beta C * + * =
204 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
I nput
uplo
specifies whether the Hermitian matrix C is stored in upper or lower
trans
. If trans == 'T', 't', 'C', or 'c',
.
n
k
alpha
single-precision scalar multiplier applied to or .
A
k when trans == 'N' or 'n' and is n otherwise. When trans ==
'N' or 'n', the leading nk part of array A contains the matrix A;
otherwise, the leading kn part of the array contains the matrix A.
lda
beta
single-precision real scalar multiplier applied to C.
If beta is zero, C does not have to be a valid input.
C alpha A A
H
beta C * + * * =
C alpha A
H
A * beta C * + * =
A A
H
* A
H
A *
PG- 05326- 032_V02 205
NVI DI A
Reference:http://www.netlib.org/blas/cherk.f
Funct ion cublasCher2k( )
void
cublasCher2k (char uplo, char trans, int n, int k,
float beta, cuComplex *C, int ldc)
performsoneoftheHermitianrank2koperations
C
single-precision complex array of dimensions (ldc, n). If uplo ==
'U' or 'u', the leading nn triangular part of the array C must contain
the upper triangular part of the Hermitian matrix C, and the strictly
lower triangular part of C is not referenced. On exit, the upper
triangular part of C is overwritten by the upper triangular part of the
updated matrix. If uplo == 'L' or 'l', the leading nn triangular
part of the array C must contain the lower triangular part of the
Hermitian matrix C, and the strictly upper triangular part of C is not
referenced. On exit, the lower triangular part of C is overwritten by the
lower triangular part of the updated matrix. The imaginary parts of the
diagonal elements need not be set; they are assumed to be zero, and on
exit they are set to zero.
ldc
Out put
C
.
Error St at us
if n < 0 or k < 0
C alpha A A
H
beta C * + * * =
C alpha A
H
A * beta C * + * =
or
,
C alpha A B
H
alpha B A
H
beta C + + * * + * * =
C alpha A
H
B alpha B
H
A beta C + + * * + * * =
206 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
wherealphaisasingleprecisioncomplexscalarandbetaisasingle
precisionrealscalar.CisannnHermitianmatrixconsistingofsingle
precisioncomplexelementsandisstoredineitherlowerorupper
storagemode.AandBarematricesconsistingofsingleprecision
complexelementswithdimensionsofnkinthefirstcaseandknin
thesecondcase.
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
n
trans == 'N' or 'n', n specifies the number of rows of matrices A
and B. If trans == 'T', 't', 'C', or 'c', n specifies the number of
columns of matrices A and B; n must be at least zero.
k
matrices A and B. If trans == 'T', 't', 'C', or 'c', k specifies the
number of rows of matrices A and B; k must be at least zero.
alpha
single-precision complex scalar multiplier.
A
lda
B
single-precision complex array of dimensions (ldb, kb), where kb is
'N' or 'n', the leading nk part of array B contains the matrix B;
otherwise, the leading kn part of the array contains the matrix B.
ldb
C alpha A B
H
alpha B A
H
beta C + + * * + * * =
C alpha A
H
B alpha B
H
A beta C + + * * + * * =
PG- 05326- 032_V02 207
NVI DI A
Reference:http://www.netlib.org/blas/cher2k.f
Funct ion cublasCsymm( )
void
cublasCsymm (char side, char uplo, int m, int n,
beta
single-precision real scalar multiplier applied to C.
C
ldc
Out put
C
or .
Error St at us
if n < 0 or k < 0
C alpha A B
H
alpha * B*A
H
+ beta*C + * * =
C alpha A
H
B alpha *B
H
*A beta *C + + * * =
208 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
wherealphaandbetaaresingleprecisioncomplexscalars,Aisa
symmetricmatrixconsistingofsingleprecisioncomplexelementsand
isstoredineitherlowerorupperstoragemode.BandCaremn
matricesconsistingofsingleprecisioncomplexelements.
I nput
side
uplo
m
n
alpha
single-precision complex scalar multiplier applied to A * B or B * A.
A
'l', the leading mm part of array A must contain the symmetric
stores the upper triangular part of the symmetric matrix, and the
symmetric matrix, and the strictly upper triangular part is not
must contain the symmetric matrix such that when uplo == 'U' or
symmetric matrix, and the strictly lower triangular part of A is not
the lower triangular part of the symmetric matrix, and the strictly
upper triangular part is not referenced.
PG- 05326- 032_V02 209
NVI DI A
Reference:http://www.netlib.org/blas/csymm.f
Funct ion cublasCsyrk( )
void
cublasCsyrk (char uplo, char trans, int n, int k,
int lda, cuComplex beta, cuComplex *C,
int ldc)
wherealphaandbetaaresingleprecisioncomplexscalars.Cisan
nnsymmetricmatrixconsistingofsingleprecisioncomplexelements
andisstoredineitherlowerorupperstoragemode.Aisamatrix
consistingofsingleprecisioncomplexelementswithdimensionsof
nkinthefirstcaseandkninthesecondcase.
lda
B
ldb
beta
C
single-precision complex array of dimensions (ldc, n).
ldc
Out put
C
.
Error St at us
if m < 0 or n < 0
or , C alpha A A
T
T
A * beta C * + * =
210 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
I nput
uplo
trans
. If trans == 'T', 't', 'C', or 'c',
.
n
k
alpha
single-precision complex scalar multiplier applied to or .
A
lda
beta
single-precision complex scalar multiplier applied to C.
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
A A
T
* A
T
A *
PG- 05326- 032_V02 211
NVI DI A
Reference:http://www.netlib.org/blas/csyrk.f
Funct ion cublasCsyr2k( )
void
cublasCsyr2k (char uplo, char trans, int n, int k,
wherealphaandbetaaresingleprecisioncomplexscalars.Cisan
nnsymmetricmatrixconsistingofsingleprecisioncomplexelements
andisstoredineitherlowerorupperstoragemode.AandBare
C
the upper triangular part of the symmetric matrix C, and the strictly
symmetric matrix C, and the strictly upper triangular part of C is not
lower triangular part of the updated matrix.
ldc
Out put
C
.
Error St at us
if n < 0 or k < 0
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
or
,
C alpha A B
T
alpha *B*A
T
+ beta*C + * * =
C alpha A
T
B * * alpha B
T
A * * beta C * + + =
212 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
matricesconsistingofsingleprecisioncomplexelementswith
dimensionsofnkinthefirstcaseandkninthesecondcase.
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
n
k
alpha
single-precision complex scalar multiplier.
A
lda
B
single-precision complex array of dimensions (ldb, kb), where kb is
'N' or 'n', the leading nk part of array B contains the matrix B;
otherwise, the leading kn part of the array contains the matrix B.
ldb
beta
single-precision complex scalar multiplier applied to C.
C alpha A B
T
alpha *B*A
T
+ beta*C + * * =
C alpha A
T
B * * alpha B
T
A * * beta C * + + =
PG- 05326- 032_V02 213
NVI DI A
Reference:http://www.netlib.org/blas/csyr2k.f
Funct ion cublasCt rmm( )
void
cublasCtrmm (char side, char uplo, char transa,
char diag, int m, int n, cuComplex alpha,
const cuComplex *B, int ldb)
alphaisasingleprecisioncomplexscalar;Bisanmnmatrix
consistingofsingleprecisioncomplexelements;andAisaunitornon
C
ldc
Out put
C
or .
Error St at us
if n < 0 or k < 0
C alpha A B
T
alpha *B*A
T
+ beta*C + * * =
C alpha A
T
B * * alpha B
T
A * * beta C * + + =
or ,
where , ,or ;
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
214 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
unit,upperorlowertriangularmatrixconsistingofsingleprecision
complexelements.
AandB,respectively.
I nput
side
uplo
transa
diag
m
n
alpha
single-precision complex scalar multiplier applied to or
, respectively. If alpha is zero, no accesses are made to
matrix A, and no read accesses are made to matrix B.
A
single-precision complex array of dimensions (lda, k). If side ==
'L' or 'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or
'u', the leading kk upper triangular part of the array A must contain
lda
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
op(A)*B
B*op(A)
PG- 05326- 032_V02 215
NVI DI A
Reference:http://www.netlib.org/blas/ctrmm.f
Funct ion cublasCt rsm( )
void
cublasCtrsm (char side, char uplo, char transa,
char diag, int m, int n, cuComplex alpha,
const cuComplex *A, int lda, cuComplex *B,
int ldb)
alphaisasingleprecisioncomplexscalar,andXandBaremn
matricesthatconsistofsingleprecisioncomplexelements.Aisaunit
ornonunit,upperorlower,triangularmatrix.
B
leading mn part of the array contains the matrix B. It is overwritten
with the transformed matrix on exit.
ldb
Out put
B
.
Error St at us
if m < 0 or n < 0
or ,
where , ,or ;
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
216 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
I nput
side
uplo
transa
diag
m
n
alpha
single-precision complex scalar multiplier applied to B. When alpha is
zero, A is not referenced and B does not have to be a valid input.
A
single-precision complex array of dimensions (lda, k), where k is m
when side == 'L' or 'l' and is n when side == 'R' or 'r'. If
uplo == 'U' or 'u', the leading kk upper triangular part of the array
A must contain the upper triangular matrix, and the strictly lower
triangular matrix of A is not referenced. When uplo == 'L' or 'l',
the leading kk lower triangular part of the array A must contain the
lower triangular matrix, and the strictly upper triangular part of A is
not referenced. Note that when diag == 'U' or 'u', the diagonal
lda
B
single-precision complex array of dimensions (ldb, n); ldb must be
at least max(1, m). The leading mn part of the array B must contain
the right-hand side matrix B. On exit, B is overwritten by the solution
matrix X.
ldb
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
PG- 05326- 032_V02 217
NVI DI A
Reference:http://www.netlib.org/blas/ctrsm.f
Out put
B
.
Error St at us
if m < 0 or n < 0
218 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
precisionhardware.
ThedoubleprecisionBLAS3functionsarelistedbelow:
FunctioncublasDgemm()onpage 219
FunctioncublasDsymm()onpage 220
FunctioncublasDsyrk()onpage 222
FunctioncublasDsyr2k()onpage 224
FunctioncublasDtrmm()onpage 226
FunctioncublasDtrsm()onpage 228
PG- 05326- 032_V02 219
NVI DI A
Funct ion cublasDgemm( )
void
cublasDgemm (char transa, char transb, int m, int n,
int k, double alpha, const double *A,
int lda, const double *B, int ldb,
double beta, double *C, int ldc)
computestheproductofmatrixAandmatrixB,multipliestheresult
byscalaralpha,andaddsthesumtotheproductofmatrixCand
scalarbeta.Itperformsoneofthematrixmatrixoperations:
andalphaandbetaaredoubleprecisionscalars.A,B,andCare
matricesconsistingofdoubleprecisionelements,withop(A)anmk
matrix,op(B)aknmatrix,andCanmnmatrix.MatricesA,B,andC
arestoredincolumnmajorformat,andlda,ldb,andldcarethe
leadingdimensionsofthetwodimensionalarrayscontainingA,B,
andC.
,
where or ,
I nput
transa
If transa == 'T', 't', 'C', or 'c', .
transb
If transb == 'T', 't', 'C', or 'c', .
m
number of rows of matrix op(A) and rows of matrix C; m must be at
least zero.
n
k
alpha
A
double-precision array of dimensions (lda, k) if transa == 'N' or
'n', and of dimensions (lda, m) otherwise. If transa == 'N' or
'n', lda must be at least max(1, m); otherwise, lda must be at least
max(1, k).
lda
op X ( ) X = op X ( ) X
T
=
op A ( ) A =
op A ( ) A
T
=
op B ( ) B =
op B ( ) B
T
=
op A ( ) op B ( ) *
220 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dgemm.f
Funct ion cublasDsymm( )
void
cublasDsymm (char side, char uplo, int m, int n,
const double *B, int ldb, double beta,
double *C, int ldc)
wherealphaandbetaaredoubleprecisionscalars,Aisasymmetric
matrixconsistingofdoubleprecisionelementsandisstoredineither
B
double-precision array of dimensions (ldb, n) if transb == 'N' or
'n', and of dimensions (ldb, k) otherwise. If transb == 'N' or
'n', ldb must be at least max(1, k); otherwise, ldb must be at least
max(1, n).
ldb
leading dimension of two-dimensional array used to store matrix B.
beta
double-precision scalar multiplier applied to C. If zero, C does not have
to be a valid input.
C
double-precision array of dimensions (ldc, n); ldc must be at least
max (1, m).
ldc
leading dimension of two-dimensional array used to store matrix C.
Out put
C
updated based on .
Error St at us
if m < 0, n < 0, or k < 0
PG- 05326- 032_V02 221
NVI DI A
lowerorupperstoragemode.BandCaremnmatricesconsistingof
doubleprecisionelements.
I nput
side
uplo
m
n
alpha
double-precision scalar multiplier applied to A * B or B * A.
A
double-precision array of dimensions (lda, ka), where ka is m when
side == 'L' or 'l' and is n otherwise. If side == 'L' or 'l', the
leading mm part of array A must contain the symmetric matrix such
that when uplo == 'U' or 'u', the leading mm part stores the upper
triangular part of the symmetric matrix, and the strictly lower
triangular part of A is not referenced; and when uplo == 'L' or 'l',
the leading mm part stores the lower triangular part of the symmetric
matrix, and the strictly upper triangular part is not referenced. If
side == 'R' or 'r', the leading nn part of array A must contain the
symmetric matrix such that when uplo == 'U' or 'u', the leading
nn part stores the upper triangular part of the symmetric matrix, and
the strictly lower triangular part of A is not referenced; and when
uplo == 'L' or 'l', the leading nn part stores the lower triangular
part of the symmetric matrix, and the strictly upper triangular part is
not referenced.
lda
222 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dsymm.f
Funct ion cublasDsyrk( )
void
cublasDsyrk (char uplo, char trans, int n, int k,
double beta, double *C, int ldc)
wherealphaandbetaaredoubleprecisionscalars.Cisannn
symmetricmatrixconsistingofdoubleprecisionelementsandis
storedineitherlowerorupperstoragemode.Aisamatrixconsisting
ofdoubleprecisionelementswithdimensionsofnkinthefirstcase
andkninthesecondcase.
B
double-precision array of dimensions (ldb, n). On entry, the leading
mn part of the array contains the matrix B.
ldb
beta
double-precision scalar multiplier applied to C. If beta is zero, C does
C
double-precision array of dimensions (ldc, n).
ldc
Out put
C
.
Error St at us
if m < 0, n < 0, or k < 0
or , C alpha A A
T
T
A * beta C * + * =
PG- 05326- 032_V02 223
NVI DI A
I nput
uplo
trans
. If trans == 'T', 't', 'C', or 'c',
.
n
k
alpha
double-precision scalar multiplier applied to or .
A
double-precision array of dimensions (lda, ka), where ka is k when
lda
beta
double-precision scalar multiplier applied to C.
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
A A
T
* A
T
A *
224 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dsyrk.f
Funct ion cublasDsyr2k( )
void
cublasDsyr2k (char uplo, char trans, int n, int k,
const double *B, int ldb, double beta,
double *C, int ldc)
C
double-precision array of dimensions (ldc, n). If uplo == 'U' or
'u', the leading nn triangular part of the array C must contain the
upper triangular part of the symmetric matrix C, and the strictly lower
the updated matrix.
ldc
Out put
C
.
Error St at us
if m < 0, n < 0, or k < 0
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
or
,
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
PG- 05326- 032_V02 225
NVI DI A
symmetricmatrixconsistingofdoubleprecisionelementsandis
storedineitherlowerorupperstoragemode.AandBarematrices
consistingofdoubleprecisionelementswithdimensionofnkinthe
firstcaseandkninthesecondcase.
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
n
trans == 'N' or 'n'', n specifies the number of rows of matrix A. If
k
If trans == 'N' or 'n', k specifies the number of columns of matrix
A. If trans == 'T', 't', 'C', or 'c', k specifies the number of rows
of matrix A; k must be at least zero.
alpha
double-precision scalar multiplier.
A
lda
B
double-precision array of dimensions (ldb, kb), where kb = k when
ldb
beta
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
226 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dsyr2k.f
Funct ion cublasDt rmm( )
void
cublasDtrmm (char side, char uplo, char transa,
char diag, int m, int n, double alpha,
const double *A, int lda, const double *B,
int ldb)
C
the updated matrix.
ldc
Out put
C
or
.
Error St at us
if m < 0, n < 0, or k < 0
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
or ,
where or ,
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 227
NVI DI A
alphaisadoubleprecisionscalar,Bisanmnmatrixconsistingof
doubleprecisionelements,andAisaunitornonunit,upperorlower
triangularmatrixconsistingofdoubleprecisionelements.
AandB,respectively.
I nput
side
uplo
transa
specifies the form of op(A) to be used in the matrix multiplication.
If transa == 'T', 't', 'C', or 'c', .
diag
m
n
alpha
double-precision scalar multiplier applied to or ,
respectively. If alpha is zero, no accesses are made to matrix A, and
no read accesses are made to matrix B.
A
double-precision array of dimensions (lda, k). If side == 'L' or
'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or 'u', the
leading kk upper triangular part of the array A must contain the
lda
op A ( ) A =
op A ( ) A
T
=
op(A)*B B*op(A)
228 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dtrmm.f
Funct ion cublasDt rsm( )
void
cublasDtrsm (char side, char uplo, char transa,
char diag, int m, int n, double alpha,
const double *A, int lda, double *B,
int ldb)
alphaisadoubleprecisionscalar,andXandBaremnmatricesthat
consistofdoubleprecisionelements.Aisaunitornonunit,upperor
lower,triangularmatrix.
B
double-precision array of dimensions (ldb, n). On entry, the leading
mn part of the array contains the matrix B. It is overwritten with the
transformed matrix on exit.
ldb
Out put
B
.
Error St at us
if m < 0, n < 0, or k < 0
or ,
where or ,
op A ( ) A = op A ( ) A
T
=
PG- 05326- 032_V02 229
NVI DI A
I nput
side
uplo
transa
specifies the form of op(A) to be used in matrix multiplication.
If transa == 'T', 't', 'C', or 'c', .
diag
m
n
alpha
double-precision scalar multiplier applied to B. When alpha is zero, A
is not referenced and B does not have to be a valid input.
A
double-precision array of dimensions (lda, k), where k is m when
side == 'L' or 'l' and is n when side == 'R' or 'r'. If uplo ==
'U' or 'u', the leading kk upper triangular part of the array A must
matrix of A is not referenced. When uplo == 'L' or 'l', the leading
kk lower triangular part of the array A must contain the lower
triangular matrix, and the strictly upper triangular part of A is not
referenced. Note that when diag == 'U' or 'u', the diagonal
lda
B
double-precision array of dimensions (ldb, n); ldb must be at least
max(1, m). The leading mn part of the array B must contain the right-
hand side matrix B. On exit, B is overwritten by the solution matrix X.
ldb
op A ( ) A =
op A ( ) A
T
=
230 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/dtrsm.f
Out put
B
.
Error St at us
if m < 0, n < 0, or k < 0
PG- 05326- 032_V02 231
NVI DI A
Double- Precision Complex BLAS3 Funct ions
precisionhardware.
FivedoubleprecisioncomplexBLAS3functionsareimplemented:
FunctioncublasZgemm()onpage 232
FunctioncublasZhemm()onpage 233
FunctioncublasZherk()onpage 235
FunctioncublasZher2k()onpage 238
FunctioncublasZsymm()onpage 240
FunctioncublasZsyrk()onpage 242
FunctioncublasZsyr2k()onpage 244
FunctioncublasZtrmm()onpage 246
FunctioncublasZtrsm()onpage 248
232 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZgemm( )
void
cublasZgemm (char transa, char transb, int m, int n,
int k, cuDoubleComplex alpha,
const cuDoubleComplex *B, int ldb,
cuDoubleComplex beta, cuDoubleComplex *C,
int ldc)
andalphaandbetaaredoubleprecisioncomplexscalars.A,B,andC
arematricesconsistingofdoubleprecisioncomplexelements,with
op(A)anmkmatrix,op(B)aknmatrix,andCanmnmatrix.
,
where , ,or ;
I nput
transa
transb
If transb == 'T' or 't', .
If transb == 'C' or 'c', .
m
number of rows of matrix op(A) and rows of matrix C;
m must be at least zero.
n
k
alpha
A
double-precision complex array of dimension (lda, k) if transa ==
'N' or 'n', and of dimension (lda, m) otherwise.
lda
leading dimension of A. When transa == 'N' or 'n', it must be at
least max(1, m) and at least max(1, k) otherwise.
B
double-precision complex array of dimension (ldb, n) if transb ==
'N' or 'n', and of dimension (ldb, k) otherwise.
op X ( ) X = op X ( ) X
T
= op X ( ) X
H
=
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
op B ( ) B =
op B ( ) B
T
=
op B ( ) B
H
=
op(A)*op(B)
PG- 05326- 032_V02 233
NVI DI A
Reference:http://www.netlib.org/blas/zgemm.f
Funct ion cublasZhemm( )
void
cublasZhemm (char side, char uplo, int m, int n,
int ldc)
wherealphaandbetaaredoubleprecisioncomplexscalars,Aisa
Hermitianmatrixconsistingofdoubleprecisioncomplexelements
ldb
leading dimension of B. When transb == 'N' or 'n', it must be at
least max(1, k) and at least max(1, n) otherwise.
beta
double-precision complex scalar multiplier applied to C. If beta is
zero, C does not have to be a valid input.
C
double-precision array of dimensions (ldc, n).
ldc
leading dimension of C; idc must be at least max(1, m).
Out put
C
Error St at us
if m < 0, n < 0, or k < 0
234 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
andisstoredineitherlowerorupperstoragemode.BandCaremn
matricesconsistingofdoubleprecisioncomplexelements.
I nput
side
specifies whether the Hermitian matrix A appears on the left-hand side
uplo
specifies whether the Hermitian matrix A is stored in upper or lower
m
matrix B. It also specifies the dimensions of Hermitian matrix A when
n
columns of matrix B. It also specifies the dimensions of Hermitian
alpha
double-precision complex scalar multiplier applied to A * B or B * A.
A
double-precision complex array of dimensions (lda, ka), where ka is
'l', the leading mm part of array A must contain the Hermitian
stores the upper triangular part of the Hermitian matrix, and the
Hermitian matrix, and the strictly upper triangular part is not
must contain the Hermitian matrix such that when uplo == 'U' or
Hermitian matrix, and the strictly lower triangular part of A is not
upper triangular part is not referenced. The imaginary parts of the
lda
PG- 05326- 032_V02 235
NVI DI A
Reference:http://www.netlib.org/blas/zhemm.f
Funct ion cublasZherk( )
void
cublasZherk (char uplo, char trans, int n, int k,
double alpha, const cuDoubleComplex *A,
int lda, double beta, cuDoubleComplex *C,
int ldc)
performsoneoftheHermitianrankkoperations
Hermitianmatrixconsistingofdoubleprecisioncomplexelements
andisstoredineitherlowerorupperstoragemode.Aisamatrix
B
double-precision complex array of dimensions (ldb, n). On entry, the
ldb
beta
C
double-precision complex array of dimensions (ldc, n).
ldc
Out put
C
.
Error St at us
if m < 0 or n < 0
or , C alpha A A
H
H
A * beta C * + * =
236 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
consistingofdoubleprecisioncomplexelementswithdimensionsof
nkinthefirstcaseandkninthesecondcase.
I nput
uplo
trans
. If trans == 'T', 't', 'C', or 'c',
.
n
k
alpha
double-precision scalar multiplier applied to or .
A
lda
beta
double-precision scalar multiplier applied to C.
C alpha A A
H
beta C * + * * =
C alpha A
H
A * beta C * + * =
A A
H
* A
H
A *
PG- 05326- 032_V02 237
NVI DI A
Reference:http://www.netlib.org/blas/zherk.f
C
double-precision complex array of dimensions (ldc, n). If uplo ==
ldc
Out put
C
.
Error St at us
if n < 0 or k < 0
C alpha A A
H
beta C * + * * =
C alpha A
H
A * beta C * + * =
238 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZher2k( )
void
cublasZher2k (char uplo, char trans, int n, int k,
double beta, cuDoubleComplex *C, int ldc)
performsoneoftheHermitianrank2koperations
wherealphaisadoubleprecisioncomplexscalarandbetaisa
doubleprecisionrealscalar.CisannnHermitianmatrixconsistingof
doubleprecisioncomplexelementsandisstoredineitherloweror
upperstoragemode.AandBarematricesconsistingofdouble
precisioncomplexelementswithdimensionsofnkinthefirstcase
andkninthesecondcase.
or
,
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
n
k
alpha
double-precision complex multiplier.
C alpha A B
H
alpha B A
H
beta C + + * * + * * =
C alpha A
H
B alpha B
H
A beta C + + * * + * * =
C alpha A B
H
alpha B A
H
beta C + + * * + * * =
C alpha A
H
B alpha B
H
A beta C + + * * + * * =
PG- 05326- 032_V02 239
NVI DI A
Reference:http://www.netlib.org/blas/zher2k.f
A
lda
B
double-precision array of dimensions (ldb, kb), where kb is k when
'n', the leading nk part of array B contains the matrix B; otherwise,
the leading kn part of the array contains the matrix B.
ldb
beta
double-precision real scalar multiplier applied to C.
C
upper triangular part of the Hermitian matrix C, and the strictly lower
must contain the lower triangular part of the Hermitian matrix C, and
the updated matrix. The imaginary parts of the diagonal elements
need not be set; they are assumed to be zero, and on exit they are set
to zero.
ldc
Out put
C
or .
Error St at us
if n < 0 or k < 0
C alpha A B
H
alpha * B*A
H
+ beta*C + * * =
C alpha A
H
B alpha *B
H
*A beta *C + + * * =
240 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZsymm( )
void
cublasZsymm (char side, char uplo, int m, int n,
const cuDoubleComplex *A,
int lda, const cuDoubleComplex *B, int ldb,
int ldc)
wherealphaandbetaaredoubleprecisioncomplexscalars,Aisa
symmetricmatrixconsistingofdoubleprecisioncomplexelements
andisstoredineitherlowerorupperstoragemode.BandCaremn
matricesconsistingofdoubleprecisioncomplexelements.
or ,
I nput
side
uplo
m
C alpha A B beta C * + * * = C alpha B A beta C * + * * =
PG- 05326- 032_V02 241
NVI DI A
Reference:http://www.netlib.org/blas/zsymm.f
n
alpha
double-precision complex scalar multiplier applied to A * B or B * A.
A
'l', the leading mm part of array A must contain the symmetric
stores the upper triangular part of the symmetric matrix, and the
symmetric matrix, and the strictly upper triangular part is not
must contain the symmetric matrix such that when uplo == 'U' or
symmetric matrix, and the strictly lower triangular part of A is not
the lower triangular part of the symmetric matrix, and the strictly
upper triangular part is not referenced.
lda
B
ldb
beta
C
double-precision complex array of dimensions (ldc, n).
ldc
Out put
C
.
242 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZsyrk( )
void
cublasZsyrk (char uplo, char trans, int n, int k,
cuDoubleComplex *C, int ldc)
wherealphaandbetaaredoubleprecisioncomplexscalars.Cisan
nnsymmetricmatrixconsistingofdoubleprecisioncomplex
elementsandisstoredineitherlowerorupperstoragemode.Aisa
matrixconsistingofdoubleprecisioncomplexelementswith
dimensionsofnkinthefirstcaseandkninthesecondcase.
Error St at us
if m < 0 or n < 0
or ,
I nput
uplo
trans
. If trans == 'T', 't', 'C', or 'c',
.
C alpha A A
T
T
A * beta C * + * =
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
PG- 05326- 032_V02 243
NVI DI A
Reference:http://www.netlib.org/blas/zsyrk.f
n
k
alpha
double-precision complex scalar multiplier applied to or
.
A
lda
beta
double-precision complex scalar multiplier applied to C.
C
double-precision complex array of dimensions (ldc, n). If uplo ==
ldc
Out put
C
.
A A
T
*
A
T
A *
C alpha A A
T
beta C * + * * =
C alpha A
T
A * beta C * + * =
244 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Funct ion cublasZsyr2k( )
void
cublasZsyr2k (char uplo, char trans, int n, int k,
cuDoubleComplex *C, int ldc)
wherealphaandbetaaredoubleprecisioncomplexscalars.Cisan
nnsymmetricmatrixconsistingofdoubleprecisioncomplex
elementsandisstoredineitherlowerorupperstoragemode.AandB
arematricesconsistingofdoubleprecisioncomplexelementswith
dimensionofnkinthefirstcaseandkninthesecondcase.
Error St at us
if n < 0 or k < 0
or
,
I nput
uplo
trans
. If trans == 'T',
't', 'C', or 'c', .
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
PG- 05326- 032_V02 245
NVI DI A
n
trans == 'N' or 'n'', n specifies the number of rows of matrices A
and B. If trans == 'T', 't', 'C', or 'c', n specifies the number of
columns of matrices A and B; n must be at least zero.
k
matrices A and B. If trans == 'T', 't', 'C', or 'c', k specifies the
number of rows of matrices A and B; k must be at least zero.
alpha
double-precision scalar multiplier.
A
lda
B
double-precision array of dimensions (ldb, kb), where kb = k when
ldb
beta
C
the updated matrix.
ldc
246 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/zsyr2k.f
Funct ion cublasZt rmm( )
void
cublasZtrmm (char side, char uplo, char transa,
char diag, int m, int n,
const cuDoubleComplex *B, int ldb)
alphaisadoubleprecisioncomplexscalar;Bisanmnmatrix
consistingofdoubleprecisioncomplexelements;andAisaunitor
nonunit,upperorlowertriangularmatrixconsistingofdouble
AandB,respectively.
Out put
C
or
.
Error St at us
if m < 0, n < 0, or k < 0
C alpha A B
T
alpha B A
T
beta C * + * * + * * =
C alpha A
T
B alpha B
T
A beta C * + * * + * * =
or ,
where , ,or ;
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
PG- 05326- 032_V02 247
NVI DI A
I nput
side
uplo
transa
diag
m
n
alpha
double-precision complex scalar multiplier applied to or
, respectively. If alpha is zero, no accesses are made to
matrix A, and no read accesses are made to matrix B.
A
double-precision complex array of dimensions (lda, k). If side ==
'L' or 'l', k = m. If side == 'R' or 'r', k = n. If uplo == 'U' or
'u', the leading kk upper triangular part of the array A must contain
lda
B
leading mn part of the array contains the matrix B. It is overwritten
with the transformed matrix on exit.
ldb
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
op(A)*B
B*op(A)
248 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ztrmm.f
Funct ion cublasZt rsm( )
void
cublasZtrsm (char side, char uplo, char transa,
char diag, int m, int n,
cuDoubleComplex *B, int ldb)
alphaisadoubleprecisioncomplexscalar,andXandBaremn
matricesthatconsistofdoubleprecisioncomplexelements.Aisaunit
ornonunit,upperorlower,triangularmatrix.
Out put
B
.
Error St at us
if m < 0 or n < 0
or ,
where , ,or ;
op A ( ) A = op A ( ) A
T
= op A ( ) A
H
=
PG- 05326- 032_V02 249
NVI DI A
I nput
side
uplo
transa
diag
m
n
alpha
double-precision complex scalar multiplier applied to B. When alpha
is zero, A is not referenced and B does not have to be a valid input.
A
double-precision complex array of dimensions (lda, k), where k is m
when side == 'L' or 'l' and is n when side == 'R' or 'r'. If
uplo == 'U' or 'u', the leading kk upper triangular part of the array
A must contain the upper triangular matrix, and the strictly lower
triangular matrix of A is not referenced. When uplo == 'L' or 'l',
the leading kk lower triangular part of the array A must contain the
lower triangular matrix, and the strictly upper triangular part of A is
not referenced. Note that when diag == 'U' or 'u', the diagonal
lda
B
double-precision complex array of dimensions (ldb, n); ldb must be
at least max(1, m). The leading mn part of the array B must contain
the right-hand side matrix B. On exit, B is overwritten by the solution
matrix X.
ldb
op A ( ) A =
op A ( ) A
T
=
op A ( ) A
H
=
250 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Reference:http://www.netlib.org/blas/ztrsm.f
Out put
B
.
Error St at us
if m < 0 or n < 0
PG- 05326- 032_V02 251
NVI DI A
A P P E N D I X
A
CUBLAS For t r an Bi ndi ngs
CUBLAisimplementedusingtheCbasedCUDAtoolchainandthus
providesaCstyleAPI.Thismakesinterfacingtoapplicationswritten
inCorC++trivial.Inaddition,therearemanyapplications
implementedinFortranthatwouldbenefitfromusingCUBLAS.
CUBLASuses1basedindexingandFortranstylecolumnmajor
storageformultidimensionaldatatosimplifyinterfacingtoFortran
applications.Unfortunately,FortrantoCcallingconventionsarenot
standardizedanddifferbyplatformandtoolchain.Inparticular,
differencesmayexistinthefollowingareas:
symbolnames(capitalization,namedecoration)
argumentpassing(byvalueorreference)
passingofstringarguments(lengthinformation)
passingofpointerarguments(sizeofthepointer)
returningfloatingpointorcompounddatatypes(forexample,
singleprecisionorcomplexdatatypes)
Toprovidemaximumflexibilityinaddressingthosedifferences,the
CUBLASFortraninterfaceisprovidedintheformofwrapper
functions,whicharewritteninCandprovidedintwoforms:
thethunkingwrapperinterfaceinthefilefortran_thunking.c
thedirectwrapperinterfaceinthefilefortran.c
252 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Thecodeofoneofthosetwofilesmustbecompiledintoanapplication
forittocalltheCUBLASAPIfunctions.Providingsourcecodeallows
userstomakeanychangesnecessaryforaparticularplatformand
toolchain.
ThecodeinthosetwoCfileshasbeenusedtodemonstrate
interoperabilitywiththecompilersg773.2.3andg950.91on32bit
Linux,g773.4.5andg950.91on64bitLinux,IntelFortran9.0andIntel
Fortran10.0on32bitand64bitMicrosoftWindowsXP,andg773.4.0
andg950.92onMacOSX.
Notethatforg77,useofthecompilerflag-fno-second-underscore
isrequiredtousethesewrappersasprovided.Also,theuseofthe
defaultcallingconventionswithregardtoargumentandreturnvalue
passingisexpected.Usingtheflag-fno-f2cchangesthedefault
callingconventionwithrespecttothesetwoitems.
ThethunkingwrappersallowinterfacingtoexistingFortran
applicationswithoutanychangestotheapplication.Duringeachcall,
thewrappersallocateGPUmemory,copysourcedatafromCPU
memoryspacetoGPUmemoryspace,callCUBLAS,andfinallycopy
backtheresultstoCPUmemoryspaceanddeallocatetheGPU
memory.Asthisprocesscausesverysignificantcalloverhead,these
wrappersareintendedforlighttesting,notforproductioncode.To
usethethunkingwrappers,theapplicationneedstobecompiledwith
thefilefortran_thunking.c.
Thedirectwrappers,intendedforproductioncode,substitutedevice
pointersforvectorandmatrixargumentsinallBLASfunctions.Touse
theseinterfaces,existingapplicationsmustbemodifiedslightlyto
allocateanddeallocatedatastructuresinGPUmemoryspace(using
CUBLAS_ALLOCandCUBLAS_FREE)andtocopydatabetweenGPUand
CPUmemoryspace(withCUBLAS_SET_VECTOR,CUBLAS_GET_VECTOR,
CUBLAS_SET_MATRIX,andCUBLAS_GET_MATRIX).Thesample
wrappersprovidedinfortran.cmapdevicepointerstothe
operatingsystemdependenttypesize_t,whichis32bitswideon32
bitplatformsand64bitswideona64bitplatforms.
Oneapproachtodealingwithindexarithmeticondevicepointersin
FortrancodeistouseCstylemacrosandtousetheCpreprocessorto
expandthese,asshownbelowinExample A.2.OnLinuxandMacOS
X,preprocessingcanbedoneusingthe'-E -x f77-cpp-input'
PG- 05326- 032_V02 253
NVI DI A
CHAPTER A CUBLAS Fort ran Bindings
optionwiththeg77compiler,orsimplyusingthe'-cpp'optionwith
g95orgfortran.OnWindowsplatformswithMicrosoftVisualC/C++,
using'cl-EP'achievessimilarresults.
WhentraditionalfixedformFortran77codeisportedtoCUBLAS,line
lengthoftenincreaseswhentheBLAScallsareexchangedforCUBLAS
calls.Longerfunctionnamesandpossiblemacroexpansionare
contributingfactors.Inadvertentlyexceedingthemaximumline
lengthcanleadtoruntimeerrorsthataredifficulttofind,socare
shouldbetakennottoexceedthe72columnlimitiffixedformis
retained.
Thefollowingtwoexamplesshowasmallapplicationimplementedin
Fortran77onthehost(Example A.1.,Fortran77Application
ExecutingontheHostonpage 254),andshowthesameapplication
usingthenonthunkingwrappersafterithasbeenportedtouse
CUBLAS(Example A.2.,Fortran77ApplicationPortedtoUse
CUBLASonpage 254).
ThesecondexampleshouldbecompiledwithARCH_64definedas1on
64bitoperatingsystemsandas0on32bitoperatingsystems.For
example,ong95orgfortranitcanbedonedirectlyonthecommand
linebyusingtheoption'-cpp -DARCH_64=1'.
254 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
Example A. 1. Fort ran 77 Applicat ion Execut ing on t he Host

subroutine modify (m, ldm, n, p, q, alpha, beta)
implicit none
real*4 m(ldm,*), alpha, beta
external sscal
call sscal (n-p+1, alpha, m(p,q), ldm)
call sscal (ldm-p+1, beta, m(p,q), 1)
return
end

program matrixmod
implicit none
integer M, N
parameter (M=6, N=5)
real*4 a(M,N)
integer i, j
do j = 1, N
do i = 1, M
a(i,j) = (i-1) * M + j
enddo
enddo
call modify (a, M, N, 2, 3, 16.0, 12.0)
do j = 1, N
do i = 1, M
write(*,"(F7.0$)") a(i,j)
enddo
write (*,*) ""
enddo
stop
end
Example A. 2. Fort ran 77 Applicat ion Port ed t o Use CUBLAS
subroutine modify (devPtrM, ldm, n, p, q, alpha, beta)
PG- 05326- 032_V02 255
NVI DI A
CHAPTER A CUBLAS Fort ran Bindings
implicit none
integer sizeof_real
parameter (sizeof_real=4)
#if ARCH_64
integer*8 devPtrM
#else
integer devPtrM
#endif
real*4 alpha, beta
call cublas_sscal (n-p+1, alpha,
1 devPtrM+IDX2F(p,q,ldm)*sizeof_real,
2 ldm)
call cublas_sscal (ldm-p+1, beta,
1 devPtrM+IDX2F(p,q,ldm)*sizeof_real,
2 1)
return
end
program matrixmod
implicit none
integer M, N, sizeof_real
#if ARCH_64
integer*8 devPtrA
#else
integer devPtrA
#endif
parameter (M=6, N=5, sizeof_real=4)
real*4 a(M,N)
integer i, j, stat
external cublas_init, cublas_set_matrix, cublas_get_matrix
external cublas_shutdown, cublas_alloc
integer cublas_alloc, cublas_set_matrix, cublas_get_matrix
do j = 1, N
do i = 1, M
a(i,j) = (i-1) * M + j
Example A. 2. Fort ran 77 Applicat ion Port ed t o Use CUBLAS ( cont inued)
256 PG- 05326- 032_V02
NVI DI A
CUDA CUBLAS Library
enddo
enddo

call cublas_init
stat = cublas_alloc(M*N, sizeof_real, devPtrA)
if (stat .NE. 0) then
write(*,*) "device memory allocation failed"
call cublas_shutdown
stop
endif
stat = cublas_set_matrix (M, N, sizeof_real, a, M, devPtrA, M)
call cublas_free (devPtrA)
write(*,*) "data download failed"
stop
endif
call modify (devPtrA, M, N, 2, 3, 16.0, 12.0)
stat = cublas_get_matrix (M, N, sizeof_real, devPtrA, M, a, M)
write(*,*) "data upload failed"
stop
endif
do j = 1, N
do i = 1, M
write(*,"(F7.0$)") a(i,j)
enddo
write (*,*) ""
enddo
stop
end
Example A. 2. Fort ran 77 Applicat ion Port ed t o Use CUBLAS ( cont inued)

Cublas Library

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cublas Library

Uploaded by

Copyright:

Available Formats

CUDA

Example 1. Fort ran 77 Applicat ion Execut ing on t he Host

Example 2. Applicat ion Using C and CUBLAS: 1- based I ndexing

Example A. 1. Fort ran 77 Applicat ion Execut ing on t he Host

You might also like