Optimizing FPGA-based vector product designs
Daniel Benyamin, Wayne Luk, and John Villasenor
Electrical Engineering Department, UCLA, Los Angeles, CA, 90095
benyamin,vill@icsl.ucla.edu
Department of Computing, Imperial College, 180 Queen's Gate, London, England SW7 2B
Abstract 
This paper presents a method, called multiple constant multiplier trees (MCMTs), for producing optimized reconfigurable hardware implementations of vector products. An algorithm for generating MCMTs has been developed and implemented, which is based on a novel representation of common subexpressions in constant data patterns. Our optimization framework covers a wider solution space than previous approaches; it also supports exploitation of full and partial run-time reconfiguration as well as technology-specific constraints, such as fanout limits and routing. We demonstrate that while distributed arithmetic techniques require storage size exponential in the number of coefficients, the resource utilization of MCMTs usually grows linearly with problem size. MCMTs have been implemented in Xilinx 4000 and Virtex FPGAs, and their size and speed efficiency are confirmed in comparisons with Xilinx LogiCore and ASIC implementations of FIR filter designs. Preliminary results show that the size of MCMT circuits is less than half of that of comparable distributed arithmetic cores.
1 Introduction
The vector dot product (or "sum of products") is one of the most common algorithmic kernels in signal processing. Its use ranges from filters and correlators to complete two-dimensional image transforms such as the Discrete Cosine Transform (DCT). The Mojave Project at UCLA [1] has shown that FPGAs can be highly effective in computing two-dimensional correlations using 1-bit adder trees. Specifically, the authors indicate an important way of leveraging the reconfigurability of FPGAs: an FPGA circuit designer can afford to implement circuits based on parameters that rarely or never change; these circuits can be designed very efficiently due to the availability of a priori knowledge.

Variable-operand circuits are preferred over constant-operand circuits in ASIC designs, due to the high non-recurring engineering cost required to change the constant operands. Indeed, constant-operand circuits only find their way into ASICs when they are widely applicable; the 8x8 matrix of DCT coefficients is one such example. Unless the problem size is small, however, implementing effective constant-operand FPGA circuits by hand can be very tedious.

This paper presents a hardware implementation of vector products called multiple constant multiplier trees (MCMTs). The core of the MCMT algorithm is our novel representation of common subexpressions contained in constant data patterns, which provides for several optimization capabilities not present in previous design strategies. First, we adopt a global optimization strategy to cover a wide solution space, thus generating solutions not considered by previous approaches. Second, our method is able to exploit the run-time reconfigurability of FPGAs, which can profoundly reduce the effective circuit size. Third, technology-specific constraints, such as fanout limits and routing, can be taken into account by our optimization procedure.
2 Overview and Related Work
A number of hardware implementations exist for computing vector products. There are straightforward designs, where the vector product is computed using constant coefficient multipliers (KCMs) and adders. Other techniques, such as distributed arithmetic (DA), directly compute vector products in a less obvious manner. In section 8 we provide comparisons of MCMTs to DA and other traditional architectures.

Our approach is to break down multiple constants to the bit level and represent them in a manner that is efficient for optimizing the required number of additions. While multiplications with a constant value are commonly performed using shifts and additions, research on the joint optimization of multiple constants is relatively new. Potkonjak [4] presents a generalized system named multiple constant multipliers (MCMs). This approach can be applied to any problem where an unknown is multiplied by more than one constant value, and can optimize for both the number of additions and the number of required shifts.

Feher [6] concentrates on generating vector product circuits for FPGAs, and applies the techniques to the two-dimensional DCT. Feher's work implements only bit-serial designs, which simplifies the problem of high fanout nodes but, due to the necessary parallel-serial conversion, is only of interest when the application can afford a low throughput. Furthermore, Feher's algorithm utilizes a pairwise greedy approach, which provides poor results for many problem cases. Additional work includes recoding for multiple coefficients [5] and optimizations for DSP architectures [3]. These techniques, however, do not exploit the full potential of FPGAs, especially in regards to run-time reconfiguration (RTR).
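3 Constant Multiplier Trees
For this paper we will apply the MCMT technique to problems involving the vector product

$$y = \sum_{k=0}^{K-1} a_k x_k \qquad (1)$$

where $\mathbf{x}$ is an unknown vector, but the vector $\mathbf{a}$ is known at compile time. Although we develop a methodology based on this simple vector product, the same techniques are applicable to any multiple constant multiplication problem.

We begin by decomposing the vector product in a manner that takes advantage of our a priori knowledge of the constant vector $\mathbf{a}$. Specifically, we write

$$a_k = \sum_{l=0}^{L-1} a_{l,k}\, 2^l$$

where $a_{l,k}$ is a single bit and $L$ is the precision of the coefficients. We assume for simplicity that $a_k$ is a positive integer, but the derivation can be applied to signed fixed-point systems as well. Substituting this into (1) gives

$$y = \sum_{k=0}^{K-1} \sum_{l=0}^{L-1} a_{l,k}\, 2^l x_k$$

and after exchanging the order of summations, we have

$$y = \sum_{l=0}^{L-1} 2^l \sum_{k=0}^{K-1} a_{l,k}\, x_k \qquad (2)$$

Note that $a_{L-1,k}$ denotes the most significant bit of the coefficient $a_k$.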
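As a small check of (2), consider a hypothetical two-coefficient case (chosen here for illustration, not taken from the paper) with $\mathbf{a} = (3, 5)$ and $L = 3$, so that $3 = 011_2$ and $5 = 101_2$. Grouping the terms by bit plane gives:

```latex
\begin{align*}
y &= 3x_0 + 5x_1 \\
  &= 2^0\,(1 \cdot x_0 + 1 \cdot x_1) + 2^1\,(1 \cdot x_0 + 0 \cdot x_1) + 2^2\,(0 \cdot x_0 + 1 \cdot x_1) \\
  &= (x_0 + x_1) + 2x_0 + 4x_1 .
\end{align*}
```

Each parenthesised group is one bit plane of the coefficients, and the only operations left are additions and shifts by a fixed number of bits.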
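There are some important features of (2) that should be noted. First, no hardware multipliers are needed, since $a_{l,k}$ takes on values 0 or 1, leaving only shifts to implement the $2^l$ multiplication. Second, the decomposition of $\mathbf{a}$ into bit-planes by equation (2) allows for simple exploitation of the symmetries and common subexpressions contained in the coefficient data. Moreover, since $a_{l,k}$ is known at compile time, we perform only the additions that are necessary during run time.

In order to generate trees from a vector product, it is useful to represent (2) in the following manner:

$$\begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_{L-1} \end{bmatrix} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,K-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,K-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{L-1,0} & a_{L-1,1} & \cdots & a_{L-1,K-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{K-1} \end{bmatrix}$$

with the desired result

$$y = Y_0 + 2Y_1 + \cdots + 2^{L-1} Y_{L-1}.$$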
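Denote the coefficient matrix of $a_{l,k}$ elements for output $y$ as $\mathbf{A}$. This matrix will be the basis of all optimizations and transformations in this paper. Note that $\mathbf{A}$ consists of binary data only, with the top row containing the least significant bits of the $a_k$ elements of $\mathbf{a}$, and the bottom row containing their most significant bits.

The product of each row of $\mathbf{A}$ with $\mathbf{x}$ is computed using a binary tree; if a row of $\mathbf{A}$ contains one or more 0's, then the corresponding tree will be a pruned tree. Regardless, the $L$ trees $Y_0, Y_1, \ldots, Y_{L-1}$ are summed together using one final adder tree combined with shifts to produce the value of $y$. For simplicity, our figures show a linearized version of this final tree; in practice a binary tree will often be used [3].

As an example used throughout this section, suppose we wish to compute

$$y = x_0 + 2x_1 + 3x_2 + 4x_3 + 5x_4 + 6x_5 + 7x_6 + 8x_7.$$

In this case $\mathbf{a} = (1, 2, 3, 4, 5, 6, 7, 8)$, and $\mathbf{A}$ is

$$\mathbf{A} = \begin{bmatrix} 1&0&1&0&1&0&1&0 \\ 0&1&1&0&0&1&1&0 \\ 0&0&0&1&1&1&1&0 \\ 0&0&0&0&0&0&0&1 \end{bmatrix} \qquad (3)$$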
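The bit-plane matrix of the running example can be reproduced with a short software model. The following is a minimal sketch, assuming non-negative integer coefficients; the function names `bit_plane_matrix` and `vector_product` are hypothetical and are not part of the MCMT tools described in the paper.

```python
def bit_plane_matrix(coeffs, precision=None):
    """Return the L x K binary matrix A with A[l][k] = bit l of coeffs[k].

    Row 0 holds the least significant bits, the last row the most significant.
    """
    if any(c < 0 for c in coeffs):
        raise ValueError("this sketch assumes non-negative integer coefficients")
    if precision is None:
        precision = max(max(coeffs).bit_length(), 1)
    return [[(c >> l) & 1 for c in coeffs] for l in range(precision)]


def vector_product(A, x):
    """Evaluate y = sum_l 2^l * Y_l, where Y_l adds only the x_k with A[l][k] == 1."""
    y = 0
    for l, row in enumerate(A):
        Y_l = sum(x_k for bit, x_k in zip(row, x) if bit)  # pruned row sum, no multiplier
        y += Y_l << l                                       # multiplying by 2^l is a left shift
    return y


a = [1, 2, 3, 4, 5, 6, 7, 8]
A = bit_plane_matrix(a)
for row in A:
    print("".join(str(b) for b in row))

x = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary test input
print(vector_product(A, x), sum(a_k * x_k for a_k, x_k in zip(a, x)))
```

The four printed rows match the rows of (3), and the two printed values agree, which confirms that the row sums combined with shifts give the same result as the direct sum of products.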
 
Figure 1: An unoptimized adder tree for computing a vector product. Shaded regions represent 1's in a coefficient's bit plane.

Figure 1 illustrates the unoptimized tree structure to compute this expression for an arbitrary vector $\mathbf{x}$. The coefficients for the four subtrees from left to right correspond to the rows of (3) from top to bottom. The leaves of Figure 1 are registers to hold the vector $\mathbf{x}$; as illustrated there are four copies of $\mathbf{x}$, one for each bit plane of coefficient data. The tree is constructed such that the shift-adds occur in order of significance: recall that the data's most significant bits, in our representation, are located at the bottom row of $\mathbf{A}$. This row is shifted left one bit and added to the next row above it, with the process repeating for each row. Thus, Figure 1 computes

$$y = Y_0 + 2\,(Y_1 + 2\,(Y_2 + 2\,Y_3)).$$

Note that the tree is completely specified by the data contained in $\mathbf{A}$; this allows us to optimize the tree structure simply by applying transformations to the matrix $\mathbf{A}$.
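The nested shift-add order described above can be modelled in a few lines. This is an illustrative sketch only: `horner_shift_add` and `unoptimized_adder_count` are hypothetical names, and the adder count assumes two-input adders with one adder per extra operand, which is a modelling assumption rather than a figure reported in the paper.

```python
A = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
]


def horner_shift_add(row_sums):
    """Combine row sums Y_0..Y_{L-1} as Y_0 + 2*(Y_1 + 2*(... + 2*Y_{L-1}))."""
    acc = 0
    for Y_l in reversed(row_sums):   # start from the most significant bit plane
        acc = (acc << 1) + Y_l       # shift left one bit, then add the next row above
    return acc


def unoptimized_adder_count(A):
    """Count two-input adders when no subexpressions are shared (assumed cost model)."""
    nonzero_rows = [row for row in A if any(row)]
    within_rows = sum(sum(row) - 1 for row in nonzero_rows)  # n operands need n - 1 adders
    final_tree = max(len(nonzero_rows) - 1, 0)               # combining the row sums
    return within_rows + final_tree


x = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary test input
row_sums = [sum(x_k for bit, x_k in zip(row, x) if bit) for row in A]
print(row_sums, horner_shift_add(row_sums), unoptimized_adder_count(A))
```

Under this cost model the matrix in (3) needs 9 adders inside the four row trees and 3 more to combine them, which gives a baseline against which any sharing of subtrees can be measured.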
4 Optimizations
MCMTs rely heavily on common subexpression elimination (CSE) to optimize the adder trees. CSE is implemented easily in trees, since a common subexpression is a subtree common to two or more different trees, and is "eliminated" by sharing one subtree's output and removing the others. While non-trivial for random logic, the formulation of $\mathbf{A}$ allows for the following, systematic approach.

4.1 Sharing common subexpressions
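Since the four sets of eight leaves in Figure 1 contain identical values, a hardware vector multiplier would combine the four subtrees corresponding to $Y_0$, $Y_1$, $Y_2$, and $Y_3$ into a single tree. A close inspection of Figure 1 reveals redundancies in the subtrees, such as the vertex connected to $x_2$ and $x_6$ in both $Y_0$ and $Y_1$; clearly, only one adder is necessary, and thus we label $x_2 + x_6$ as a common subexpression. This can be seen in the matrix $\mathbf{A}$ in Figure 2, with the four 1's marked $S_0$. Since adders are used to compute each row, whenever two or more 1's match, there is a possibility of a common subexpression.

In general, a subexpression $S$ is a set of locations of 1's in $\mathbf{A}$, where the row and column size is at least two since an adder needs at least two operands. A formal definition of subexpressions is now provided, concluding with the definition of a valid group of subexpressions.

Define the dot product of two sets $A$ and $B$ as

$$A \cdot B = \{(a, b) : a \in A \text{ and } b \in B\}.$$

Given that $\mathbf{A}$ is of size $L \times K$, define $C$ to be the set of columns of $\mathbf{A}$, $\{0, \ldots, K-1\}$, and $R$ to be the set of rows of $\mathbf{A}$, $\{0, \ldots, L-1\}$. Given that $\mathcal{P}(R)$ and $\mathcal{P}(C)$ are respectively the power sets of $R$ and $C$, the $i$-th common subexpression $S_i$ of $\mathbf{A}$ is defined as

$$S_i = R_i \cdot C_i$$

where $R_i$ and $C_i$ are sets that meet the following conditions:

1. $R_i \in \mathcal{P}(R)$ and $C_i \in \mathcal{P}(C)$;
2. $a_{r,c} = 1$ for all $r \in R_i$ and $c \in C_i$;
3. $|R_i| \ge 2$ and $|C_i| \ge 2$.

In Figure 2, for example, $R_0 = \{0, 1\}$ and $C_0 = \{2, 6\}$.

Two example CSEs are illustrated in Figure 2. Both $S_0$ and $S_1$ contain four elements, and $|R_i| = |C_i| = 2$ for both as well. Note also that $S_0 \cap S_1 \neq \emptyset$. In general, intersecting $S_i$'s produce valid circuits, but in this case they do not, since an element of $\mathbf{x}$ shared by both subexpressions would be added twice. Thus, we must alter the definition of intersection slightly in order to guarantee a correct result.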
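The definition above can be explored with a brute-force enumeration over row subsets. This is a sketch for illustration only, not the authors' optimization algorithm: the names `common_subexpressions` and `locations` are hypothetical, rows and columns are assumed to be numbered from 0, and for each row subset only the largest matching column set is reported.

```python
from itertools import combinations, product

A = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
]


def common_subexpressions(A):
    """List (R_i, C_i) pairs with all-ones entries, |R_i| >= 2 and |C_i| >= 2."""
    L, K = len(A), len(A[0])
    found = []
    for size in range(2, L + 1):
        for rows in combinations(range(L), size):
            # Columns in which every selected row contains a 1.
            cols = tuple(c for c in range(K) if all(A[r][c] for r in rows))
            if len(cols) >= 2:  # an adder needs at least two operands
                found.append((rows, cols))
    return found


def locations(rows, cols):
    """The 'dot product' R . C: every (row, column) location the subexpression covers."""
    return set(product(rows, cols))


cses = common_subexpressions(A)
for rows, cols in cses:
    print("rows", rows, "cols", cols)

# Two candidates overlap when they claim the same 1 in A.
overlap = locations((0, 1), (2, 6)) & locations((1, 2), (5, 6))
print("shared locations:", overlap)
```

For the matrix in (3) this reports three candidates, one for each of the row pairs (0, 1), (0, 2) and (1, 2). The candidates for row pairs (0, 1) and (1, 2) both cover location (1, 6), so using both adders in row 1 would count $x_6$ twice, which is the kind of conflict that the modified definition of intersection has to rule out.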
