Cache Optimization Techniques

Memory Organization

Data Locality
Temporal: if one data item is needed now, it is likely to be needed again in the near future
Spatial: if one data item is needed now, nearby data is likely to be needed in the near future
Exploiting Locality: Caches
- Keep recently used data in fast memory close to the processor
- Also bring nearby data there
EE/CS520 Comp. Archi. - 11/13/2012
Memory Organization
Basic idea: implement a memory hierarchy:
- Small size, fast, close to the processor
- Large size, slow, far from the processor
The levels, from largest and slowest to smallest and fastest (capacity grows toward the top of this list, speed toward the bottom):
- Disk
- Main Memory
- L3 Cache
- L2 Cache
- Instruction Cache / ITLB and Data Cache / DTLB
- Register File / Bypass Network
Improving Cache Performance
AMAT = hit time + miss rate x miss penalty
- Reduce miss penalty
- Reduce miss rate
- Reduce hit time
Memory stall cycles = cache misses x (total miss latency - overlapped miss latency)
- Increase overlapped miss latency
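The two formulas above can be sketched in code; the numbers used here are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of the two cache-performance formulas above.
# All input values are hypothetical.

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def memory_stall_cycles(cache_misses, miss_latency_total, miss_latency_overlapped):
    """Stall cycles = cache misses * (total miss latency - overlapped miss latency)."""
    return cache_misses * (miss_latency_total - miss_latency_overlapped)

print(amat(1, 0.05, 20))                   # 1 + 0.05 * 20 = 2.0 cycles
print(memory_stall_cycles(1000, 100, 40))  # 1000 * (100 - 40) = 60000 cycles
```

Note how increasing the overlapped miss latency directly shrinks the stall term, which is the lever the later non-blocking-cache optimization pulls.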
Basic Cache Optimizations
6 Basic Cache Optimizations
Reducing miss penalty:
1. Giving reads priority over writes: a read can complete before earlier writes still sitting in the write buffer
2. Multilevel caches
Reducing miss rate:
3. Larger block size
4. Larger cache size
5. Higher associativity
Reducing hit time:
6. Avoiding address translation during cache indexing
Basic Cache Optimizations
1. Larger block size
- Larger blocks take advantage of spatial locality
- Decreases miss rate
- Increases miss penalty
- Reduces the number of blocks in the cache, causing more conflict misses
Basic Cache Optimizations
2. Larger caches
- Reduces miss rate
- Hit time increases, as a bigger memory must be looked up
- Higher power and area consumption, and higher cost
3. Higher associativity
- An 8-way set-associative cache is as effective as a fully associative cache (in practice)
- 2:1 cache rule of thumb: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
- Reduces miss rate
- Increases hit time
Basic Cache Optimizations
4. Multilevel caches
- Very fast, small Level 1 (L1) cache
- Fast, not-so-small Level 2 (L2) cache
- May also have a slower, large L3 cache, etc.
Why does this help? A miss in the L1 cache can hit in the L2 cache, and so on.
AMAT = HitTime(L1) + MissRate(L1) x MissPenalty(L1)
MissPenalty(L1) = HitTime(L2) + MissRate(L2) x MissPenalty(L2)
MissPenalty(L2) = HitTime(L3) + MissRate(L3) x MissPenalty(L3)
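The recursion above can be sketched in code; the hierarchy used here (L1 at 1 cc with a 5% miss rate, L2 at 10 cc with a 20% miss rate, main memory at 100 cc) is a hypothetical example, not from the slides.

```python
# Sketch of the multilevel AMAT recursion: each level's miss penalty is
# the access time of the levels below it. Hypothetical numbers.

def access_time(levels, i=0):
    """AMAT seen at level i: hit time + miss rate * next level's access time."""
    hit_time, miss_rate = levels[i]
    if i == len(levels) - 1:
        return hit_time                  # last entry models main memory
    return hit_time + miss_rate * access_time(levels, i + 1)

# Assumed hierarchy: L1 (1 cc, 5% miss), L2 (10 cc, 20% miss), memory (100 cc).
levels = [(1, 0.05), (10, 0.20), (100, 0.0)]
print(access_time(levels))               # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cc
```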
Basic Cache Optimizations
5. Give priority to read misses over writes
- Reduces miss penalty
- Mostly used in write-through schemes that use write buffers
- The write buffer can cause RAW hazards, since it may hold updated data needed on a read miss
- In this scheme, the write buffer is checked for the value on a read miss; if there is no conflict and the memory system is available, the read completes earlier than the writes pending in the write buffer
Basic Cache Optimizations
6. Avoiding address translation during cache indexing
- Will be discussed later, once we reach virtual memory
Advanced Cache Optimizations
Advanced Optimizations for Caches
Reducing hit time:
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth (for superscalars):
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches
Reducing miss penalty:
7. Critical word first
8. Merging write buffers
Reducing miss rate:
9. Victim cache
10. Hardware prefetching
11. Compiler prefetching
12. Compiler optimizations
1. Small and Simple Caches
1st idea: keep it small
- Translating the index portion and comparing the tag portion is time-consuming
- Keep L2 small enough to fit on chip and avoid the off-chip time penalty
2nd idea: keep it simple
- A direct-mapped cache has a lower hit time than a set-associative one
- Example: overall on-chip cache size has increased, but L1 size stays constant: 3 generations of AMD processors have the same L1 cache size
1. Small and Simple Caches
[Figure: relative cache access time grows with cache size: 1.00x, 1.32x, 1.39x, 1.43x]
1. Small and Simple Caches: Example
- A 4-way L1 cache is 1.1x slower than a 2-way L1 cache
- Miss rate (2-way) = 0.049
- Miss rate (4-way) = 0.044
- Hit time = 1 cc (the cache is on the critical path of the processor)
- Miss penalty to L2 = 10 cc
- Which one is faster?
Solution:
- AMAT(2-way) = hit time + miss rate x miss penalty = 1 + 0.049 x 10 = 1.49
- For the 4-way cache, hit time = 1.1x longer = 1.1 cc
- The miss penalty stays roughly the same, since it depends on L2 speed and not on the processor; assume it is 9 cc of the longer clock
- AMAT(4-way) = 1.1 + 0.044 x 9 = 1.50
The 2-way L1 cache is better. Moreover, in reality, if the processor clock is really stretched by 1.1x, performance would worsen even more, since the whole system runs slower even when it is not accessing the cache.
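The example's arithmetic can be reproduced directly from the slide's values:

```python
# Reproducing the slide's comparison of a 2-way vs. a 4-way L1 cache.

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

amat_2way = amat(1.0, 0.049, 10)   # 1.49 cc
amat_4way = amat(1.1, 0.044, 9)    # 1.496 cc, i.e. ~1.50 cc
print(amat_2way, amat_4way)        # the 2-way cache wins, narrowly
```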
2. Way Prediction
- Used in N-way set-associative caches
- Extra bits are kept to predict which of the N ways holds the requested data block
- The MUX is set early to select the predicted block
- The extra bits are called block predictor bits
- Could be as fast as a direct-mapped cache if the prediction is correct
- On a misprediction, check the other blocks for matches in the next clock cycle
- Simulations suggest >85% accuracy for a 2-way set
- A good match for speculative processors; used in the Pentium 4
3. Trace Caches
- Already discussed in the Pentium 4 case study
- Used for the I-cache only
- Holds a dynamic trace of the instructions to be executed
- Can work beyond branches
- Uses temporal locality instead of the traditional spatial locality of I-caches
3. Trace Caches
[Figure: a trace cache assembles one trace from multiple traditional I-cache lines per cycle]
- Can suffer from instruction duplication: the same instruction can be part of different traces (e.g., ABC, DEA, XAB), lowering space efficiency
- Expensive in area, power, and complexity
- Was a one-time innovation
4. Pipelined Caches
- Pipeline the cache access: the effective latency of an L1 cache hit can be multiple clock cycles rather than 1 cc
- Gives a faster clock cycle time and higher bandwidth, but a slower hit time
- Essential for L1 caches at high frequency: even small L1 caches take 2-3 cc at GHz clock rates
- Example: Pentium L1 cache hit time = 1 cc; Pentium M L1 cache hit time = 2 cc; Pentium 4 L1 cache hit time = 4 cc
5. Non-Blocking Caches
Memory stall cycles = cache misses x (total miss latency - overlapped miss latency)
- Idea: overlap the miss latency with useful work; also called latency hiding
- A blocking cache services one access at a time: while a miss is serviced, other accesses are blocked (must wait)
- Remember Tomasulo's example with the loop: the first LD had a cache miss, and the 2nd LD had to wait for the 1st to complete
- Non-blocking caches remove this limitation: while a miss is serviced, the cache can process other requests (the "hit under miss" optimization)
- Hit under 1 miss: allow cache hits while one miss is in progress, but another miss has to wait
- Miss under miss, hit under multiple misses: allow hits and misses while other misses are in progress; the memory system must support multiple pending requests
5. Non-Blocking Caches
[Figure: memory stall time of hit-under-miss configurations relative to a blocking cache, for FP programs and integer programs]
5. Non-Blocking Caches: Example
Which is more important for FP programs?
1. 2-way set associativity
2. Hit under one miss (with a direct-mapped (DM) cache)
- Avg. miss rate = 11.4% (direct-mapped), 10.7% (2-way)
Same question for integer programs:
- Avg. miss rate = 7.4% (direct-mapped), 6% (2-way)
Assume avg. memory stall time = miss rate x miss penalty, with miss penalty to L2 = 16 cc.
Solution:
For FP programs:
- Avg. memory stall time (direct-mapped, DM) = 11.4% x 16 = 1.82
- Avg. memory stall time (2-way set-associative) = 10.7% x 16 = 1.71
- Memory stalls (2-way) relative to DM = 1.71 / 1.82 = 94%
- Memory stalls with hit under one miss = 73% of DM's (from the previous graph)
- Hence DM with hit under one miss is better than 2-way set-associative
For integer programs:
- Avg. memory stall time (DM) = 7.4% x 16 = 1.18
- Avg. memory stall time (2-way) = 6% x 16 = 0.96
- Memory stalls (2-way) relative to DM = 0.96 / 1.18 = 81%
- Memory stalls with hit under one miss = 81% of DM's (from the previous graph)
- Hence DM with hit under one miss and 2-way set-associative give equal performance
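The relative-stall calculation can be reproduced in code from the miss rates in the example (the 73% and 81% hit-under-one-miss figures come from the graph and cannot be recomputed here):

```python
# Reproducing the example's comparison. Miss penalty to L2 = 16 cc;
# the penalty cancels out in the ratio.

def relative_stall(dm_miss_rate, assoc_miss_rate, miss_penalty=16):
    """Memory stalls of the 2-way cache relative to direct-mapped (DM)."""
    return (assoc_miss_rate * miss_penalty) / (dm_miss_rate * miss_penalty)

# FP: 2-way keeps ~94% of DM's stalls, vs. 73% for DM + hit under one miss.
print(f"FP:  {relative_stall(0.114, 0.107):.0%}")
# Integer: both configurations land at ~81%.
print(f"int: {relative_stall(0.074, 0.060):.0%}")
```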
6. Multi-Banked Cache
- Divide the cache into independent banks that support simultaneous accesses
- Cache bandwidth is increased
- Examples: AMD Opteron L2 = 2 banks; Sun Niagara L2 = 4 banks
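A common way to spread accesses across banks is sequential interleaving; the mapping below is a sketch of that scheme, not a description of any specific processor.

```python
# Sketch of sequential interleaving across cache banks: consecutive
# block addresses map to consecutive banks, so nearby accesses can
# proceed in parallel. Bank count chosen to match Niagara's L2.

NUM_BANKS = 4

def bank_of(block_address):
    """Sequentially interleaved bank selection."""
    return block_address % NUM_BANKS

# Eight consecutive blocks cycle through all four banks twice.
print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```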
7. Critical Word First and Early Restart
Critical word first:
- Request the missed word first from memory
- Send it to the processor ASAP and let the processor continue its work
- Keep filling the rest of the block into the cache
Early restart:
- Fetch the words in order
- As soon as the requested word arrives, send it to the processor
- Keep fetching the rest of the block
Both techniques are effective only if the block size is large.
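The fetch order under critical word first is a wrap-around fill starting at the missed word; the sketch below illustrates that order for an assumed 8-word block.

```python
# Sketch: the order in which an 8-word block is returned from memory
# under "critical word first" (wrap-around fill from the missed word).

def critical_word_first_order(block_size_words, missed_word):
    """Word offsets in fetch order, starting at the critical word."""
    return [(missed_word + i) % block_size_words for i in range(block_size_words)]

# Miss on word 5 of an 8-word block: word 5 reaches the processor first,
# and the remaining words then fill the rest of the cache block.
print(critical_word_first_order(8, 5))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

Under early restart, by contrast, the fetch order would simply be 0..7, with the processor restarted as soon as word 5 arrives.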
8. Merging Write Buffer
- If multiple write misses occur to the same block, combine them in the write buffer
- Use one block write instead of many small word writes
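The buffer structure below is a hypothetical sketch of the merging idea: writes to the same block coalesce into a single entry, so the whole block can later be written to memory in one transfer.

```python
# Sketch of a merging write buffer (hypothetical structure): word writes
# to the same block are coalesced into one buffer entry.

BLOCK_WORDS = 4  # assumed words per cache block

class MergingWriteBuffer:
    def __init__(self):
        self.entries = {}  # block address -> {word offset: data}

    def write(self, word_address, data):
        block, offset = divmod(word_address, BLOCK_WORDS)
        # Merge into the existing entry for this block, if any.
        self.entries.setdefault(block, {})[offset] = data

buf = MergingWriteBuffer()
for addr in (100, 101, 102, 103):  # four word writes, all in one block
    buf.write(addr, addr * 10)
print(len(buf.entries))            # 1 entry -> a single block write to memory
```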
9. Victim Caches
- Recently evicted blocks are kept in a small cache
- If we miss on those blocks, we can get them back faster than from the next memory level
- Effective against conflict misses
- A victim cache prevents thrashing when several popular blocks map to the same entry
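The anti-thrashing behavior can be sketched with a toy model: a tiny direct-mapped cache backed by a small fully associative victim cache. The sizes and the class itself are hypothetical, chosen only to show the mechanism.

```python
# Sketch (hypothetical sizes): a direct-mapped cache whose evicted blocks
# land in a small fully associative, LRU-managed victim cache.
from collections import OrderedDict

class CacheWithVictim:
    def __init__(self, num_sets=4, victim_size=2):
        self.num_sets = num_sets
        self.sets = {}                # set index -> resident block address
        self.victim = OrderedDict()   # evicted block address (LRU order)
        self.victim_size = victim_size

    def access(self, block):
        idx = block % self.num_sets
        if self.sets.get(idx) == block:
            return "hit"
        if block in self.victim:      # fast recovery from a conflict miss
            del self.victim[block]
            self._install(idx, block)
            return "victim hit"
        self._install(idx, block)
        return "miss"

    def _install(self, idx, block):
        evicted = self.sets.get(idx)
        if evicted is not None:
            self.victim[evicted] = True
            if len(self.victim) > self.victim_size:
                self.victim.popitem(last=False)  # drop least recently evicted
        self.sets[idx] = block

c = CacheWithVictim()
# Blocks 0 and 4 map to the same set and would thrash a plain DM cache;
# with the victim cache, the repeats become fast "victim hits".
print([c.access(b) for b in (0, 4, 0, 4)])
```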
10. Hardware Prefetching
- Predict the future needs of the processor and bring data into the cache ahead of time
- If the access does happen, we have a hit
- If the access does not happen, we get cache pollution: useful data is replaced with junk
- Pollution is a big problem for small caches
- To avoid pollution, use prefetch buffers: keep a small separate buffer for prefetches; when we do access the prefetched data, move it into the cache; if we don't access it, the cache is not polluted
- Prefetching relies on using otherwise unused memory bandwidth; if that bandwidth is not free, prefetching can degrade performance
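A minimal sketch of these ideas, assuming a simple next-block prefetcher (a hypothetical model, not a specific machine): on a miss to block b, block b+1 is fetched into a small buffer instead of the cache, so a wrong guess does not pollute the cache.

```python
# Sketch of next-block hardware prefetching with a prefetch buffer.
# Blocks are modeled as integers; hit/miss bookkeeping only.

cache, prefetch_buffer = set(), set()

def access(block):
    if block in cache:
        return "hit"
    if block in prefetch_buffer:        # prediction was right:
        prefetch_buffer.discard(block)
        cache.add(block)                # promote into the cache only on use
        prefetch_buffer.add(block + 1)  # keep prefetching ahead
        return "prefetch hit"
    cache.add(block)                    # demand miss
    prefetch_buffer.add(block + 1)      # prefetch the next block
    return "miss"

# A sequential scan: after the first miss, the prefetcher stays ahead.
print([access(b) for b in (10, 11, 12)])  # ['miss', 'prefetch hit', 'prefetch hit']
```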