Cache Optimization Techniques

Memory Organization

Data Locality
Temporal: if one data item is needed now, it is likely to be needed again in the near future
Spatial: if one data item is needed now, nearby data is likely to be needed in the near future
Exploiting Locality: Caches
- Keep recently used data in fast memory close to the processor
- Also bring nearby data there
EE/CS520 Comp. Archi. - 11/13/2012
Memory Organization
Basic idea: implement a memory hierarchy:
- Small size, fast, close to the processor
- Large size, slow, far from the processor
The levels, from largest and slowest to smallest and fastest (capacity grows toward the top of this list, speed toward the bottom):
- Disk
- Main Memory
- L3 Cache
- L2 Cache
- Instruction Cache / ITLB and Data Cache / DTLB
- Register File / Bypass Network
Improving Cache Performance
AMAT = hit time + miss rate x miss penalty
- Reduce miss penalty
- Reduce miss rate
- Reduce hit time
Memory stall cycles = cache misses x (total miss latency - overlapped miss latency)
- Increase overlapped miss latency
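The two formulas above can be sketched in code; the numbers used here are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of the two cache-performance formulas above.
# All input values are hypothetical.

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def memory_stall_cycles(cache_misses, miss_latency_total, miss_latency_overlapped):
    """Stall cycles = cache misses * (total miss latency - overlapped miss latency)."""
    return cache_misses * (miss_latency_total - miss_latency_overlapped)

print(amat(1, 0.05, 20))                   # 1 + 0.05 * 20 = 2.0 cycles
print(memory_stall_cycles(1000, 100, 40))  # 1000 * (100 - 40) = 60000 cycles
```

Note how increasing the overlapped miss latency directly shrinks the stall term, which is the lever the later non-blocking-cache optimization pulls.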
Basic Cache Optimizations
6 Basic Cache Optimizations
Reducing miss penalty:
1. Giving reads priority over writes: a read can complete before earlier writes still sitting in the write buffer
2. Multilevel caches
Reducing miss rate:
3. Larger block size
4. Larger cache size
5. Higher associativity
Reducing hit time:
6. Avoiding address translation during cache indexing
Basic Cache Optimizations
1. Larger block size
- Larger blocks take advantage of spatial locality
- Decreases miss rate
- Increases miss penalty
- Reduces the number of blocks in the cache, causing more conflict misses
Basic Cache Optimizations
2. Larger caches
- Reduces miss rate
- Hit time increases, as a bigger memory must be looked up
- Higher power and area consumption, and higher cost
3. Higher associativity
- An 8-way set-associative cache is as effective as a fully associative cache (in practice)
- 2:1 cache rule of thumb: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
- Reduces miss rate
- Increases hit time
Basic Cache Optimizations
4. Multilevel caches
- Very fast, small Level 1 (L1) cache
- Fast, not-so-small Level 2 (L2) cache
- May also have a slower, large L3 cache, etc.
Why does this help? A miss in the L1 cache can hit in the L2 cache, and so on.
AMAT = HitTime(L1) + MissRate(L1) x MissPenalty(L1)
MissPenalty(L1) = HitTime(L2) + MissRate(L2) x MissPenalty(L2)
MissPenalty(L2) = HitTime(L3) + MissRate(L3) x MissPenalty(L3)
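The recursion above can be sketched in code; the hierarchy used here (L1 at 1 cc with a 5% miss rate, L2 at 10 cc with a 20% miss rate, main memory at 100 cc) is a hypothetical example, not from the slides.

```python
# Sketch of the multilevel AMAT recursion: each level's miss penalty is
# the access time of the levels below it. Hypothetical numbers.

def access_time(levels, i=0):
    """AMAT seen at level i: hit time + miss rate * next level's access time."""
    hit_time, miss_rate = levels[i]
    if i == len(levels) - 1:
        return hit_time                  # last entry models main memory
    return hit_time + miss_rate * access_time(levels, i + 1)

# Assumed hierarchy: L1 (1 cc, 5% miss), L2 (10 cc, 20% miss), memory (100 cc).
levels = [(1, 0.05), (10, 0.20), (100, 0.0)]
print(access_time(levels))               # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cc
```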
Basic Cache Optimizations
5. Give priority to read misses over writes
- Reduces miss penalty
- Mostly used in write-through schemes that use write buffers
- The write buffer can cause RAW hazards, since it may hold updated data needed on a read miss
- In this scheme, the write buffer is checked for the value on a read miss; if there is no conflict and the memory system is available, the read completes earlier than the writes pending in the write buffer
Basic Cache Optimizations
6. Avoiding address translation during cache indexing
- Will be discussed later, once we reach virtual memory
Advanced Cache Optimizations
Advanced Optimizations for Caches
Reducing hit time:
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth (for superscalars):
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches
Reducing miss penalty:
7. Critical word first
8. Merging write buffers
Reducing miss rate:
9. Victim cache
10. Hardware prefetching
11. Compiler prefetching
12. Compiler optimizations
1. Small and Simple Caches
1st idea: keep it small
- Translating the index portion and comparing the tag portion is time-consuming
- Keep L2 small enough to fit on chip and avoid the off-chip time penalty
2nd idea: keep it simple
- A direct-mapped cache has a lower hit time than a set-associative one
- Example: overall on-chip cache size has increased, but L1 size stays constant: 3 generations of AMD processors have the same L1 cache size
1. Small and Simple Caches
[Figure: relative cache access time grows with cache size: 1.00x, 1.32x, 1.39x, 1.43x]
1. Small and Simple Caches: Example
- A 4-way L1 cache is 1.1x slower than a 2-way L1 cache
- Miss rate (2-way) = 0.049
- Miss rate (4-way) = 0.044
- Hit time = 1 cc (the cache is on the critical path of the processor)
- Miss penalty to L2 = 10 cc
- Which one is faster?
Solution:
- AMAT(2-way) = hit time + miss rate x miss penalty = 1 + 0.049 x 10 = 1.49
- For the 4-way cache, hit time = 1.1x longer = 1.1 cc
- The miss penalty stays roughly the same, since it depends on L2 speed and not on the processor; assume it is 9 cc of the longer clock
- AMAT(4-way) = 1.1 + 0.044 x 9 = 1.50
The 2-way L1 cache is better. Moreover, in reality, if the processor clock is really stretched by 1.1x, performance would worsen even more, since the whole system runs slower even when it is not accessing the cache.
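The example's arithmetic can be reproduced directly from the slide's values:

```python
# Reproducing the slide's comparison of a 2-way vs. a 4-way L1 cache.

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

amat_2way = amat(1.0, 0.049, 10)   # 1.49 cc
amat_4way = amat(1.1, 0.044, 9)    # 1.496 cc, i.e. ~1.50 cc
print(amat_2way, amat_4way)        # the 2-way cache wins, narrowly
```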
2. Way Prediction
- Used in N-way set-associative caches
- Extra bits are kept to predict which of the N ways holds the requested data block
- The MUX is set early to select the predicted block
- The extra bits are called block predictor bits
- Could be as fast as a direct-mapped cache if the prediction is correct
- On a misprediction, check the other blocks for matches in the next clock cycle
- Simulations suggest >85% accuracy for a 2-way set
- A good match for speculative processors; used in the Pentium 4
3. Trace Caches
- Already discussed in the Pentium 4 case study
- Used for the I-cache only
- Holds a dynamic trace of the instructions to be executed
- Can work beyond branches
- Uses temporal locality instead of the traditional spatial locality of I-caches
3. Trace Caches
[Figure: a trace cache assembles one trace from multiple traditional I-cache lines per cycle]
- Can suffer from instruction duplication: the same instruction can be part of different traces (e.g., ABC, DEA, XAB), lowering space efficiency
- Expensive in area, power, and complexity
- Was a one-time innovation
4. Pipelined Caches
- Pipeline the cache access: the effective latency of an L1 cache hit can be multiple clock cycles rather than 1 cc
- Gives a faster clock cycle time and higher bandwidth, but a slower hit time
- Essential for L1 caches at high frequency: even small L1 caches take 2-3 cc at GHz clock rates
- Example: Pentium L1 cache hit time = 1 cc; Pentium M L1 cache hit time = 2 cc; Pentium 4 L1 cache hit time = 4 cc
5. Non-Blocking Caches
Memory stall cycles = cache misses x (total miss latency - overlapped miss latency)
- Idea: overlap the miss latency with useful work; also called latency hiding
- A blocking cache services one access at a time: while a miss is serviced, other accesses are blocked (must wait)
- Remember Tomasulo's example with the loop: the first LD had a cache miss, and the 2nd LD had to wait for the 1st to complete
- Non-blocking caches remove this limitation: while a miss is serviced, the cache can process other requests (the "hit under miss" optimization)
- Hit under 1 miss: allow cache hits while one miss is in progress, but another miss has to wait
- Miss under miss, hit under multiple misses: allow hits and misses while other misses are in progress; the memory system must support multiple pending requests
5. Non-Blocking Caches
[Figure: memory stall time of hit-under-miss configurations relative to a blocking cache, for FP programs and integer programs]
5. Non-Blocking Caches: Example
Which is more important for FP programs?
1. 2-way set associativity
2. Hit under one miss (with a direct-mapped (DM) cache)
- Avg. miss rate = 11.4% (direct-mapped), 10.7% (2-way)
Same question for integer programs:
- Avg. miss rate = 7.4% (direct-mapped), 6% (2-way)
Assume avg. memory stall time = miss rate x miss penalty, with miss penalty to L2 = 16 cc.
Solution:
For FP programs:
- Avg. memory stall time (direct-mapped, DM) = 11.4% x 16 = 1.82
- Avg. memory stall time (2-way set-associative) = 10.7% x 16 = 1.71
- Memory stalls (2-way) relative to DM = 1.71 / 1.82 = 94%
- Memory stalls with hit under one miss = 73% of DM's (from the previous graph)
- Hence DM with hit under one miss is better than 2-way set-associative
For integer programs:
- Avg. memory stall time (DM) = 7.4% x 16 = 1.18
- Avg. memory stall time (2-way) = 6% x 16 = 0.96
- Memory stalls (2-way) relative to DM = 0.96 / 1.18 = 81%
- Memory stalls with hit under one miss = 81% of DM's (from the previous graph)
- Hence DM with hit under one miss and 2-way set-associative give equal performance
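The relative-stall calculation can be reproduced in code from the miss rates in the example (the 73% and 81% hit-under-one-miss figures come from the graph and cannot be recomputed here):

```python
# Reproducing the example's comparison. Miss penalty to L2 = 16 cc;
# the penalty cancels out in the ratio.

def relative_stall(dm_miss_rate, assoc_miss_rate, miss_penalty=16):
    """Memory stalls of the 2-way cache relative to direct-mapped (DM)."""
    return (assoc_miss_rate * miss_penalty) / (dm_miss_rate * miss_penalty)

# FP: 2-way keeps ~94% of DM's stalls, vs. 73% for DM + hit under one miss.
print(f"FP:  {relative_stall(0.114, 0.107):.0%}")
# Integer: both configurations land at ~81%.
print(f"int: {relative_stall(0.074, 0.060):.0%}")
```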
6. Multi-Banked Cache
- Divide the cache into independent banks that support simultaneous accesses
- Cache bandwidth is increased
- Examples: AMD Opteron L2 = 2 banks; Sun Niagara L2 = 4 banks
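A common way to spread accesses across banks is sequential interleaving; the mapping below is a sketch of that scheme, not a description of any specific processor.

```python
# Sketch of sequential interleaving across cache banks: consecutive
# block addresses map to consecutive banks, so nearby accesses can
# proceed in parallel. Bank count chosen to match Niagara's L2.

NUM_BANKS = 4

def bank_of(block_address):
    """Sequentially interleaved bank selection."""
    return block_address % NUM_BANKS

# Eight consecutive blocks cycle through all four banks twice.
print([bank_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```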
7. Critical Word First and Early Restart
Critical word first:
- Request the missed word first from memory
- Send it to the processor ASAP and let the processor continue its work
- Keep filling the rest of the block into the cache
Early restart:
- Fetch the words in order
- As soon as the requested word arrives, send it to the processor
- Keep fetching the rest of the block
Both techniques are effective only if the block size is large.
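The fetch order under critical word first is a wrap-around fill starting at the missed word; the sketch below illustrates that order for an assumed 8-word block.

```python
# Sketch: the order in which an 8-word block is returned from memory
# under "critical word first" (wrap-around fill from the missed word).

def critical_word_first_order(block_size_words, missed_word):
    """Word offsets in fetch order, starting at the critical word."""
    return [(missed_word + i) % block_size_words for i in range(block_size_words)]

# Miss on word 5 of an 8-word block: word 5 reaches the processor first,
# and the remaining words then fill the rest of the cache block.
print(critical_word_first_order(8, 5))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

Under early restart, by contrast, the fetch order would simply be 0..7, with the processor restarted as soon as word 5 arrives.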
8. Merging Write Buffer
- If multiple write misses occur to the same block, combine them in the write buffer
- Use one block write instead of many small word writes
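The buffer structure below is a hypothetical sketch of the merging idea: writes to the same block coalesce into a single entry, so the whole block can later be written to memory in one transfer.

```python
# Sketch of a merging write buffer (hypothetical structure): word writes
# to the same block are coalesced into one buffer entry.

BLOCK_WORDS = 4  # assumed words per cache block

class MergingWriteBuffer:
    def __init__(self):
        self.entries = {}  # block address -> {word offset: data}

    def write(self, word_address, data):
        block, offset = divmod(word_address, BLOCK_WORDS)
        # Merge into the existing entry for this block, if any.
        self.entries.setdefault(block, {})[offset] = data

buf = MergingWriteBuffer()
for addr in (100, 101, 102, 103):  # four word writes, all in one block
    buf.write(addr, addr * 10)
print(len(buf.entries))            # 1 entry -> a single block write to memory
```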
9. Victim Caches
- Recently evicted blocks are kept in a small cache
- If we miss on those blocks, we can get them back faster than from the next memory level
- Effective against conflict misses
- A victim cache prevents thrashing when several popular blocks map to the same entry
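The anti-thrashing behavior can be sketched with a toy model: a tiny direct-mapped cache backed by a small fully associative victim cache. The sizes and the class itself are hypothetical, chosen only to show the mechanism.

```python
# Sketch (hypothetical sizes): a direct-mapped cache whose evicted blocks
# land in a small fully associative, LRU-managed victim cache.
from collections import OrderedDict

class CacheWithVictim:
    def __init__(self, num_sets=4, victim_size=2):
        self.num_sets = num_sets
        self.sets = {}                # set index -> resident block address
        self.victim = OrderedDict()   # evicted block address (LRU order)
        self.victim_size = victim_size

    def access(self, block):
        idx = block % self.num_sets
        if self.sets.get(idx) == block:
            return "hit"
        if block in self.victim:      # fast recovery from a conflict miss
            del self.victim[block]
            self._install(idx, block)
            return "victim hit"
        self._install(idx, block)
        return "miss"

    def _install(self, idx, block):
        evicted = self.sets.get(idx)
        if evicted is not None:
            self.victim[evicted] = True
            if len(self.victim) > self.victim_size:
                self.victim.popitem(last=False)  # drop least recently evicted
        self.sets[idx] = block

c = CacheWithVictim()
# Blocks 0 and 4 map to the same set and would thrash a plain DM cache;
# with the victim cache, the repeats become fast "victim hits".
print([c.access(b) for b in (0, 4, 0, 4)])
```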
10. Hardware Prefetching
- Predict the future needs of the processor and bring data into the cache ahead of time
- If the access does happen, we have a hit
- If the access does not happen, we get cache pollution: useful data is replaced with junk
- Pollution is a big problem for small caches
- To avoid pollution, use prefetch buffers: keep a small separate buffer for prefetches; when we do access the prefetched data, move it into the cache; if we don't access it, the cache is not polluted
- Prefetching relies on using otherwise unused memory bandwidth; if that bandwidth is not free, prefetching can degrade performance
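A minimal sketch of these ideas, assuming a simple next-block prefetcher (a hypothetical model, not a specific machine): on a miss to block b, block b+1 is fetched into a small buffer instead of the cache, so a wrong guess does not pollute the cache.

```python
# Sketch of next-block hardware prefetching with a prefetch buffer.
# Blocks are modeled as integers; hit/miss bookkeeping only.

cache, prefetch_buffer = set(), set()

def access(block):
    if block in cache:
        return "hit"
    if block in prefetch_buffer:        # prediction was right:
        prefetch_buffer.discard(block)
        cache.add(block)                # promote into the cache only on use
        prefetch_buffer.add(block + 1)  # keep prefetching ahead
        return "prefetch hit"
    cache.add(block)                    # demand miss
    prefetch_buffer.add(block + 1)      # prefetch the next block
    return "miss"

# A sequential scan: after the first miss, the prefetcher stays ahead.
print([access(b) for b in (10, 11, 12)])  # ['miss', 'prefetch hit', 'prefetch hit']
```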