
Lecture 20

Cache Optimization Techniques

Memory Organization

Data Locality
- Temporal: if one data item is needed now, it is likely to be needed again in the near future
- Spatial: if one data item is needed now, nearby data are likely to be needed in the near future

Exploiting Locality: Caches
- Keep recently used data in fast memory close to the processor
- Also bring nearby data there

EE/CS520 Comp. Archi.
11/13/2012

Memory Organization

Basic idea: implement a memory hierarchy:
- Small size, fast, close to the processor
- Large size, slow, far from the processor

[Figure: the memory hierarchy, with capacity growing toward the bottom and speed growing toward the top]
Disk
Main Memory
L3 Cache
L2 Cache
ITLB / Instruction Cache  |  Data Cache / DTLB
Register File / Bypass Network

Improving Cache Performance

AMAT = hit time + miss rate × miss penalty
- Reduce miss penalty
- Reduce miss rate
- Reduce hit time

Memory stall cycles = cache misses × (miss latency_total − miss latency_overlapped)
- Increase the overlapped miss latency
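Both formulas above can be written down directly; the numbers used below are illustrative, not from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

def memory_stall_cycles(cache_misses, total_miss_latency, overlapped_miss_latency):
    """Only the non-overlapped part of each miss actually stalls the CPU."""
    return cache_misses * (total_miss_latency - overlapped_miss_latency)

# Illustrative numbers: 1 cc hit time, 5% miss rate, 20 cc miss penalty
print(amat(1, 0.05, 20))                 # 2.0
# 1000 misses of 20 cc each, 8 cc hidden behind useful work
print(memory_stall_cycles(1000, 20, 8))  # 12000
```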


Basic Cache Optimizations


6 Basic Cache Optimizations

Reducing miss penalty
1. Giving reads priority over writes
   - A read can complete before earlier writes still sitting in the write buffer
2. Multilevel caches

Reducing miss rate
3. Larger block size
4. Larger cache size
5. Higher associativity

Reducing hit time
6. Avoiding address translation during cache indexing


Basic Cache Optimizations

1. Larger block size
- Larger blocks take advantage of spatial locality
- Decreases the miss rate
- Increases the miss penalty
- Reduces the number of blocks in the cache, causing more conflict misses
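The last point is simple arithmetic: for a fixed cache size, growing the block shrinks the block count. The sizes below are hypothetical:

```python
def num_blocks(cache_size_bytes, block_size_bytes):
    """A fixed-size cache holds fewer, larger blocks as block size grows."""
    return cache_size_bytes // block_size_bytes

# A hypothetical 32 KB cache:
print(num_blocks(32 * 1024, 32))   # 1024 blocks
print(num_blocks(32 * 1024, 128))  # 256 blocks -> more conflict misses
```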


Basic Cache Optimizations

2. Larger caches
- Reduces the miss rate
- Hit time increases, since a bigger memory must be looked up
- Higher power and area consumption, and higher cost

3. Higher associativity
- In practice, an 8-way set-associative cache is as effective as a fully associative cache
- 2:1 cache rule of thumb: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2
- Reduces the miss rate
- Increases the hit time


Basic Cache Optimizations

4. Multilevel caches
- Very fast, small level-1 (L1) cache
- Fast, not-so-small level-2 (L2) cache
- May also have a slower, large L3 cache, etc.
- Why does this help? A miss in the L1 cache can hit in the L2 cache, and so on

AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
MissPenalty_L1 = HitTime_L2 + MissRate_L2 × MissPenalty_L2
MissPenalty_L2 = HitTime_L3 + MissRate_L3 × MissPenalty_L3
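The recursion above can be evaluated from the last level back to L1; the per-level numbers below are made up for illustration:

```python
def multilevel_amat(levels, memory_latency):
    """levels: list of (hit_time, miss_rate) pairs from L1 downward; the miss
    penalty of the last cache level is the main-memory latency."""
    penalty = memory_latency
    # Fold from the last cache level up to L1.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Hypothetical: L1 (1 cc, 5% misses), L2 (10 cc, 20% misses), memory 100 cc
print(multilevel_amat([(1, 0.05), (10, 0.20)], 100))  # 1 + 0.05*(10 + 0.2*100) -> 2.5
```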


Basic Cache Optimizations

5. Give priority to read misses over writes
- Reduces the miss penalty
- Mostly used in write-through schemes that use write buffers
- A write buffer can cause RAW hazards, since it may hold updated data that is needed on a read miss
- In this scheme, check the write buffer for the value on a read miss
- If there is no conflict and the memory system is available, the read completes earlier than the writes pending in the write buffer


Basic Cache Optimizations

6. Avoiding address translation during cache indexing
- Will be discussed later, once we reach virtual memory


Advanced Cache Optimizations


Advanced Optimizations for Caches

Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches

Increasing cache bandwidth (for superscalars)
4. Pipelined caches
5. Multibanked caches
6. Non-blocking caches

Reducing miss penalty
7. Critical word first
8. Merging write buffers

Reducing miss rate
9. Victim cache
10. Hardware prefetching
11. Compiler prefetching
12. Compiler optimizations

1. Small and Simple Caches

1st idea: keep it small
- Translating the index portion and comparing the tag portion is time-consuming
- Keep L2 small enough to fit on chip, avoiding the off-chip time penalty

2nd idea: keep it simple
- A direct-mapped cache has a lower hit time than a set-associative one
- Example: overall on-chip cache size has grown, but L1 size stays constant; three generations of AMD processors have the same L1 cache size


1. Small and Simple Caches

[Figure: access time vs. cache size; relative access times grow from 1.00x to 1.32x, 1.39x, and 1.43x as the cache gets larger]


1. Small and Simple Caches: Example

A 4-way L1 cache is 1.1x slower than a 2-way L1 cache
- Miss rate (2-way) = 0.049
- Miss rate (4-way) = 0.044
- Hit time = 1 cc (the cache is on the critical path of the processor)
- Miss penalty to L2 = 10 cc
Which one is faster?

Solution:
- AMAT(2-way) = hit time + miss rate × miss penalty = 1 + 0.049 × 10 = 1.49
- For the 4-way cache, the hit time is 1.1x longer = 1.1 cc
- The miss penalty stays the same in absolute time, since it depends on L2 speed and not on the processor; assume it is 9 cc of the longer clock
- AMAT(4-way) = 1.1 + 0.044 × 9 ≈ 1.50
- The 2-way L1 cache is better. Moreover, if the processor clock really is stretched by 1.1x, performance worsens further, since the whole system runs slower even when it is not accessing the cache.
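The example's arithmetic can be replayed directly:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

amat_2way = amat(1.0, 0.049, 10)  # 1 + 0.49  = 1.49
amat_4way = amat(1.1, 0.044, 9)   # 1.1 + 0.396 ~ 1.50
print(round(amat_2way, 2), round(amat_4way, 2))  # 1.49 1.5
```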

2. Way Prediction

- Used in N-way set-associative caches
- Extra bits are kept to predict which of the N ways holds the requested block
- The MUX is set early to select the predicted block
- The extra bits are called block-predictor bits
- Could be as fast as a direct-mapped cache when the prediction is correct
- On a mispredict, the other blocks are checked for matches in the next clock cycle
- Simulations suggest >85% accuracy for a 2-way set
- A good match for speculative processors
- Used in the Pentium 4
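A toy software model of this lookup flow; the data structures, the 1-vs-2 cycle hit latencies, and the retrain-on-mispredict policy are illustrative assumptions, not the actual hardware design:

```python
def way_predicted_lookup(sets, predictor, index, tag):
    """Check the predicted way first (fast hit); on a mispredict, search the
    remaining ways in the next cycle (slow hit) and retrain the predictor.
    Returns (hit, cycles)."""
    ways = sets[index]
    guess = predictor[index]
    if ways[guess] == tag:
        return True, 1                 # fast hit: as fast as direct-mapped
    for w, stored in enumerate(ways):  # mispredict: check other ways next cc
        if w != guess and stored == tag:
            predictor[index] = w       # retrain the block-predictor bits
            return True, 2
    return False, 2                    # genuine miss

# One set of a 2-way cache holding tags 0xA and 0xB; predictor guesses way 0.
sets = {0: [0xA, 0xB]}
predictor = {0: 0}
print(way_predicted_lookup(sets, predictor, 0, 0xA))  # (True, 1)
print(way_predicted_lookup(sets, predictor, 0, 0xB))  # (True, 2), retrains to way 1
print(way_predicted_lookup(sets, predictor, 0, 0xB))  # (True, 1) after retraining
```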


3. Trace Caches

- Already discussed in the Pentium 4 case study
- Used for the I-cache only
- Holds a dynamic trace of the instructions to be executed
- Can work beyond branches
- Uses temporal locality instead of the traditional spatial locality of I-caches


3. Trace Caches

[Figure: a trace cache supplying multiple I-cache lines per cycle vs. a traditional I-cache]

- Can suffer from instruction duplication: the same instruction can be part of different traces (e.g. ABC, DEA, XAB), which lowers space efficiency
- Expensive in area, power, and complexity
- Was a one-time innovation


4. Pipelined Caches

- Pipeline the cache access
- The effective latency of an L1 cache hit can then be multiple clock cycles rather than 1 cc
- Faster clock-cycle time and higher bandwidth
- But a slower hit time
- Essential for L1 caches at high frequency: even small L1 caches take 2-3 cc at GHz clock rates
- Example:
  - Pentium L1 cache: hit time = 1 cc
  - Pentium M L1 cache: hit time = 2 cc
  - Pentium 4 L1 cache: hit time = 4 cc

5. Non-Blocking Caches

Memory stall cycles = cache misses × (miss latency_total − miss latency_overlapped)
- Idea: overlap the miss latency with useful work (also called latency hiding)
- A blocking cache services one access at a time: while a miss is serviced, other accesses are blocked (must wait)
  - Remember Tomasulo's example with the loop: the first LD had a cache miss, so the 2nd LD had to wait for the 1st to complete
- Non-blocking caches remove this limitation: while a miss is serviced, the cache can process other requests (the "hit under miss" optimization)
- Hit under 1 miss: allow cache hits while one miss is in progress, but another miss has to wait
- Miss under miss, hit under multiple misses: allow hits and misses while other misses are in progress; the memory system must then allow multiple pending requests

5. Non-Blocking Caches

[Figure: memory stall time with the hit-under-miss configurations, for FP programs and integer programs]

5. Non-Blocking Caches: Example

Which is more important for FP programs?
1. 2-way set associativity
2. Hit under one miss (with a direct-mapped (DM) cache)
- Avg. miss rate = 11.4% (direct-mapped), 10.7% (2-way)
The same question for integer programs:
- Avg. miss rate = 7.4% (direct-mapped), 6% (2-way)
Assume avg. memory stall time = miss rate × miss penalty
Miss penalty to L2 = 16 cc


Solution:

For FP programs:
- Avg. memory stall time (direct-mapped, DM) = 11.4% × 16 = 1.82
- Avg. memory stall time (2-way set associative) = 10.7% × 16 = 1.71
- Memory stalls (2-way) relative to DM = 1.71/1.82 ≈ 94%
- Memory stall of hit under one miss = 73% of DM (from the previous graph)
- Hence DM with hit under one miss is better than 2-way set associative

For integer programs:
- Avg. memory stall time (DM) = 7.4% × 16 = 1.18
- Avg. memory stall time (2-way) = 6% × 16 = 0.96
- Memory stalls (2-way) relative to DM = 0.96/1.18 = 81%
- Memory stall of hit under one miss = 81% of DM (from the previous graph)
- Hence DM with hit under one miss and 2-way set associative give equal performance
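Replaying the arithmetic, computing the ratios exactly rather than from rounded intermediates:

```python
miss_penalty = 16  # cc, miss penalty to L2

# FP programs
fp_dm   = 0.114 * miss_penalty   # ~1.82 cc avg stall, direct-mapped
fp_2way = 0.107 * miss_penalty   # ~1.71 cc avg stall, 2-way
print(round(fp_2way / fp_dm, 2))    # 0.94: 2-way keeps ~94% of DM stalls
# Hit under one miss keeps only 73% of DM stalls (from the graph) -> it wins

# Integer programs
int_dm   = 0.074 * miss_penalty  # ~1.18 cc
int_2way = 0.060 * miss_penalty  # 0.96 cc
print(round(int_2way / int_dm, 2))  # 0.81: ties hit under one miss (81%)
```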


6. Multi-Banked Cache

- Divide the cache into independent banks that support simultaneous accesses
- Cache bandwidth is increased
- AMD Opteron L2 = 2 banks; Sun Niagara L2 = 4 banks
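Sequential interleaving, where consecutive block addresses map to consecutive banks, is a common mapping (an assumption here, since the slide does not specify one); a one-line sketch:

```python
def bank_of(block_address, num_banks):
    """Sequential interleaving: consecutive blocks land in consecutive banks,
    so streaming accesses can proceed in all banks in parallel."""
    return block_address % num_banks

# With 4 banks (as in the Sun Niagara L2), blocks 0..7 spread as:
print([bank_of(b, 4) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```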


7. Critical Word First and Early Restart

Critical word first:
- Request the missed word first from memory
- Send it to the processor as soon as possible and let the processor continue working
- Keep filling the rest of the block into the cache

Early restart:
- Fetch the words in order
- As soon as the requested word arrives, send it to the processor
- Keep fetching the rest of the block

Both techniques pay off only if the block size is large
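The critical-word-first transfer order can be sketched as a wrap-around sequence starting at the requested word:

```python
def fetch_order(block_words, critical_index):
    """Critical word first: transfer the requested word first, then wrap
    around through the rest of the block."""
    n = len(block_words)
    return [block_words[(critical_index + i) % n] for i in range(n)]

# An 8-word block where the processor missed on word 5:
print(fetch_order(list(range(8)), 5))  # [5, 6, 7, 0, 1, 2, 3, 4]
```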


8. Merging Write Buffer

- If multiple write misses occur to the same block, combine them in the write buffer
- Use one block write instead of many small word writes
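A toy model of the merging logic; the entry format and the 4-word block size are illustrative assumptions:

```python
def buffer_write(write_buffer, word_address, value, block_size_words=4):
    """Merge a word write into an existing write-buffer entry for the same
    block if one exists; otherwise allocate a new entry."""
    block = word_address // block_size_words
    offset = word_address % block_size_words
    entry = write_buffer.setdefault(block, [None] * block_size_words)
    entry[offset] = value  # one block write later covers all merged words

buf = {}
for addr, val in [(100, 'a'), (101, 'b'), (103, 'c'), (200, 'x')]:
    buffer_write(buf, addr, val)
print(len(buf))  # 2 entries (blocks 25 and 50) instead of 4 separate writes
print(buf[25])   # ['a', 'b', None, 'c']
```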


9. Victim Caches

- Recently evicted blocks are kept in a small cache
- If we miss on those blocks, we can get them back faster than fetching them from main memory
- Effective against conflict misses
- A victim cache prevents thrashing, when several popular blocks want to go to the same entry
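A toy simulation of the thrashing case; the cache sizes, the FIFO victim replacement, and the string return values are illustrative assumptions:

```python
from collections import OrderedDict

def access(cache, victim, block, victim_entries=4):
    """Direct-mapped main cache backed by a small fully associative victim
    cache. Returns 'hit', 'victim hit', or 'miss'."""
    index = block % len(cache)
    if cache[index] == block:
        return 'hit'
    result = 'victim hit' if victim.pop(block, None) else 'miss'
    evicted, cache[index] = cache[index], block  # install the new block
    if evicted is not None:                      # keep the kicked-out block
        victim[evicted] = True
        if len(victim) > victim_entries:
            victim.popitem(last=False)           # drop the oldest victim
    return result

# Blocks 0 and 4 thrash the same entry of a 4-entry direct-mapped cache:
cache, victim = [None] * 4, OrderedDict()
print([access(cache, victim, b) for b in [0, 4, 0, 4]])
# ['miss', 'miss', 'victim hit', 'victim hit'] -- no memory trips after warm-up
```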


10. Hardware Prefetching

- Predict the future needs of the processor and bring data into the cache ahead of time
- If the access does happen, we have a hit
- If the access does not happen, we get cache pollution: useful data replaced with junk
- To avoid pollution, use prefetch buffers
  - Pollution is a big problem for small caches
  - Keep a small separate buffer for prefetched data
  - When we do access it, move the data into the cache
  - If we don't access it, the cache is not polluted
- Prefetching relies on using otherwise unused memory bandwidth; if bandwidth is already saturated, prefetching can degrade performance
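A sketch of one concrete policy, a next-line prefetcher feeding a separate prefetch buffer; the next-line choice, the unbounded structures, and the promotion-on-use rule are simplifying assumptions:

```python
def access(cache, prefetch_buffer, block, next_level_fetches):
    """Next-line prefetcher: on a miss (or a prefetch-buffer hit) fetch
    block+1 into the buffer, not the cache, so wrong guesses never pollute
    the cache; a block is promoted into the cache only when really used."""
    if block in cache:
        return 'hit'
    if block in prefetch_buffer:
        prefetch_buffer.discard(block)
        cache.add(block)                # promoted only on real use
        prefetch_buffer.add(block + 1)  # keep the stream going
        next_level_fetches.append(block + 1)
        return 'prefetch hit'
    cache.add(block)
    prefetch_buffer.add(block + 1)      # speculatively grab the next block
    next_level_fetches.extend([block, block + 1])
    return 'miss'

# A sequential scan: after the first miss, every access hits the buffer.
cache, buf, fetches = set(), set(), []
print([access(cache, buf, b, fetches) for b in [0, 1, 2, 3]])
# ['miss', 'prefetch hit', 'prefetch hit', 'prefetch hit']
```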

