
Practical Cache Performance Modeling for Computer Architects

Yan Solihin, NCSU, solihin@ncsu.edu
Fei Guo, NCSU, fguo@ncsu.edu
Thomas Puzak, IBM, trpuzak@us.ibm.com
Phil Emma, IBM, pemma@us.ibm.com

Computer Performance Modeling

- Goal: estimating and understanding the performance of computer systems
- Low-Level Models
  - Various levels of detail: functional, trace-driven, cycle-accurate, etc.
  - Pros
    - Versatile: adaptable to new architectures and workloads
    - Detailed: can embed very detailed statistics
    - Popular: SimpleScalar and SIMICS are widely used
  - Cons
    - O(n) overhead that scales with workload size: each instruction/event is simulated to capture its effect
    - Slow: a realistic workload has many billions of instructions

Computer Performance Modeling

- High-Level Models
  - Purpose: evaluating gross trade-offs of designs
  - Pros
    - Short execution time, sometimes O(1)
    - Requires little coding
    - Reveals basic relationships of variables
    - May reveal non-obvious trends and insights
  - Cons
    - Less versatile
    - Requires performance-modeling expertise
  - Uses:
    - Early design cycle: for pruning the design search space
    - Entire design cycle: to re-evaluate the design search space if requirements change

Modeling Methods

- White Box
  - The model incorporates knowledge about the system (e.g., relationships among parameters are known a priori)
  - Analytical or heuristics-based
  - Pros: models reveal insights and explain "why"; no training required
  - Cons: problem-specific solutions
- Black Box
  - The model learns knowledge about the system
  - AI-based: neural networks, decision trees, curve fitting, etc.
  - Pros: can model complex problems/systems
  - Cons: prediction without insight; requires training
Focus of this tutorial

- Types:
  - High level, hybrid high/low level
- A priori knowledge:
  - White box
- Scope:
  - Miss count and rate
  - Miss cost
  - Bandwidth usage

Program

- 8:30 – 8:45: Introduction
- 8:45 – 9:00: Capturing temporal locality behavior
- 9:00 – 9:30: Modeling cache sharing
- 9:30 – 10:00: Modeling cache replacement policy
- 10:00 – 10:30: Coffee break
- 10:30 – 11:30: Analysis of the effects of miss clustering on the cost of a cache miss
- 11:30 – 12:30: Interaction of caching and bandwidth pressure
Capturing Temporal Locality Behavior

Temporal Locality Behavior

- Programs exhibit locality of reference
- Spatial locality: the neighbors of recently-accessed data tend to be accessed in the near future
- Temporal locality: recently-accessed data tends to be accessed again (reused) in the near future
- Significance: temporal reuse and cache parameters determine all non-cold misses
  - If each memory block is accessed exactly once, we only have cold misses
  - Cold misses are affected by block size
- How can we capture temporal locality?
Stack Distance Profiling [Mattson'70]

- An early attempt to capture temporal reuse behavior
- Models an LRU stack with a counter for each stack position
- Example: fully-associative cache with an 8-entry stack
  - C1: incremented whenever the MRU block is accessed
  - C2: incremented whenever the 2nd MRU block is accessed
  - C3: incremented whenever the 3rd MRU block is accessed
  - ...
  - C8: incremented whenever the 8th MRU (i.e., LRU) block is accessed
  - C>8: incremented whenever the 9th, 10th, ... block is accessed, i.e., on a miss

Typical Shape

- Empirical observation: the counters follow a geometric (exponential) sequence
- This is due to temporal locality
- C_{i+1} = C_i × r, where 0 < r < 1 is the common ratio

[Figure: histogram of the percent of accesses falling in each stack distance counter C1 ... C8 and C>8, decaying roughly geometrically from about 30% at C1.]
Stack Distance Properties

- For a fully-associative LRU cache with A blocks, the number of misses of the cache is

      Misses = Σ_{i=A+1..∞} C_i

- For an A-way set-associative LRU cache, we can collect a set-specific stack distance profile, and the number of misses of each set is

      Misses_set = Σ_{i=A+1..∞} C_i(set)

- Alternatively, keep per-set stacks but use a global set of counters

Where to Profile

- Profiling the access stream seen by the L1 instruction and data caches captures temporal reuse patterns at the L1 level ⇒ predicts cache misses for various L1 cache configurations
- Profiling the access stream seen by the L2 cache captures temporal reuse patterns at the L2 level ⇒ predicts cache misses for various L2 cache configurations

[Figure: memory hierarchy with L1 instruction and data caches, an L2 cache, and memory, marking the two profiling points.]
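The sketch below illustrates the two ideas above: collecting a stack distance profile for a fully-associative LRU cache and then predicting the miss count for any associativity from the same profile. It is a minimal illustration, not the tutorial's tool; the function names are ours, and cold (first-touch) accesses are kept in a separate bucket so they can be counted as misses for every associativity.

```python
# Minimal sketch of stack distance profiling for a fully-associative LRU cache.
from collections import defaultdict

def stack_distance_profile(trace):
    """trace: iterable of block addresses. Returns {depth: count}, where depth is
    the LRU-stack position (1 = MRU) of each access, or None for cold accesses."""
    stack = []                      # LRU stack, index 0 = MRU
    counters = defaultdict(int)
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1   # 1-based stack distance
            counters[depth] += 1
            stack.remove(block)
        else:
            counters[None] += 1              # cold (first) access
        stack.insert(0, block)               # accessed block becomes MRU
    return counters

def predict_misses(counters, assoc):
    """Misses of an `assoc`-block fully-associative LRU cache:
    all accesses with stack distance > assoc, plus cold accesses."""
    return sum(c for d, c in counters.items() if d is None or d > assoc)

# One profiling pass predicts misses for any associativity.
trace = "ABCBADABEA"
prof = stack_distance_profile(trace)
print([predict_misses(prof, A) for A in (1, 2, 4, 8)])
```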
Limitations of Stack Distance Profile

- Useful only for predicting cache misses across different cache associativities
- For other purposes, we need to capture temporal reuse patterns in greater detail
- So, use circular sequence profiling [Chandra'05]
  - Extends stack distance profiling
  - Counts the occurrences of cseq(d, n)

Definitions

- seq(d, n) = a sequence of n accesses to d distinct addresses (in a cache set)
- cseq(d, n) (circular sequence) = a sequence in which the first and the last accesses are to the same address
- Example: in the access stream A B C D A E E B, the entire stream is seq(5,8); A B C D A is cseq(4,5); B C D A E E B is cseq(5,7); E E is cseq(1,2)
Relationship with Stack Distance Profile

- C_x = the number of circular sequences cseq(d = x, n = any value), i.e.,

      C_x = Σ_{n=x..∞} cseq(x, n)

- Hence, the stack distance profile is a subset of (can be derived from) the circular sequence profile

Collecting Circular Sequence Profile

[Figure: per-set collection mechanism. Each cache set keeps an LRU stack (here 4 entries) with an access counter per stack position and a table of cseq(d, n) counters, updated as the set's access stream A B C B A is processed.]
[Figure sequence: collecting the circular sequence profile for the access stream A B C B A in one set of a 4-way cache. Every access to the set increments the counter of each block already on the stack; the accessed block is then placed at the MRU position with its counter reset to 1. When the accessed block is already on the stack at depth d with counter value n, a circular sequence cseq(d, n) has been found and the (d, n) table entry is incremented. For this stream, the second access to B finds cseq(2, 3) and the second access to A finds cseq(3, 5), so the final profile contains one count each for cseq(2,3) and cseq(3,5).]
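A compact sketch of this collection mechanism, matching the running example above, is shown below. Function names are ours, not ACAPP's, and it handles a single cache set; a real profiler would keep one such structure per set.

```python
# Minimal sketch of collecting a circular-sequence profile for one cache set.
from collections import defaultdict

def cseq_profile(set_trace):
    """set_trace: block addresses mapping to one set.
    Returns {(d, n): count} of circular sequences cseq(d, n)."""
    stack = []                      # [(block, accesses_since_last_use)], MRU first
    profile = defaultdict(int)
    for block in set_trace:
        # Every access to the set ages all blocks currently on the stack.
        stack = [(b, c + 1) for b, c in stack]
        pos = next((i for i, (b, _) in enumerate(stack) if b == block), None)
        if pos is not None:
            d = pos + 1             # distinct addresses in the circular sequence
            n = stack[pos][1]       # accesses in the reuse window, incl. both ends
            profile[(d, n)] += 1    # found cseq(d, n)
            del stack[pos]
        stack.insert(0, (block, 1)) # block becomes MRU, its counter restarts
    return profile

# The running example from the slides: A B C B A yields cseq(2,3) and cseq(3,5).
print(dict(cseq_profile("ABCBA")))   # {(2, 3): 1, (3, 5): 1}
```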

Predicting Contention Across Cache
Shared Cache Challenge

- In today's CMPs, the L2 cache is shared by multiple cores
- Applications on different cores compete for L2 cache space

[Figure: a CMP with two processors, each with a private L1 cache, sharing one L2 cache.]

Impact of Cache Space Contention

[Figure: mcf's L2 cache misses (up to roughly 400% of the alone case) and mcf's normalized IPC when co-scheduled with mst, gzip, art, or swim, compared to running alone.]

- The impact is application-specific and coschedule-specific
- The impact is significant: up to 4X more cache misses and a 65% IPC reduction
- How do we model the impact of cache sharing?
Modeling Goal

- Given n applications, predict the miss rates of any pair of applications
- Input:
  - Behavior of each application
  - Cache parameters
  - Relative speed when the pair runs together
- Output:
  - Number of cache misses for each application in the pair

Assumptions

- LRU replacement algorithm
- Applications share nothing
  - Mostly true for sequential apps (except for library and OS code)
- Applications are not similar
  - Parallel apps: threads likely show uniform behavior, so predicting their miss rates is trivial
Circular Sequence Properties

- Thread X runs alone in the system:
  - Given a circular sequence cseqX(dX, nX), its last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y:
  - If a sequence of intervening accesses seqY(dY, nY) occurs during cseqX(dX, nX)'s lifetime, the last access of thread X is a miss iff dX + dY > Assoc

Example

- Assume a 4-way associative cache
- X's circular sequence is cseqX(2,3) = A B A; Y's intervening access sequence during its lifetime is U V V W
- Without cache sharing: the reuse of A is a cache hit (dX = 2 ≤ 4)
- With cache sharing: is the reuse of A a hit or a miss?
Example (continued)

- Assume a 4-way associative cache, with cseqX(2,3) = A B A and Y's intervening accesses U V V W
- Interleaving A U B V V A W: only seqY(2,3) intervenes in cseqX's lifetime, so dX + dY = 2 + 2 ≤ 4 ⇒ the reuse of A is a cache hit
- Interleaving A U B V V W A: seqY(3,4) intervenes in cseqX's lifetime, so dX + dY = 2 + 3 > 4 ⇒ the reuse of A is a cache miss

Inductive Probability Model

- Define Pmiss(cseqX) = the probability that the last access of cseqX is a cache miss
- For each cseqX(dX, nX) of thread X:
  - Compute the number of intervening accesses from thread Y during cseqX's lifetime ⇒ denote it nY
  - dY can be 1, 2, ..., nY ⇒ compute the probability of each dY, denoted P(seq(dY, nY))
  - For each dY = 1, 2, ..., nY:
    - If dY + dX > Assoc, add P(seq(dY, nY)) to Pmiss(cseqX)
    - If dY + dX ≤ Assoc, Pmiss(cseqX) is unchanged
  - Misses = old_misses + Σ Pmiss(cseqX) × F(cseqX), where F(cseqX) is the number of occurrences of cseqX in the profile
Computing P(seq(dY, nY))

- Basic idea: a seq(d, n) is formed from a shorter sequence by one more access, which is either to a new address (extending seq(d-1, n-1)) or to an already-seen address, i.e., forming a circular sequence with one of the 1..d distinct addresses (extending seq(d, n-1))
- This is a Markov process with 3 states and 2 edges:

      P(seq(d, n)) = A × P(seq(d-1, n-1)) + B × P(seq(d, n-1)), where
      B = (Σ_{i=1..d} C_i) / (Σ_{i=1..∞} C_i)  and  A = 1 − B

Overall Formula

- Define

      P(d−) = (Σ_{i=1..d} C_i) / (Σ_{i=1..∞} C_i)  and  P(d+) = 1 − P(d−)

- P(seq(d, n)) is then computed by:

      P(seq(d, n)) =
        1                                                        if d = n = 1
        P((d−1)+) × P(seq(d−1, n−1))                             if d = n > 1
        P(1−) × P(seq(1, n−1))                                   if n > d = 1
        P(d−) × P(seq(d, n−1)) + P((d−1)+) × P(seq(d−1, n−1))    if n > d > 1
Example

- Expanding the recursion for seq(2,3):

      P(seq(2,3)) = P(1+) × P(seq(1,2)) + P(2−) × P(seq(2,2))
      P(seq(1,2)) = P(1−) × P(seq(1,1)) = P(1−)
      P(seq(2,2)) = P(1+) × P(seq(1,1)) = P(1+)

  so P(seq(2,3)) = P(1+) × P(1−) + P(2−) × P(1+)

Final prediction

- After we obtain Pmiss(cseqX(dX, nX)) for all cseqX(dX, nX),
- Predict the total misses for thread X:

      missX = oldmissX + Σ_{dX=1..A} Pmiss(cseqX(dX, nX)) × C_{dX}
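The sketch below pulls the last few slides together: the P(seq(d, n)) recursion driven by thread Y's stack distance counters, and the expected number of extra misses thread X suffers from sharing. It is a simplified illustration under stated assumptions (nY, the number of intervening Y accesses during a cseqX, is supplied by a caller-provided function, e.g., derived from the relative speeds of the two threads); names such as `make_p_seq` are ours.

```python
# Sketch of the inductive probability model for cache sharing.
from functools import lru_cache

def make_p_seq(counters_Y):
    """counters_Y[d-1] = C_d of thread Y (its stack distance profile).
    Returns a function P(seq(d, n)) implementing the recursion above."""
    total = float(sum(counters_Y))
    def p_minus(d):                          # P(d-) = sum_{i<=d} C_i / sum_i C_i
        return sum(counters_Y[:d]) / total
    def p_plus(d):                           # P(d+) = 1 - P(d-)
        return 1.0 - p_minus(d)

    @lru_cache(maxsize=None)
    def p_seq(d, n):
        if d == n == 1:
            return 1.0
        if d == n:                           # every access was to a new address
            return p_plus(d - 1) * p_seq(d - 1, n - 1)
        if d == 1:                           # n > d = 1
            return p_minus(1) * p_seq(1, n - 1)
        return (p_minus(d) * p_seq(d, n - 1) +          # n > d > 1
                p_plus(d - 1) * p_seq(d - 1, n - 1))
    return p_seq

def extra_misses_X(cseq_profile_X, n_Y_of, assoc, p_seq_Y):
    """cseq_profile_X: {(dX, nX): frequency}. n_Y_of(nX): assumed number of
    intervening Y accesses during a cseqX of length nX. Returns the expected
    number of additional misses thread X suffers because of sharing."""
    extra = 0.0
    for (dX, nX), freq in cseq_profile_X.items():
        if dX > assoc:                       # already a miss when running alone
            continue
        nY = n_Y_of(nX)
        p_miss = sum(p_seq_Y(dY, nY) for dY in range(1, nY + 1)
                     if dX + dY > assoc)
        extra += p_miss * freq
    return extra
```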

Observations

- Applications differ in how vulnerable they are to the impact of cache sharing:
  - Some are highly vulnerable
  - Some are not vulnerable
  - Many are somewhat / sometimes vulnerable
- Insights:
  - Traditional characterizations are not indicative of the impact of sharing:
    - Low vs. high IPC
    - Integer vs. floating-point
    - High miss rate vs. low miss rate
  - Rather, the interaction of temporal reuse behaviors determines the impact of cache sharing

Modeling Replacement Policy Performance
Motivation

- Cache design is critical to performance
  - Memory wall: a cache miss costs hundreds of processor cycles
  - Capacity pressure: multi-core designs, virtual machines
- Important parameters: size, associativity, block size, and replacement policy

Motivation

- Performance variation due to the replacement policy is significant

[Figure: L2 miss rates and normalized execution times of art, ammp, and cg under LRU versus Rand-MRUskw; the skewed-random policy cuts the miss rate substantially (e.g., to 67% and 47% of the LRU miss rate) and reduces execution time by roughly 13-32%.]

- There is no agreement on the "best implementation"
  - Intel Pentium: LRU
  - Intel XScale: FIFO
  - IBM Power4: tree-based pseudo-LRU
  - Others: round robin, random, replacement hints, etc.
Motivation

- There is no analytical model; past models assume
  - LRU [Cascaval03, Chandra05, Ghosh97, Quong94, Sen02, Singh92, Suh01]
  - or Random [Agarwal89, Berg04, Ladner99]
- Assuming LRU or Random simplifies modeling, but
  - ignores the performance variation due to the replacement policy
  - is inaccurate for highly associative caches

Would be useful to model replacement policies

[Figure: the circular sequence profiles of App 1 ... App N, together with each replacement policy's Replacement Probability Function (RPF) for RP 1 ... RP M, feed a prediction model that outputs the predicted miss rate of each app under each replacement policy.]
Outline

- Input of the Model
  - Replacement Probability Function (RPF)
  - Circular Sequence Profiles
- Replacement Policy Model
  - Markov states
  - Markov state transitions
- Case Study / Validation
- Conclusions
Replacement Probability Function (RPF)

- The RPF, denoted Prepl(.), is a probability function where Prepl(i) is the probability that the cache block at the ith stack position is replaced on a cache miss

[Figure: Prepl(i) over stack positions 1-8 of an 8-way associative cache for LRU, NMRU1, NMRU4, Rand-MRUskw, and Rand-LRUskw; LRU places all probability on position 8, while the other policies spread it across positions.]

- The stack is only needed for modeling, not necessarily in the hardware implementation
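As a hedged illustration of what an RPF looks like as model input, the snippet below writes a few policies as probability vectors over the 8 stack positions. True LRU and uniform random are exact; "NMRU1" is written under the interpretation (our assumption) that it protects only the MRU block and picks uniformly among the rest, and the skewed-random shapes from the plots above are omitted because their exact values are not given here.

```python
# RPFs of a few policies as vectors for an 8-way cache (illustrative sketch).
ASSOC = 8
RPF = {
    "LRU":    [0.0] * 7 + [1.0],                   # always evict the LRU position
    "Random": [1.0 / ASSOC] * ASSOC,               # every position equally likely
    "NMRU1":  [0.0] + [1.0 / (ASSOC - 1)] * (ASSOC - 1),  # assumed: protect MRU only
}
assert all(abs(sum(p) - 1.0) < 1e-9 for p in RPF.values())
```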

Tracking Cache Miss Probability

- Basic idea (A = target block):
  - Reconstruct each circular sequence by adding one access at a time
  - Meanwhile, track whether the target block has been replaced
- Markov state = (d, n, p), where
  - d = number of distinct addresses yet to appear
  - n = number of accesses yet to appear
  - p = current stack position of the target block

Illustration

- Running example: cseq(4,5) = A B C D A, where the final access to A is the target access
- Assume
  - a 4-way associative cache ⇒ a 4-entry stack
  - an NMRU-2 replacement policy ⇒ Prepl(3) = Prepl(4) = 0.5
- Goal
  - Compute the probability that the target access misses
[Figure sequence: walking cseq(4,5) = A B C D A through the 4-entry LRU stack.
Initial state: (d=4, n=5, p=∞). After the first access to A: (d=3, n=4, p=1).
After B: (d=2, n=3, p=2), stack B A. After C: (d=1, n=2, p=3), stack C B A.
The access to D is a miss; under NMRU-2 it replaces A (stack position 3) with probability 1/2, reaching the final state (d=0, n=1, p=∞) in which the target access to A is a cache miss.
With the remaining probability 1/2, A survives at stack position 4 and the target access hits.
So the probability that cseq(4,5)'s target access is a cache miss is 0.5.]

Modeling Overview

- Track the current state and the transition probabilities into new states
- Final states:
  - The target block has been replaced ⇒ cache miss
  - p > cache associativity ⇒ cache miss
  - The end of the circular sequence is reached
- Accumulate the probabilities of a cache miss
State Transitions

- The new state depends on 8 events, built from four attributes of the next access:
  - Dist / NoDist
    - Dist: the new access is to a "distinct" address (not seen before in this circular sequence)
  - Miss / Hit
    - Miss: the new access is a cache miss
  - Rp / NoRp
    - Rp: the new access causes the target block to be replaced
  - Shift / NoShift
    - Shift: the new access causes the target block to be shifted down in the LRU stack
- Notes:
  - PDist, PRp, and PShift are directly computable (see [SIGMETRICS'06])
  - PShift depends on the RPF
  - PMiss is the object of the prediction

State Transitions Diagram

[Figure: from state (d, n, p), the 8 events lead to:
1: Dist, Miss, NoRp, NoShift → (d-1, n-1, p)
2: Dist, Miss, NoRp, Shift → (d-1, n-1, p+1)
3: Dist, Miss, Rp → end of state (target replaced)
4: NoDist, Miss, NoRp, NoShift → (d, n-1, p)
5: NoDist, Miss, NoRp, Shift → (d, n-1, p+1)
6: NoDist, Miss, Rp → end of state (target replaced)
7: Dist, Hit → (d-1, n-1, p+1)
8: NoDist, Hit → (d, n-1, p)]
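The full state-transition model depends on probabilities (PDist, PHit, ...) whose derivation is in [SIGMETRICS'06] rather than in these slides, so the sketch below specializes it to the situation used in the illustration: a cseq(d, n) whose d-1 intervening accesses are all to distinct addresses and all miss. It tracks the probability distribution of the target block's stack position as each miss picks a victim according to the RPF. Names are ours; this reproduces the 0.5 result of the NMRU-2 example.

```python
# Simplified sketch of the replacement-policy Markov model (all-distinct,
# all-miss intervening accesses). rpf[i-1] = Prepl(i); associativity = len(rpf).
def p_miss_cseq(d, rpf):
    """Probability that the target reuse of cseq(d, *) misses."""
    dist = {1: 1.0}          # distribution over the target block's stack position
    p_evicted = 0.0
    for _ in range(d - 1):   # one cache miss per intervening distinct address
        new_dist = {}
        for pos, prob in dist.items():
            p_evicted += prob * rpf[pos - 1]     # victim is the target itself
            p_shift = sum(rpf[pos:])             # victim below -> target pushed down
            p_stay = sum(rpf[:pos - 1])          # victim above -> target stays put
            if p_shift:
                new_dist[pos + 1] = new_dist.get(pos + 1, 0.0) + prob * p_shift
            if p_stay:
                new_dist[pos] = new_dist.get(pos, 0.0) + prob * p_stay
        dist = new_dist
    return p_evicted

# NMRU-2 illustration: cseq(4,5) on a 4-way cache with Prepl(3) = Prepl(4) = 0.5
print(p_miss_cseq(4, [0.0, 0.0, 0.5, 0.5]))   # 0.5
print(p_miss_cseq(4, [0.0, 0.0, 0.0, 1.0]))   # LRU: 0.0 (d <= assoc never misses)
```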
Case Study: Using Model Only

- Goal: when is LRU pathological?
  - Hard to pinpoint with simulations because there are many possible contributing factors
  - Isolate the impact of the temporal reuse pattern
- Use synthetic stack distance profiles: unimodal, bimodal, and continuous working sets
- Assume associativity A = 8

Case Study: Unimodal

- Case 1: unimodal working set (a single peak of accesses at one stack position)

[Figure: L2 miss rate versus the peak's stack position (1-31) for LRU, NMRU1, NMRU4, Rand-LRUskw, and Rand-MRUskw on an 8-way cache.]

- LRU exhibits pathological performance once the peak stack distance exceeds the associativity
- Miss rates: Rand-MRUskw < NMRU1 < Rand-LRUskw < NMRU4
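A hedged sketch of how such a synthetic unimodal profile can be built and why LRU is pathological for it: if essentially all reuses occur at a single stack distance, LRU is all-or-nothing (near-0% misses when the peak fits within the associativity, near-100% when it does not), whereas randomized policies degrade gradually. The exact peak shapes used in the study may differ from this simplification.

```python
# Synthetic unimodal stack distance profile and its LRU miss rate (sketch).
def unimodal_profile(peak, depth=32, reuses=10000):
    prof = [0] * depth
    prof[peak - 1] = reuses          # (almost) all reuses at one stack distance
    return prof

def lru_miss_rate(profile, assoc):
    misses = sum(profile[assoc:])    # reuses deeper than the associativity miss
    return misses / float(sum(profile))

print([lru_miss_rate(unimodal_profile(p), assoc=8) for p in (6, 8, 9, 12)])
# -> [0.0, 0.0, 1.0, 1.0]: LRU flips from perfect to pathological at the peak = A boundary
```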

Case Study: Bimodal

- Case 2: bimodal working set (two peaks of accesses, Peak1 and Peak2, at different stack positions)

[Figure: L2 miss rate versus Peak2's stack position (9-16) for LRU, NMRU1, NMRU4, Rand-LRUskw, and Rand-MRUskw, with Peak1 fixed at stack position 1.]

- LRU again exhibits pathological performance
- Miss rates: Rand-MRUskw < NMRU1 < Rand-LRUskw < NMRU4
  - The performance ordering is the same as observed for art, ammp, and cg
  - In fact, those applications have approximately bimodal stack distance profiles

Conclusions

- Two useful modeling tools:
  - Circular sequence profiling
    - For capturing temporal reuse patterns
  - Markov processes
    - For capturing the probabilities of certain cache states
    - A cache miss is an event associated with certain cache states
- Modeling can reveal non-obvious insights:
  - Cache miss rates due to shared cache space contention
    - Not capturable by simple metrics: low vs. high IPC, low vs. high miss rates, integer vs. floating-point
    - Determined by the interaction of the temporal reuse patterns of the co-scheduled applications
  - Choosing a replacement policy
    - LRU has quite a few pathological cases
    - For apps with a working set 1-4X the cache size, other policies outperform LRU
How to Use ACAPP

Introduction

- The Analytical CAche Performance Prediction (ACAPP) tool suite
  - Prediction for different cache associativities
  - Prediction for different cache replacement policies
  - Prediction of cache contention when two threads share the cache
  - Adding new replacement policies with user-specified RPFs
- Input
  - The circular sequence profile of each application
    - Can be generated by any simulator that follows the expected format
    - Extension code for SimpleScalar is provided
- Released
  - Available for download at http://www.ece.ncsu.edu/arpers
Prepare Input Files

[Figure: the ACAPP profiling extension in $HOME/acapp/addOnSimpleScalar/ (acappProfiler.c, acappProfiler.h, cache.c, sim-outorder.c, Makefile) is added to the SimpleScalar tree in $HOME/Simplesim-3.0/.]

Build and run the simulator to generate a circular sequence (.csq) profile:

  # cd $HOME/Simplesim-3.0
  # make
  # ./sim-outorder swim.ref.eio -max:2000000000
  ...<benchmark output>
  # ls swim_train.eio.*
  -rw-r--r-- swim.ref.eio.csq

Example .csq profile contents:

  #sets 1024
  #assoc 4
  #scaling_factor 4
  #block_size 64
  #cseq
  1 2 3 4 5 6 7 8 9 10 11 23 34 29 35 35
  #stackDist
  6877989 4199671 2083871 2653828 2327123
  3051944 5588497 5034757 2996018 794764
  39391 520 387 350 383 465 32532308

Startup

  # ./acapp -h

  ******** ACAPP TOOL HELP MENU ********
  General Usage:
    -h --- HELP MENU
  Prediction under varying cache associativity:
    -a <assoc> [<min assoc> <max assoc>] -f1 <profile1>
  Prediction under varying cache replacement policies:
    -p <rpindex> -f1 <profile1>
    -pA -f1 <profile1>
  Prediction under cache sharing:
    -c -f1 <profile1> -f2 <profile2>
  Adding new replacement policy: (requires 'usr_rp.in')
    -n (default) or
    -n <dx> <nxmin> <nxmax>
  Print supported replacement policies:
    -log
Prediction under varying cache associativity

  acapp -a <assoc> [<min assoc> <max assoc>] -f1 <profile1>

EXAMPLE
  # ./acapp -a 4 7 -f1 ./csq/benchmark1.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64
  Original miss rate: 0.768043
  Miss rate for A = 4: 0.768043
  Miss rate for A = 5: 0.733912
  Miss rate for A = 6: 0.689150
  Miss rate for A = 7: 0.607186

Note: benchmark1.csq represents the L2 profile of swim (ref input set); benchmark2.csq the L2 profile of apsi (ref input set); benchmark3.csq the L2 profile of ammp (ref input set).

Print supported replacement policies

  acapp -log

EXAMPLE
  # ./acapp -log

OUTPUT
  * * * * SUPPORTED REPLACEMENT POLICIES LOGFILE * * * *
  1 - NMRU4
  2 - NMRU1
  3 - LRUskw
  4 - MRUskw

[Figure: Prepl(i) plots over stack positions 1-8 for LRU, NMRU1, NMRU4, Rand-MRUskw, and Rand-LRUskw on an 8-way cache.]
Prediction under varying cache replacement policies

  acapp -p <rpindex> -f1 <profile1>

EXAMPLE
  # ./acapp -p 2 -f1 ./csq/benchmark1.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64

  Prediction Result for NMRU1
  LRU:  0.768043
  Pred: 0.739310

Prediction for all supported replacement policies

  acapp -pA -f1 <profile1>

EXAMPLE
  # ./acapp -pA -f1 ./csq/benchmark3.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 8
  Block size: 64

  1 - NMRU4
  Prediction Result for NMRU4
  LRU:  0.702653
  Pred: 0.406974
  *******************************
  2 - NMRU1
  Prediction Result for NMRU1
  LRU:  0.702653
  Pred: 0.331667
  *******************************
  3 - LRUskw
  Prediction Result for LRUskw
  LRU:  0.702653
  Pred: 0.376921
  *******************************
  4 - MRUskw
  Prediction Result for MRUskw
  LRU:  0.702653
  Pred: 0.201158
  *******************************
Prediction under cache contention

  acapp -c -f1 <profile1> -f2 <profile2>

EXAMPLE
  # ./acapp -c -f1 ./csq/benchmark1.csq -f2 ./csq/benchmark2.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64
  ******** RESULTS ********
                        ./csq/benchmark1.csq   ./csq/benchmark2.csq
  Accesses:             68182266               11805186
  Predicted miss rate:  0.868560               0.338920
  Original miss rate:   0.768043               0.244366

Adding new replacement policy

Requires a 'usr_rp.in' file:

  #This is the configuration file of the user-specified replacement policy.
  #Please do not change the format of this file or the NAME, ASSOC, or PROB keywords.
  #Only the values on each line can be changed by the user.
  NAME newRp
  ASSOC 8
  PROB
  0.2 0.1 0.05 0.25 0.15 0.07 0.03 0.15

  acapp -n (default)

EXAMPLE
  # ./acapp -n

OUTPUT
  Creating new replacement policy...
  Coefficient file(s) added to: ./fine/newRp/d_0
  Coefficient file(s) added to: ./fine/newRp/d_1
  ...
  Coefficient file(s) added to: ./fine/newRp/d_11
  Replacement policy: '5 - newRp' Added Successfully!
Adding new replacement policy (continued)

Running ./acapp -n (the default) generates coefficient files that cover the most common combinations of d and n. If a benchmark's cseq profile contains unusual combinations of d and n (reported by the tool as "Missing Coefficient files"), the user can also generate coefficient files for specific d and n values:

  acapp -n <dx> <nxmin> <nxmax>

EXAMPLE
  # ./acapp -n 5 7 9

OUTPUT
  Adding coefficient file(s) to rp: newRp
  Coefficient file(s) added to directory: ./fine/newRp/d_5
  Coefficient files for Replacement policy: '5 - newRp' Added Successfully!

Acknowledgement

- Researchers
  - Fei Guo
  - Dhruba Chandra
  - Shaunak Joshi
  - Seongbeom Kim
- Funding Agencies
  - NSF
  - Intel
  - IBM
The Effects of Miss Clustering on the Cost of a Cache Miss

Phil Emma
Allan Hartstein
Thomas R. Puzak
Viji Srinivasan

Dept: Systems Technology and Microarchitecture
IBM T. J. Watson Research Center

Acknowledgments

Arthur Nadas, Jim Mitchell, Jane Bartik, Dan Prener, Peter Oden, Doug Logan, John Griswell, C R Attanasio, Danny Lynch, Moin Qureshi
How Do You Measure the Cost of a Cache Miss?

Pipeline spectroscopy is a new technique that allows classification (analysis) of single events in a processor's pipeline. The initial focus of the project was to develop a means to measure the cost of a cache miss, but spectroscopy leads to much greater insight into pipeline dynamics, including effects due to cache miss behavior, prefetching, pipeline recycles, branch prediction errors, and trailing-edge effects.

The cost of each miss is displayed as a histogram. The graphs are called spectrograms because they reveal certain signature features of the processor's memory hierarchy, the pipeline, and the miss pattern itself (the amount of overlap between misses in the miss cluster).

Processor performance (cycles/instruction) has two components:

    Cycles/Instruction = (Cycles/Instruction)_infinite-cache + (Cycles/Instruction)_finite-cache-adder

The first term is a figure of merit for the overall processor design; the second, the finite cache adder, is a figure of merit for the cache and memory hierarchy.

The memory-hierarchy component can itself be factored. Substituting

    (Cycles/Instruction)_finite-cache-adder = (Cycles/Miss) × (Misses/Instruction)

gives a way to calculate the miss penalty (cycles/miss):

    Cycles/Miss = [(Cycles/Instruction)_overall − (Cycles/Instruction)_infinite-cache] / (Misses/Instruction)

The numerator is the total memory effect; it depends on memory speed, line size, bussing, and the amount of overlap, as well as cache size and the replacement algorithm.
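A small numeric illustration of the decomposition above (the numbers are invented for the example, not taken from the slides): measure CPI with the real (finite) cache and with an idealized infinite cache; the difference is the finite cache adder, and dividing it by the miss rate gives the average cost of a miss.

```python
# Worked example of the miss-penalty calculation (assumed numbers).
cpi_finite   = 1.9       # cycles/instruction with the real memory hierarchy
cpi_infinite = 1.1       # cycles/instruction with an infinite cache
misses_per_instruction = 0.02

finite_cache_adder = cpi_finite - cpi_infinite           # 0.8 cycles/instruction
cycles_per_miss = finite_cache_adder / misses_per_instruction
print(cycles_per_miss)                                   # 40.0 cycles/miss
```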

Constructing Miss Spectrogram

Misses are grouped into clusters, and the run time associated with the instruction sequence that 'surrounds' each miss cluster is compared to the infinite-cache run time of the same instruction sequence. The difference between these two times is used to construct the miss spectrogram.

[Figure: the same instruction sequence I1 ... I5 on an infinite-cache timeline (time t1) and a finite-cache timeline (time t2), with the misses marked. A miss cluster's size is the number of misses that occur during one busy interval of the miss facility; the example contains clusters of size 1 and size 3.]

Summing the per-cluster differences recovers the total finite cache adder:

    Σ_i [(t2 − t1) for the ith cluster] = CYC_FC − CYC_IC

[Figure: example miss spectrograms for cluster sizes 1 through 4 on a hierarchy with L2 = 15 cycles, L3 = 75 cycles, and memory = 300 cycles latency. Each histogram plots the percent of misses versus the per-cluster cost in cycles (0-900); the number of peaks grows with the cluster size.]
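The sketch below shows how such a spectrogram can be assembled from paired simulations, following the construction above: one per-cluster cost (finite-cache time minus infinite-cache time of the surrounding instruction sequence), binned by cluster size. The cluster-record format here is an assumption for illustration, not the IBM tool's format.

```python
# Sketch: build a miss spectrogram from per-cluster timing records.
from collections import defaultdict

def build_spectrogram(clusters, bin_width=15):
    """clusters: iterable of (cluster_size, t_finite, t_infinite) per miss cluster.
    Returns {cluster_size: {cost_bin: count}}."""
    spectro = defaultdict(lambda: defaultdict(int))
    for size, t_fc, t_ic in clusters:
        cost = t_fc - t_ic                     # this cluster's finite-cache adder
        spectro[size][(cost // bin_width) * bin_width] += 1
    return spectro

# Sanity check: summing every cluster's cost reproduces CYC_FC - CYC_IC.
clusters = [(1, 130, 100), (2, 260, 200), (1, 118, 100)]
total_adder = sum(t_fc - t_ic for _, t_fc, t_ic in clusters)   # 108 cycles
print(dict(build_spectrogram(clusters)[1]), total_adder)
```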

Number of Peaks in a Miss Spectrogram

With L memory hierarchy levels and a miss cluster of size C, the number of peaks N in the spectrogram is

    N = C(C + L, C)    (the binomial coefficient "C+L choose C")

Pipeline Spectroscopy Tool Set

[Figure: the timer program emits decode/endop times and miss information (miss address / instruction address); a sort step and a report-output step turn these into the miss spectrogram, a miss/cost report, and a cost/analysis report.]

Miss/Cost Report and Cost Analysis Report

[Table: the miss/cost report lists, per miss cluster, the cluster number, cluster size, cost, the infimum and supremum of the cluster's interval, the miss address, the instruction address, and the instruction number. The cost analysis report lists the highest-count and highest-cost items by ASID and instruction address, with their counts or total cost and percentage of the total.]
Cycles per Miss and Cluster Size Analysis

[Figure: base case, fraction of misses by cluster size; most misses occur in small clusters, with the fraction falling off steadily toward cluster sizes around 20.]

[Figure: cycles per miss versus cluster size; the average cost per miss falls from roughly 70-80 cycles for isolated misses (cluster size 1) toward 20-30 cycles for large clusters, as overlapped misses share their latency.]

How do you use a spectrogram?

- Analyze a prefetching algorithm (hardware or software)
- Analyze a hardware design
- Analyze cluster patterns and the cost of a miss
- Develop a science, or theory, of misses
Theory: Can we predict the shape of a spectrogram (cluster size = 3, 4, or 5) from analyzing smaller miss clusters?

Observations:

- For a cluster of size 1, an access can hit or miss in the L2: H or M
- For a cluster of size 2, there are 2^2 = 4 possible outcomes: HH, HM, MH, MM
- For a cluster of size 3, there are 2^3 = 8 possible outcomes: HHH, HHM, HMH, HMM, MHH, MHM, MMH, MMM
- For a cluster of size 4, there are 2^4 = 16 possible outcomes: HHHH, ..., MMMM
- In general, for a cluster of size C there are 2^C possible outcomes

[Figure: measured spectrograms for cluster sizes 1 through 4 on a hierarchy with L2 = 15 cycles and memory = 100 cycles.]
Number of Peaks in a Spectrogram with Cluster Size C and N Hierarchy Levels

A cluster of C misses has 2^C hit/miss combinations; the number of peaks is the number of unique latency sums, obtained by determining the unique sums for N items chosen C at a time:

    Peaks = Σ_{i=0..C} multichoose(N, i)
          = C(N-1, 0) + C(N, 1) + C(N+1, 2) + C(N+2, 3) + ... + C(N+C-1, C)

using the relation multichoose(N, k) = C(N+k-1, k). The first two terms can be combined using Pascal's rule,

    C(N, k) = C(N-1, k-1) + C(N-1, k)        (1a)

giving

    = C(N+1, 1) + C(N+1, 2) + C(N+2, 3) + ... + C(N+C-1, C)

and applying (1a) repeatedly, the series collapses to

    Peaks = C(N + C, C)

[Figure: number of peaks versus cluster size (1-6) for 1, 2, and 3 memory hierarchy levels.]
Theory: Predicting the Shape of a Spectrogram from a Smaller Cluster Spectrogram

Observations (with the L2 configured to have 50/50% hit/miss probabilities):

If hits and misses in the L2 were independent, we could use the binomial distribution to describe H/M clusters. For example, if Pr[M] = p then Pr[H] = (1-p), and the probability of k misses in a cluster of size N is

    C(N, k) p^k (1-p)^(N-k)

So in a cluster of size 1, Pr[M] = Pr[H] = .50, and in a cluster of size 2,

    Pr[HH] = Pr[HM] = Pr[MH] = Pr[MM] = .25

The measured hit/miss probabilities for cluster sizes 2 and 3, however, are:

    Cluster = 2:   HH .306   HM .175   MH .188   MM .331
    Cluster = 3:   HHH .208  HHM .105  HMH .090  HMM .098  MHH .103  MHM .094  MMH .083  MMM .219
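For contrast with the measured values above, the snippet below computes what the independence assumption would predict; the gap (e.g., MM at .331 measured versus .25 under independence) is what motivates the correlation parameter introduced next.

```python
# What independent hits/misses would predict for cluster patterns (p = 0.5).
from math import comb
from itertools import product

def binomial_pattern_prob(pattern, p_miss=0.5):
    """Probability of an exact H/M pattern if accesses were independent."""
    k = pattern.count("M")
    return (p_miss ** k) * ((1 - p_miss) ** (len(pattern) - k))

print({"".join(s): binomial_pattern_prob("".join(s))
       for s in product("HM", repeat=2)})     # all four patterns -> 0.25
print(comb(3, 2) * 0.5**2 * 0.5**1)           # P[exactly 2 misses in a cluster of 3]
```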
[Figure: individual miss spectrograms for cluster size = 2 (HH, HM, MH, MM) and cluster size = 3 (HHH, HMH, MHM, MMM shown), with L1 = 64KB, L2 = 256KB at 15-cycle latency, and L3 at 100-cycle latency, for an OLTP workload. Each pattern's histogram is annotated with its measured overlap/no-overlap split: for cluster size 2, MM = 35/65, MH = 62/38, HM = 34/66, HH = 45/55; for cluster size 3, HHH = 12/48/40, HMH = 21/46/33, MHM = 19/13/45/23, MMM = 15/42/43.]
What is a?

Let Xi represent the ith hit or miss, with 0 representing a hit and 1 a miss, so Xi = M or Xi = H is equivalent to Xi = 1 or Xi = 0. Let Pr[M] = P and Pr[H] = (1-P). If hit and miss probabilities were independent, then Prob[HH] = Prob[HM] = Prob[MH] = Prob[MM] = .25 (for P = .5).

We seek a function that increases Pr[MM] above P×P (and, by the same technique, Pr[HH] above (1-P)×(1-P)). To increase Pr[X2=M | X1=M], define a, 0 ≤ a ≤ 1, and form the convex combination between P and 1:

    Pr[X2=M | X1=M] = aP + (1-a) = 1 - a + aP        so   Pr[MM] = P (1 - a + aP)

To increase Pr[X2=H | X1=H], form the convex combination between (1-P) and 1:

    Pr[X2=H | X1=H] = (1-P) a + (1-a) = 1 - aP        so   Pr[HH] = (1-P)(1 - aP)

These functions have the properties we want: for 0 ≤ a ≤ 1 and P ≤ 1, (1 - a + aP) ≥ P and (1 - aP) ≥ (1 - P).

The parameter a is tied to the correlation coefficient. By definition,

    q = COV(X1, X2) / sqrt(Var(X1) Var(X2))

with

    COV(X1, X2) = E[X1 X2] − E[X1] E[X2]
    E[X1 X2] = Pr[X1 X2 = 1] = Pr[X1 = 1, X2 = 1] = p (1 - a + ap)
    E[X1] = E[X2] = p
    COV(X1, X2) = p (1 - a + ap) − p^2 = (1-a) p (1-p)
    Var(X1) = Var(X2) = E[X1^2] − p^2 = p (1-p)

so

    q = (1-a) p (1-p) / [p (1-p)] = (1-a),   i.e.,   a = 1 − q

As q → 1 (the correlation between Xi and Xi+1 becomes stronger), a → 0 and Pr[X2=M | X1=M] = (1 - a + ap) → 1 and Pr[X2=H | X1=H] = (1 - aP) → 1: perfect correlation. As q → 0 (the correlation becomes weaker), a → 1 and Pr[X2=M | X1=M] → P and Pr[X2=H | X1=H] → (1-P): as if they were independent.
For example, the H/M sequences HMM, MHHM, and HMMMH then have probabilities

    Pr[HMM]   = (1-p)(ap)(1-a+ap)
    Pr[MHHM]  = (p)(a-ap)(1-ap)(ap)
    Pr[HMMMH] = (1-p)(ap)(a-ap)(1-a+ap)^2

The probability state transition matrix for the HH, HM, MH, and MM events is

                 Xi+1 = Hit     Xi+1 = Miss
    Xi = Hit     1 - ap         ap
    Xi = Miss    a - ap         1 - a + ap

Using the transition matrix we can generate a hit/miss probability for any cluster size: the probability of an H/M sequence of any length can be calculated with the generating function

    (1-p) (1-ap)^w (ap)^x (a-ap)^y (1-a+ap)^z    if the sequence starts with a hit
    (p)   (1-ap)^w (ap)^x (a-ap)^y (1-a+ap)^z    if the sequence starts with a miss

where w, x, y, z count the HH, HM, MH, and MM transitions and w + x + y + z = C − 1 (C = cluster size).

Find the value of a that is the 'best fit' to the cluster = 2 data, then generate the hit/miss sequence probabilities for larger clusters:

    Cluster = 2   HH      HM      MH      MM
    Real          .306    .175    .188    .331
    Predicted     .308    .182    .182    .328

    Cluster = 3   HHH     HHM     HMH     HMM     MHH     MHM     MMH     MMM
    Real          .208    .105    .090    .098    .103    .094    .083    .219
    Predicted     .193    .115    .065    .117    .115    .068    .117    .210
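The sketch below implements this first-order Markov model: it evaluates the pattern probability from the transition matrix, crudely fits a to the measured cluster-size-2 data by a grid search (the slides do not say which fitting procedure was used, so this is an assumption), and then predicts the cluster-size-3 pattern probabilities.

```python
# Sketch: fit the correlation knob a to cluster = 2 data and predict cluster = 3.
from itertools import product

def pattern_prob(pattern, p, a):
    """Probability of an H/M pattern under the transition matrix above."""
    prob = p if pattern[0] == "M" else (1 - p)
    trans = {("H", "H"): 1 - a * p, ("H", "M"): a * p,
             ("M", "H"): a - a * p, ("M", "M"): 1 - a + a * p}
    for prev, nxt in zip(pattern, pattern[1:]):
        prob *= trans[(prev, nxt)]
    return prob

measured2 = {"HH": .306, "HM": .175, "MH": .188, "MM": .331}
p = measured2["MH"] + measured2["MM"]           # marginal miss probability
candidates = [i / 1000.0 for i in range(1001)]  # coarse grid search for a
a = min(candidates, key=lambda a: sum((pattern_prob(k, p, a) - v) ** 2
                                      for k, v in measured2.items()))
print(round(a, 3))
print({"".join(s): round(pattern_prob("".join(s), p, a), 3)
       for s in product("HM", repeat=3)})       # predicted cluster-3 pattern probs
```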

How do we determine overlap?

Measure the overlap/no-overlap split for each hit/miss pattern at cluster size = 2.

[Figure: individual miss spectrograms for cluster size = 2 (L1 = 64KB, L2 = 256KB at 15-cycle latency, L3 at 100-cycle latency, OLTP data), annotated with the measured overlap fractions: MM = 35/65, MH = 62/38, HM = 34/66, HH = 45/55.]

[Figure: Kolmogorov-Smirnov test for the predicted hit/miss probabilities for clusters = 2, 3, 4, and 5; histograms of the number of workloads versus the maximum difference between the predicted and measured distributions (x-axis roughly 0.01-0.19).]
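The validation above reports, per workload, the maximum difference between predicted and measured distributions, which is a Kolmogorov-Smirnov-style statistic. A minimal sketch of that comparison over binned spectrograms is below; the dictionary-of-bins input format is our assumption.

```python
# Sketch: maximum CDF difference between a measured and a predicted spectrogram.
def ks_statistic(hist_measured, hist_predicted):
    """Both inputs: {bin: probability mass}. Returns max |CDF_m - CDF_p|."""
    bins = sorted(set(hist_measured) | set(hist_predicted))
    cdf_m = cdf_p = 0.0
    worst = 0.0
    for b in bins:
        cdf_m += hist_measured.get(b, 0.0)
        cdf_p += hist_predicted.get(b, 0.0)
        worst = max(worst, abs(cdf_m - cdf_p))
    return worst

print(ks_statistic({15: .6, 30: .4}, {15: .55, 30: .45}))   # 0.05
```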
[Figure: constructing a predicted miss spectrogram from the cluster = 2 statistics for the OLTP workload. Each cluster-size-2 pattern probability (HH .306, HM .175, MH .188, MM .331) is split by its overlap/no-overlap fraction (HH 45/55, HM 34/66, MH 62/38, MM 35/65) and its mass is placed at the corresponding overlapped or serialized latency (e.g., 15, 30, 100, 115, or 200 cycles).]

[Figure: overlap/no-overlap probability trees for the cluster-size-3 patterns HHH and MMH (OLTP data). Starting from the pattern probability (.193 for HHH, .117 for MMH), each level of the tree splits on the overlap/no-overlap fractions of successive miss pairs; the resulting leaf probabilities (.039, .048, .048, .059 for HHH and .025, .016, .047, .049 for MMH) are assigned to the corresponding latency sums.]
[Figure: predicted spectrograms for cluster sizes 3 and 4 (L1 = 64KB, L2 = 256KB at 15-cycle latency, L3 at 100-cycle latency, OLTP data), plotting the percent of misses versus the miss penalty in cycles.]

[Figure: Kolmogorov-Smirnov test for the predicted hit/miss probabilities for clusters = 2, 3, 4, and 5; histograms of the number of workloads versus the maximum difference between the predicted and measured spectrograms.]
Conclusions

- Agreement between theory and experiment is very good.
- The model uses six parameters:
  - the L2 hit/miss rate
  - a correlation parameter
  - 4 overlap/non-overlap parameters gleaned from the four H/M patterns at cluster size = 2
- Pipeline spectroscopy allows us to apply statistical probabilities to analyze miss behavior in a multilevel memory hierarchy.
- Pipeline spectroscopy allows quantitative analysis of individual miss penalties to considerable precision, and provides a means to quantitatively compare different prefetching schemes and memory hierarchies.
- The same analysis can be used to determine branch wrong-guess penalties, AGI penalties, software analysis, or run-ahead effects.
- Further analysis is needed to gain insight into overlap/non-overlap behavior, programming structure, and hardware limitations.
- Future work is needed to study in-order versus out-of-order spectrogram differences, SMT, and MLP organizations.