
Practical Cache Performance Modeling for Computer Architects

Yan Solihin, NCSU, solihin@ncsu.edu
Fei Guo, NCSU, fguo@ncsu.edu
Thomas Puzak, IBM, trpuzak@us.ibm.com
Phil Emma, IBM, pemma@us.ibm.com

Computer Performance Modeling

- Goal: estimating and understanding the performance of computer systems
- Low-Level Models
  - Various levels of detail: functional, trace-driven, cycle-accurate, etc.
  - Pros
    - Versatile: adaptable to new architectures and workloads
    - Detailed: can embed very detailed statistics
    - Popular: SimpleScalar and SIMICS are widely used
  - Cons
    - O(n) overhead that scales with workload size: each instruction/event is simulated to capture its effect
    - Slow: a realistic workload has many billions of instructions

Computer Performance Modeling

- High-Level Models
  - Purpose: evaluating gross trade-offs of designs
  - Pros
    - Short execution time, sometimes O(1)
    - Requires little coding
    - Reveals basic relationships of variables
    - May reveal non-obvious trends and insights
  - Cons
    - Less versatile
    - Requires performance-modeling expertise
  - Uses:
    - Early design cycle: for pruning the design search space
    - Entire design cycle: to re-evaluate the design search space if requirements change

Modeling Methods

- White Box
  - The model incorporates knowledge about the system (e.g., relationships among parameters are known a priori)
  - Analytical or heuristics-based
  - Pros: models reveal insights and explain "why"; no training required
  - Cons: problem-specific solutions
- Black Box
  - The model learns knowledge about the system
  - AI-based: neural networks, decision trees, curve fitting, etc.
  - Pros: can model complex problems/systems
  - Cons: prediction without insight; requires training
Focus of this tutorial

- Types:
  - High level, hybrid high/low level
- A priori knowledge:
  - White box
- Scope:
  - Miss count and rate
  - Miss cost
  - Bandwidth usage

Program

- 8:30 – 8:45: Introduction
- 8:45 – 9:00: Capturing temporal locality behavior
- 9:00 – 9:30: Modeling cache sharing
- 9:30 – 10:00: Modeling cache replacement policy
- 10:00 – 10:30: Coffee break
- 10:30 – 11:30: Analysis of the effects of miss clustering on the cost of a cache miss
- 11:30 – 12:30: Interaction of caching and bandwidth pressure
Capturing Temporal Locality Behavior

Temporal Locality Behavior

- Programs exhibit locality of reference
- Spatial locality: the neighbors of recently-accessed data tend to be accessed in the near future
- Temporal locality: recently-accessed data tends to be accessed again (reused) in the near future
- Significance: temporal reuse and cache parameters determine all non-cold misses
  - If each memory block is accessed exactly once, we only have cold misses
  - Cold misses are affected by block size
- How can we capture temporal locality?
Stack Distance Profiling [Mattson'70]

- An early attempt to capture temporal reuse behavior
- Models an LRU stack with a counter for each stack position
- Example: fully-associative cache with an 8-entry stack
  - C1: incremented whenever the MRU block is accessed
  - C2: incremented whenever the 2nd MRU block is accessed
  - C3: incremented whenever the 3rd MRU block is accessed
  - ...
  - C8: incremented whenever the 8th MRU (i.e., LRU) block is accessed
  - C>8: incremented whenever the 9th, 10th, ... block is accessed, i.e., on a miss

Typical Shape

- Empirical observation: the counters follow a geometric (exponential) sequence
- This is due to temporal locality
- C_{i+1} = C_i × r, where 0 < r < 1 is the common ratio

[Figure: histogram of the percent of accesses falling in each stack distance counter C1 ... C8 and C>8, decaying roughly geometrically from about 30% at C1.]
Stack Distance Properties

- For a fully-associative LRU cache with A blocks, the number of misses of the cache is

      Misses = Σ_{i=A+1..∞} C_i

- For an A-way set-associative LRU cache, we can collect a set-specific stack distance profile, and the number of misses of each set is

      Misses_set = Σ_{i=A+1..∞} C_i(set)

- Alternatively, keep per-set stacks but use a global set of counters

Where to Profile

- Profiling the access stream seen by the L1 instruction and data caches captures temporal reuse patterns at the L1 level ⇒ predicts cache misses for various L1 cache configurations
- Profiling the access stream seen by the L2 cache captures temporal reuse patterns at the L2 level ⇒ predicts cache misses for various L2 cache configurations

[Figure: memory hierarchy with L1 instruction and data caches, an L2 cache, and memory, marking the two profiling points.]
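The sketch below illustrates the two ideas above: collecting a stack distance profile for a fully-associative LRU cache and then predicting the miss count for any associativity from the same profile. It is a minimal illustration, not the tutorial's tool; the function names are ours, and cold (first-touch) accesses are kept in a separate bucket so they can be counted as misses for every associativity.

```python
# Minimal sketch of stack distance profiling for a fully-associative LRU cache.
from collections import defaultdict

def stack_distance_profile(trace):
    """trace: iterable of block addresses. Returns {depth: count}, where depth is
    the LRU-stack position (1 = MRU) of each access, or None for cold accesses."""
    stack = []                      # LRU stack, index 0 = MRU
    counters = defaultdict(int)
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1   # 1-based stack distance
            counters[depth] += 1
            stack.remove(block)
        else:
            counters[None] += 1              # cold (first) access
        stack.insert(0, block)               # accessed block becomes MRU
    return counters

def predict_misses(counters, assoc):
    """Misses of an `assoc`-block fully-associative LRU cache:
    all accesses with stack distance > assoc, plus cold accesses."""
    return sum(c for d, c in counters.items() if d is None or d > assoc)

# One profiling pass predicts misses for any associativity.
trace = "ABCBADABEA"
prof = stack_distance_profile(trace)
print([predict_misses(prof, A) for A in (1, 2, 4, 8)])
```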
Limitations of Stack Distance Profile

- Useful only for predicting cache misses across different cache associativities
- For other purposes, we need to capture temporal reuse patterns in greater detail
- So, use circular sequence profiling [Chandra'05]
  - Extends stack distance profiling
  - Counts the occurrences of cseq(d, n)

Definitions

- seq(d, n) = a sequence of n accesses to d distinct addresses (in a cache set)
- cseq(d, n) (circular sequence) = a sequence in which the first and the last accesses are to the same address
- Example: in the access stream A B C D A E E B, the entire stream is seq(5,8); A B C D A is cseq(4,5); B C D A E E B is cseq(5,7); E E is cseq(1,2)
Relationship with Stack Distance Profile

- C_x = the number of circular sequences cseq(d = x, n = any value), i.e.,

      C_x = Σ_{n=x..∞} cseq(x, n)

- Hence, the stack distance profile is a subset of (can be derived from) the circular sequence profile

Collecting Circular Sequence Profile

[Figure: per-set collection mechanism. Each cache set keeps an LRU stack (here 4 entries) with an access counter per stack position and a table of cseq(d, n) counters, updated as the set's access stream A B C B A is processed.]
[Figure sequence: collecting the circular sequence profile for the access stream A B C B A in one set of a 4-way cache. Every access to the set increments the counter of each block already on the stack; the accessed block is then placed at the MRU position with its counter reset to 1. When the accessed block is already on the stack at depth d with counter value n, a circular sequence cseq(d, n) has been found and the (d, n) table entry is incremented. For this stream, the second access to B finds cseq(2, 3) and the second access to A finds cseq(3, 5), so the final profile contains one count each for cseq(2,3) and cseq(3,5).]
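A compact sketch of this collection mechanism, matching the running example above, is shown below. Function names are ours, not ACAPP's, and it handles a single cache set; a real profiler would keep one such structure per set.

```python
# Minimal sketch of collecting a circular-sequence profile for one cache set.
from collections import defaultdict

def cseq_profile(set_trace):
    """set_trace: block addresses mapping to one set.
    Returns {(d, n): count} of circular sequences cseq(d, n)."""
    stack = []                      # [(block, accesses_since_last_use)], MRU first
    profile = defaultdict(int)
    for block in set_trace:
        # Every access to the set ages all blocks currently on the stack.
        stack = [(b, c + 1) for b, c in stack]
        pos = next((i for i, (b, _) in enumerate(stack) if b == block), None)
        if pos is not None:
            d = pos + 1             # distinct addresses in the circular sequence
            n = stack[pos][1]       # accesses in the reuse window, incl. both ends
            profile[(d, n)] += 1    # found cseq(d, n)
            del stack[pos]
        stack.insert(0, (block, 1)) # block becomes MRU, its counter restarts
    return profile

# The running example from the slides: A B C B A yields cseq(2,3) and cseq(3,5).
print(dict(cseq_profile("ABCBA")))   # {(2, 3): 1, (3, 5): 1}
```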

Predicting Contention Across Cache
Shared Cache Challenge

- In today's CMPs, the L2 cache is shared by multiple cores
- Applications on different cores compete for L2 cache space

[Figure: a CMP with two processors, each with a private L1 cache, sharing one L2 cache.]

Impact of Cache Space Contention

[Figure: mcf's L2 cache misses (up to roughly 400% of the alone case) and mcf's normalized IPC when co-scheduled with mst, gzip, art, or swim, compared to running alone.]

- The impact is application-specific and coschedule-specific
- The impact is significant: up to 4X more cache misses and a 65% IPC reduction
- How do we model the impact of cache sharing?
Modeling Goal

- Given n applications, predict the miss rates of any pair of applications
- Input:
  - Behavior of each application
  - Cache parameters
  - Relative speed when the pair runs together
- Output:
  - Number of cache misses for each application in the pair

Assumptions

- LRU replacement algorithm
- Applications share nothing
  - Mostly true for sequential apps (except for library and OS code)
- Applications are not similar
  - Parallel apps: threads likely show uniform behavior, so predicting their miss rates is trivial
Circular Sequence Properties

- Thread X runs alone in the system:
  - Given a circular sequence cseqX(dX, nX), its last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y:
  - If a sequence of intervening accesses seqY(dY, nY) occurs during cseqX(dX, nX)'s lifetime, the last access of thread X is a miss iff dX + dY > Assoc

Example

- Assume a 4-way associative cache
- X's circular sequence is cseqX(2,3) = A B A; Y's intervening access sequence during its lifetime is U V V W
- Without cache sharing: the reuse of A is a cache hit (dX = 2 ≤ 4)
- With cache sharing: is the reuse of A a hit or a miss?
Example (continued)

- Assume a 4-way associative cache, with cseqX(2,3) = A B A and Y's intervening accesses U V V W
- Interleaving A U B V V A W: only seqY(2,3) intervenes in cseqX's lifetime, so dX + dY = 2 + 2 ≤ 4 ⇒ the reuse of A is a cache hit
- Interleaving A U B V V W A: seqY(3,4) intervenes in cseqX's lifetime, so dX + dY = 2 + 3 > 4 ⇒ the reuse of A is a cache miss

Inductive Probability Model

- Define Pmiss(cseqX) = the probability that the last access of cseqX is a cache miss
- For each cseqX(dX, nX) of thread X:
  - Compute the number of intervening accesses from thread Y during cseqX's lifetime ⇒ denote it nY
  - dY can be 1, 2, ..., nY ⇒ compute the probability of each dY, denoted P(seq(dY, nY))
  - For each dY = 1, 2, ..., nY:
    - If dY + dX > Assoc, add P(seq(dY, nY)) to Pmiss(cseqX)
    - If dY + dX ≤ Assoc, Pmiss(cseqX) is unchanged
  - Misses = old_misses + Σ Pmiss(cseqX) × F(cseqX), where F(cseqX) is the number of occurrences of cseqX in the profile
Computing P(seq(dY, nY))

- Basic idea: a seq(d, n) is formed from a shorter sequence by one more access, which is either to a new address (extending seq(d-1, n-1)) or to an already-seen address, i.e., forming a circular sequence with one of the 1..d distinct addresses (extending seq(d, n-1))
- This is a Markov process with 3 states and 2 edges:

      P(seq(d, n)) = A × P(seq(d-1, n-1)) + B × P(seq(d, n-1)), where
      B = (Σ_{i=1..d} C_i) / (Σ_{i=1..∞} C_i)  and  A = 1 − B

Overall Formula

- Define

      P(d−) = (Σ_{i=1..d} C_i) / (Σ_{i=1..∞} C_i)  and  P(d+) = 1 − P(d−)

- P(seq(d, n)) is then computed by:

      P(seq(d, n)) =
        1                                                        if d = n = 1
        P((d−1)+) × P(seq(d−1, n−1))                             if d = n > 1
        P(1−) × P(seq(1, n−1))                                   if n > d = 1
        P(d−) × P(seq(d, n−1)) + P((d−1)+) × P(seq(d−1, n−1))    if n > d > 1
Example

- Expanding the recursion for seq(2,3):

      P(seq(2,3)) = P(1+) × P(seq(1,2)) + P(2−) × P(seq(2,2))
      P(seq(1,2)) = P(1−) × P(seq(1,1)) = P(1−)
      P(seq(2,2)) = P(1+) × P(seq(1,1)) = P(1+)

  so P(seq(2,3)) = P(1+) × P(1−) + P(2−) × P(1+)

Final prediction

- After we obtain Pmiss(cseqX(dX, nX)) for all cseqX(dX, nX),
- Predict the total misses for thread X:

      missX = oldmissX + Σ_{dX=1..A} Pmiss(cseqX(dX, nX)) × C_{dX}
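The sketch below pulls the last few slides together: the P(seq(d, n)) recursion driven by thread Y's stack distance counters, and the expected number of extra misses thread X suffers from sharing. It is a simplified illustration under stated assumptions (nY, the number of intervening Y accesses during a cseqX, is supplied by a caller-provided function, e.g., derived from the relative speeds of the two threads); names such as `make_p_seq` are ours.

```python
# Sketch of the inductive probability model for cache sharing.
from functools import lru_cache

def make_p_seq(counters_Y):
    """counters_Y[d-1] = C_d of thread Y (its stack distance profile).
    Returns a function P(seq(d, n)) implementing the recursion above."""
    total = float(sum(counters_Y))
    def p_minus(d):                          # P(d-) = sum_{i<=d} C_i / sum_i C_i
        return sum(counters_Y[:d]) / total
    def p_plus(d):                           # P(d+) = 1 - P(d-)
        return 1.0 - p_minus(d)

    @lru_cache(maxsize=None)
    def p_seq(d, n):
        if d == n == 1:
            return 1.0
        if d == n:                           # every access was to a new address
            return p_plus(d - 1) * p_seq(d - 1, n - 1)
        if d == 1:                           # n > d = 1
            return p_minus(1) * p_seq(1, n - 1)
        return (p_minus(d) * p_seq(d, n - 1) +          # n > d > 1
                p_plus(d - 1) * p_seq(d - 1, n - 1))
    return p_seq

def extra_misses_X(cseq_profile_X, n_Y_of, assoc, p_seq_Y):
    """cseq_profile_X: {(dX, nX): frequency}. n_Y_of(nX): assumed number of
    intervening Y accesses during a cseqX of length nX. Returns the expected
    number of additional misses thread X suffers because of sharing."""
    extra = 0.0
    for (dX, nX), freq in cseq_profile_X.items():
        if dX > assoc:                       # already a miss when running alone
            continue
        nY = n_Y_of(nX)
        p_miss = sum(p_seq_Y(dY, nY) for dY in range(1, nY + 1)
                     if dX + dY > assoc)
        extra += p_miss * freq
    return extra
```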

Observations

- Applications differ in how vulnerable they are to the impact of cache sharing:
  - Some are highly vulnerable
  - Some are not vulnerable
  - Many are somewhat / sometimes vulnerable
- Insights:
  - Traditional characterizations are not indicative of the impact of sharing:
    - Low vs. high IPC
    - Integer vs. floating-point
    - High miss rate vs. low miss rate
  - Rather, the interaction of temporal reuse behaviors determines the impact of cache sharing

Modeling Replacement Policy Performance
Motivation

- Cache design is critical to performance
  - Memory wall: a cache miss costs hundreds of processor cycles
  - Capacity pressure: multi-core designs, virtual machines
- Important parameters: size, associativity, block size, and replacement policy

Motivation

- Performance variation due to the replacement policy is significant

[Figure: L2 miss rates and normalized execution times of art, ammp, and cg under LRU versus Rand-MRUskw; the skewed-random policy cuts the miss rate substantially (e.g., to 67% and 47% of the LRU miss rate) and reduces execution time by roughly 13-32%.]

- There is no agreement on the "best implementation"
  - Intel Pentium: LRU
  - Intel XScale: FIFO
  - IBM Power4: tree-based pseudo-LRU
  - Others: round robin, random, replacement hints, etc.
Motivation

- There is no analytical model; past models assume
  - LRU [Cascaval03, Chandra05, Ghosh97, Quong94, Sen02, Singh92, Suh01]
  - or Random [Agarwal89, Berg04, Ladner99]
- Assuming LRU or Random simplifies modeling, but
  - ignores the performance variation due to the replacement policy
  - is inaccurate for highly associative caches

Would be useful to model replacement policies

[Figure: the circular sequence profiles of App 1 ... App N, together with each replacement policy's Replacement Probability Function (RPF) for RP 1 ... RP M, feed a prediction model that outputs the predicted miss rate of each app under each replacement policy.]
Outline

- Input of the Model
  - Replacement Probability Function (RPF)
  - Circular Sequence Profiles
- Replacement Policy Model
  - Markov states
  - Markov state transitions
- Case Study / Validation
- Conclusions
Replacement Probability Function (RPF)

- The RPF, denoted Prepl(.), is a probability function where Prepl(i) is the probability that the cache block at the ith stack position is replaced on a cache miss

[Figure: Prepl(i) over stack positions 1-8 of an 8-way associative cache for LRU, NMRU1, NMRU4, Rand-MRUskw, and Rand-LRUskw; LRU places all probability on position 8, while the other policies spread it across positions.]

- The stack is only needed for modeling, not necessarily in the hardware implementation
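As a hedged illustration of what an RPF looks like as model input, the snippet below writes a few policies as probability vectors over the 8 stack positions. True LRU and uniform random are exact; "NMRU1" is written under the interpretation (our assumption) that it protects only the MRU block and picks uniformly among the rest, and the skewed-random shapes from the plots above are omitted because their exact values are not given here.

```python
# RPFs of a few policies as vectors for an 8-way cache (illustrative sketch).
ASSOC = 8
RPF = {
    "LRU":    [0.0] * 7 + [1.0],                   # always evict the LRU position
    "Random": [1.0 / ASSOC] * ASSOC,               # every position equally likely
    "NMRU1":  [0.0] + [1.0 / (ASSOC - 1)] * (ASSOC - 1),  # assumed: protect MRU only
}
assert all(abs(sum(p) - 1.0) < 1e-9 for p in RPF.values())
```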

Tracking Cache Miss Probability

- Basic idea (A = target block):
  - Reconstruct each circular sequence by adding one access at a time
  - Meanwhile, track whether the target block has been replaced
- Markov state = (d, n, p), where
  - d = number of distinct addresses yet to appear
  - n = number of accesses yet to appear
  - p = current stack position of the target block

Illustration

- Running example: cseq(4,5) = A B C D A, where the final access to A is the target access
- Assume
  - a 4-way associative cache ⇒ a 4-entry stack
  - an NMRU-2 replacement policy ⇒ Prepl(3) = Prepl(4) = 0.5
- Goal
  - Compute the probability that the target access misses
[Figure sequence: walking cseq(4,5) = A B C D A through the 4-entry LRU stack.
Initial state: (d=4, n=5, p=∞). After the first access to A: (d=3, n=4, p=1).
After B: (d=2, n=3, p=2), stack B A. After C: (d=1, n=2, p=3), stack C B A.
The access to D is a miss; under NMRU-2 it replaces A (stack position 3) with probability 1/2, reaching the final state (d=0, n=1, p=∞) in which the target access to A is a cache miss.
With the remaining probability 1/2, A survives at stack position 4 and the target access hits.
So the probability that cseq(4,5)'s target access is a cache miss is 0.5.]

Modeling Overview

- Track the current state and the transition probabilities into new states
- Final states:
  - The target block has been replaced ⇒ cache miss
  - p > cache associativity ⇒ cache miss
  - The end of the circular sequence is reached
- Accumulate the probabilities of a cache miss
State Transitions

- The new state depends on 8 events, built from four attributes of the next access:
  - Dist / NoDist
    - Dist: the new access is to a "distinct" address (not seen before in this circular sequence)
  - Miss / Hit
    - Miss: the new access is a cache miss
  - Rp / NoRp
    - Rp: the new access causes the target block to be replaced
  - Shift / NoShift
    - Shift: the new access causes the target block to be shifted down in the LRU stack
- Notes:
  - PDist, PRp, and PShift are directly computable (see [SIGMETRICS'06])
  - PShift depends on the RPF
  - PMiss is the object of the prediction

State Transitions Diagram

[Figure: from state (d, n, p), the 8 events lead to:
1: Dist, Miss, NoRp, NoShift → (d-1, n-1, p)
2: Dist, Miss, NoRp, Shift → (d-1, n-1, p+1)
3: Dist, Miss, Rp → end of state (target replaced)
4: NoDist, Miss, NoRp, NoShift → (d, n-1, p)
5: NoDist, Miss, NoRp, Shift → (d, n-1, p+1)
6: NoDist, Miss, Rp → end of state (target replaced)
7: Dist, Hit → (d-1, n-1, p+1)
8: NoDist, Hit → (d, n-1, p)]
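The full state-transition model depends on probabilities (PDist, PHit, ...) whose derivation is in [SIGMETRICS'06] rather than in these slides, so the sketch below specializes it to the situation used in the illustration: a cseq(d, n) whose d-1 intervening accesses are all to distinct addresses and all miss. It tracks the probability distribution of the target block's stack position as each miss picks a victim according to the RPF. Names are ours; this reproduces the 0.5 result of the NMRU-2 example.

```python
# Simplified sketch of the replacement-policy Markov model (all-distinct,
# all-miss intervening accesses). rpf[i-1] = Prepl(i); associativity = len(rpf).
def p_miss_cseq(d, rpf):
    """Probability that the target reuse of cseq(d, *) misses."""
    dist = {1: 1.0}          # distribution over the target block's stack position
    p_evicted = 0.0
    for _ in range(d - 1):   # one cache miss per intervening distinct address
        new_dist = {}
        for pos, prob in dist.items():
            p_evicted += prob * rpf[pos - 1]     # victim is the target itself
            p_shift = sum(rpf[pos:])             # victim below -> target pushed down
            p_stay = sum(rpf[:pos - 1])          # victim above -> target stays put
            if p_shift:
                new_dist[pos + 1] = new_dist.get(pos + 1, 0.0) + prob * p_shift
            if p_stay:
                new_dist[pos] = new_dist.get(pos, 0.0) + prob * p_stay
        dist = new_dist
    return p_evicted

# NMRU-2 illustration: cseq(4,5) on a 4-way cache with Prepl(3) = Prepl(4) = 0.5
print(p_miss_cseq(4, [0.0, 0.0, 0.5, 0.5]))   # 0.5
print(p_miss_cseq(4, [0.0, 0.0, 0.0, 1.0]))   # LRU: 0.0 (d <= assoc never misses)
```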
Case Study: Using Model Only

- Goal: when is LRU pathological?
  - Hard to pinpoint with simulations because there are many possible contributing factors
  - Isolate the impact of the temporal reuse pattern
- Use synthetic stack distance profiles: unimodal, bimodal, and continuous working sets
- Assume associativity A = 8

Case Study: Unimodal

- Case 1: unimodal working set (a single peak of accesses at one stack position)

[Figure: L2 miss rate versus the peak's stack position (1-31) for LRU, NMRU1, NMRU4, Rand-LRUskw, and Rand-MRUskw on an 8-way cache.]

- LRU exhibits pathological performance once the peak stack distance exceeds the associativity
- Miss rates: Rand-MRUskw < NMRU1 < Rand-LRUskw < NMRU4
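A hedged sketch of how such a synthetic unimodal profile can be built and why LRU is pathological for it: if essentially all reuses occur at a single stack distance, LRU is all-or-nothing (near-0% misses when the peak fits within the associativity, near-100% when it does not), whereas randomized policies degrade gradually. The exact peak shapes used in the study may differ from this simplification.

```python
# Synthetic unimodal stack distance profile and its LRU miss rate (sketch).
def unimodal_profile(peak, depth=32, reuses=10000):
    prof = [0] * depth
    prof[peak - 1] = reuses          # (almost) all reuses at one stack distance
    return prof

def lru_miss_rate(profile, assoc):
    misses = sum(profile[assoc:])    # reuses deeper than the associativity miss
    return misses / float(sum(profile))

print([lru_miss_rate(unimodal_profile(p), assoc=8) for p in (6, 8, 9, 12)])
# -> [0.0, 0.0, 1.0, 1.0]: LRU flips from perfect to pathological at the peak = A boundary
```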

Case Study: Bimodal

- Case 2: bimodal working set (two peaks of accesses, Peak1 and Peak2, at different stack positions)

[Figure: L2 miss rate versus Peak2's stack position (9-16) for LRU, NMRU1, NMRU4, Rand-LRUskw, and Rand-MRUskw, with Peak1 fixed at stack position 1.]

- LRU again exhibits pathological performance
- Miss rates: Rand-MRUskw < NMRU1 < Rand-LRUskw < NMRU4
  - The performance ordering is the same as observed for art, ammp, and cg
  - In fact, those applications have approximately bimodal stack distance profiles

Conclusions

- Two useful modeling tools:
  - Circular sequence profiling
    - For capturing temporal reuse patterns
  - Markov processes
    - For capturing the probabilities of certain cache states
    - A cache miss is an event associated with certain cache states
- Modeling can reveal non-obvious insights:
  - Cache miss rates due to shared cache space contention
    - Not capturable by simple metrics: low vs. high IPC, low vs. high miss rates, integer vs. floating-point
    - Determined by the interaction of the temporal reuse patterns of the co-scheduled applications
  - Choosing a replacement policy
    - LRU has quite a few pathological cases
    - For apps with a working set 1-4X the cache size, other policies outperform LRU
How to Use ACAPP

Introduction

- The Analytical CAche Performance Prediction (ACAPP) tool suite
  - Prediction for different cache associativities
  - Prediction for different cache replacement policies
  - Prediction of cache contention when two threads share the cache
  - Adding new replacement policies with user-specified RPFs
- Input
  - The circular sequence profile of each application
    - Can be generated by any simulator that follows the expected format
    - Extension code for SimpleScalar is provided
- Released
  - Available for download at http://www.ece.ncsu.edu/arpers
Prepare Input Files

[Figure: the ACAPP profiling extension in $HOME/acapp/addOnSimpleScalar/ (acappProfiler.c, acappProfiler.h, cache.c, sim-outorder.c, Makefile) is added to the SimpleScalar tree in $HOME/Simplesim-3.0/.]

Build and run the simulator to generate a circular sequence (.csq) profile:

  # cd $HOME/Simplesim-3.0
  # make
  # ./sim-outorder swim.ref.eio -max:2000000000
  ...<benchmark output>
  # ls swim_train.eio.*
  -rw-r--r-- swim.ref.eio.csq

Example .csq profile contents:

  #sets 1024
  #assoc 4
  #scaling_factor 4
  #block_size 64
  #cseq
  1 2 3 4 5 6 7 8 9 10 11 23 34 29 35 35
  #stackDist
  6877989 4199671 2083871 2653828 2327123
  3051944 5588497 5034757 2996018 794764
  39391 520 387 350 383 465 32532308

Startup

  # ./acapp -h

  ******** ACAPP TOOL HELP MENU ********
  General Usage:
    -h --- HELP MENU
  Prediction under varying cache associativity:
    -a <assoc> [<min assoc> <max assoc>] -f1 <profile1>
  Prediction under varying cache replacement policies:
    -p <rpindex> -f1 <profile1>
    -pA -f1 <profile1>
  Prediction under cache sharing:
    -c -f1 <profile1> -f2 <profile2>
  Adding new replacement policy: (requires 'usr_rp.in')
    -n (default) or
    -n <dx> <nxmin> <nxmax>
  Print supported replacement policies:
    -log
Prediction under varying cache associativity

  acapp -a <assoc> [<min assoc> <max assoc>] -f1 <profile1>

EXAMPLE
  # ./acapp -a 4 7 -f1 ./csq/benchmark1.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64
  Original miss rate: 0.768043
  Miss rate for A = 4: 0.768043
  Miss rate for A = 5: 0.733912
  Miss rate for A = 6: 0.689150
  Miss rate for A = 7: 0.607186

Note: benchmark1.csq represents the L2 profile of swim (ref input set); benchmark2.csq the L2 profile of apsi (ref input set); benchmark3.csq the L2 profile of ammp (ref input set).

Print supported replacement policies

  acapp -log

EXAMPLE
  # ./acapp -log

OUTPUT
  * * * * SUPPORTED REPLACEMENT POLICIES LOGFILE * * * *
  1 - NMRU4
  2 - NMRU1
  3 - LRUskw
  4 - MRUskw

[Figure: Prepl(i) plots over stack positions 1-8 for LRU, NMRU1, NMRU4, Rand-MRUskw, and Rand-LRUskw on an 8-way cache.]
Prediction under varying cache replacement policies

  acapp -p <rpindex> -f1 <profile1>

EXAMPLE
  # ./acapp -p 2 -f1 ./csq/benchmark1.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64

  Prediction Result for NMRU1
  LRU:  0.768043
  Pred: 0.739310

Prediction for all supported replacement policies

  acapp -pA -f1 <profile1>

EXAMPLE
  # ./acapp -pA -f1 ./csq/benchmark3.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 8
  Block size: 64

  1 - NMRU4
  Prediction Result for NMRU4
  LRU:  0.702653
  Pred: 0.406974
  *******************************
  2 - NMRU1
  Prediction Result for NMRU1
  LRU:  0.702653
  Pred: 0.331667
  *******************************
  3 - LRUskw
  Prediction Result for LRUskw
  LRU:  0.702653
  Pred: 0.376921
  *******************************
  4 - MRUskw
  Prediction Result for MRUskw
  LRU:  0.702653
  Pred: 0.201158
  *******************************
Prediction under cache contention

  acapp -c -f1 <profile1> -f2 <profile2>

EXAMPLE
  # ./acapp -c -f1 ./csq/benchmark1.csq -f2 ./csq/benchmark2.csq

OUTPUT
  The CSQ file is generated using the following cache parameters:
  Sets: 1024
  Associativity: 4
  Block size: 64
  ******** RESULTS ********
                        ./csq/benchmark1.csq   ./csq/benchmark2.csq
  Accesses:             68182266               11805186
  Predicted miss rate:  0.868560               0.338920
  Original miss rate:   0.768043               0.244366

Adding new replacement policy

Requires a 'usr_rp.in' file:

  #This is the configuration file of the user-specified replacement policy.
  #Please do not change the format of this file or the NAME, ASSOC, or PROB keywords.
  #Only the values on each line can be changed by the user.
  NAME newRp
  ASSOC 8
  PROB
  0.2 0.1 0.05 0.25 0.15 0.07 0.03 0.15

  acapp -n (default)

EXAMPLE
  # ./acapp -n

OUTPUT
  Creating new replacement policy...
  Coefficient file(s) added to: ./fine/newRp/d_0
  Coefficient file(s) added to: ./fine/newRp/d_1
  ...
  Coefficient file(s) added to: ./fine/newRp/d_11
  Replacement policy: '5 - newRp' Added Successfully!
Adding new replacement policy (continued)

Running ./acapp -n (the default) generates coefficient files that cover the most common combinations of d and n. If a benchmark's cseq profile contains unusual combinations of d and n (reported by the tool as "Missing Coefficient files"), the user can also generate coefficient files for specific d and n values:

  acapp -n <dx> <nxmin> <nxmax>

EXAMPLE
  # ./acapp -n 5 7 9

OUTPUT
  Adding coefficient file(s) to rp: newRp
  Coefficient file(s) added to directory: ./fine/newRp/d_5
  Coefficient files for Replacement policy: '5 - newRp' Added Successfully!

Acknowledgement

- Researchers
  - Fei Guo
  - Dhruba Chandra
  - Shaunak Joshi
  - Seongbeom Kim
- Funding Agencies
  - NSF
  - Intel
  - IBM
The Effects of Miss Clustering on the Cost of a Cache Miss

Phil Emma
Allan Hartstein
Thomas R. Puzak
Viji Srinivasan

Dept: Systems Technology and Microarchitecture
IBM T. J. Watson Research Center

Acknowledgments

Arthur Nadas, Jim Mitchell, Jane Bartik, Dan Prener, Peter Oden, Doug Logan, John Griswell, C R Attanasio, Danny Lynch, Moin Qureshi
How Do You Measure the Cost of a Cache Miss?

Pipeline spectroscopy is a new technique that allows classification (analysis) of single events in a processor's pipeline. The initial focus of the project was to develop a means to measure the cost of a cache miss, but spectroscopy leads to much greater insight into pipeline dynamics, including effects due to cache miss behavior, prefetching, pipeline recycles, branch prediction errors, and trailing-edge effects.

The cost of each miss is displayed as a histogram. The graphs are called spectrograms because they reveal certain signature features of the processor's memory hierarchy, the pipeline, and the miss pattern itself (the amount of overlap between misses in the miss cluster).

Processor performance (cycles/instruction) has two components:

    Cycles/Instruction = (Cycles/Instruction)_infinite-cache + (Cycles/Instruction)_finite-cache-adder

The first term is a figure of merit for the overall processor design; the second, the finite cache adder, is a figure of merit for the cache and memory hierarchy.

The memory-hierarchy component can itself be factored. Substituting

    (Cycles/Instruction)_finite-cache-adder = (Cycles/Miss) × (Misses/Instruction)

gives a way to calculate the miss penalty (cycles/miss):

    Cycles/Miss = [(Cycles/Instruction)_overall − (Cycles/Instruction)_infinite-cache] / (Misses/Instruction)

The numerator is the total memory effect; it depends on memory speed, line size, bussing, and the amount of overlap, as well as cache size and the replacement algorithm.
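A small numeric illustration of the decomposition above (the numbers are invented for the example, not taken from the slides): measure CPI with the real (finite) cache and with an idealized infinite cache; the difference is the finite cache adder, and dividing it by the miss rate gives the average cost of a miss.

```python
# Worked example of the miss-penalty calculation (assumed numbers).
cpi_finite   = 1.9       # cycles/instruction with the real memory hierarchy
cpi_infinite = 1.1       # cycles/instruction with an infinite cache
misses_per_instruction = 0.02

finite_cache_adder = cpi_finite - cpi_infinite           # 0.8 cycles/instruction
cycles_per_miss = finite_cache_adder / misses_per_instruction
print(cycles_per_miss)                                   # 40.0 cycles/miss
```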

Constructing Miss Spectrogram

Misses are grouped into clusters, and the run time associated with the instruction sequence that 'surrounds' each miss cluster is compared to the infinite-cache run time of the same instruction sequence. The difference between these two times is used to construct the miss spectrogram.

[Figure: the same instruction sequence I1 ... I5 on an infinite-cache timeline (time t1) and a finite-cache timeline (time t2), with the misses marked. A miss cluster's size is the number of misses that occur during one busy interval of the miss facility; the example contains clusters of size 1 and size 3.]

Summing the per-cluster differences recovers the total finite cache adder:

    Σ_i [(t2 − t1) for the ith cluster] = CYC_FC − CYC_IC

[Figure: example miss spectrograms for cluster sizes 1 through 4 on a hierarchy with L2 = 15 cycles, L3 = 75 cycles, and memory = 300 cycles latency. Each histogram plots the percent of misses versus the per-cluster cost in cycles (0-900); the number of peaks grows with the cluster size.]
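The sketch below shows how such a spectrogram can be assembled from paired simulations, following the construction above: one per-cluster cost (finite-cache time minus infinite-cache time of the surrounding instruction sequence), binned by cluster size. The cluster-record format here is an assumption for illustration, not the IBM tool's format.

```python
# Sketch: build a miss spectrogram from per-cluster timing records.
from collections import defaultdict

def build_spectrogram(clusters, bin_width=15):
    """clusters: iterable of (cluster_size, t_finite, t_infinite) per miss cluster.
    Returns {cluster_size: {cost_bin: count}}."""
    spectro = defaultdict(lambda: defaultdict(int))
    for size, t_fc, t_ic in clusters:
        cost = t_fc - t_ic                     # this cluster's finite-cache adder
        spectro[size][(cost // bin_width) * bin_width] += 1
    return spectro

# Sanity check: summing every cluster's cost reproduces CYC_FC - CYC_IC.
clusters = [(1, 130, 100), (2, 260, 200), (1, 118, 100)]
total_adder = sum(t_fc - t_ic for _, t_fc, t_ic in clusters)   # 108 cycles
print(dict(build_spectrogram(clusters)[1]), total_adder)
```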

Number of Peaks in a Miss Spectrogram

With L memory hierarchy levels and a miss cluster of size C, the number of peaks N in the spectrogram is

    N = C(C + L, C)    (the binomial coefficient "C+L choose C")

Pipeline Spectroscopy Tool Set

[Figure: the timer program emits decode/endop times and miss information (miss address / instruction address); a sort step and a report-output step turn these into the miss spectrogram, a miss/cost report, and a cost/analysis report.]

Miss/Cost Report and Cost Analysis Report

[Table: the miss/cost report lists, per miss cluster, the cluster number, cluster size, cost, the infimum and supremum of the cluster's interval, the miss address, the instruction address, and the instruction number. The cost analysis report lists the highest-count and highest-cost items by ASID and instruction address, with their counts or total cost and percentage of the total.]
Cycles per Miss and Cluster Size Analysis

[Figure: base case, fraction of misses by cluster size; most misses occur in small clusters, with the fraction falling off steadily toward cluster sizes around 20.]

[Figure: cycles per miss versus cluster size; the average cost per miss falls from roughly 70-80 cycles for isolated misses (cluster size 1) toward 20-30 cycles for large clusters, as overlapped misses share their latency.]

How do you use a spectrogram?

- Analyze a prefetching algorithm (hardware or software)
- Analyze a hardware design
- Analyze cluster patterns and the cost of a miss
- Develop a science, or theory, of misses
Theory: Can we predict the shape of a spectrogram (cluster size = 3, 4, or 5) from analyzing smaller miss clusters?

Observations:

- For a cluster of size 1, an access can hit or miss in the L2: H or M
- For a cluster of size 2, there are 2^2 = 4 possible outcomes: HH, HM, MH, MM
- For a cluster of size 3, there are 2^3 = 8 possible outcomes: HHH, HHM, HMH, HMM, MHH, MHM, MMH, MMM
- For a cluster of size 4, there are 2^4 = 16 possible outcomes: HHHH, ..., MMMM
- In general, for a cluster of size C there are 2^C possible outcomes

[Figure: measured spectrograms for cluster sizes 1 through 4 on a hierarchy with L2 = 15 cycles and memory = 100 cycles.]
Number of Peaks in a Spectrogram with Cluster Size C and N Hierarchy Levels

A cluster of C misses has 2^C hit/miss combinations; the number of peaks is the number of unique latency sums, obtained by determining the unique sums for N items chosen C at a time:

    Peaks = Σ_{i=0..C} multichoose(N, i)
          = C(N-1, 0) + C(N, 1) + C(N+1, 2) + C(N+2, 3) + ... + C(N+C-1, C)

using the relation multichoose(N, k) = C(N+k-1, k). The first two terms can be combined using Pascal's rule,

    C(N, k) = C(N-1, k-1) + C(N-1, k)        (1a)

giving

    = C(N+1, 1) + C(N+1, 2) + C(N+2, 3) + ... + C(N+C-1, C)

and applying (1a) repeatedly, the series collapses to

    Peaks = C(N + C, C)

[Figure: number of peaks versus cluster size (1-6) for 1, 2, and 3 memory hierarchy levels.]
Theory: Predicting the Shape of a Spectrogram from a Smaller Cluster Spectrogram

Observations (with the L2 configured to have 50/50% hit/miss probabilities):

If hits and misses in the L2 were independent, we could use the binomial distribution to describe H/M clusters. For example, if Pr[M] = p then Pr[H] = (1-p), and the probability of k misses in a cluster of size N is

    C(N, k) p^k (1-p)^(N-k)

So in a cluster of size 1, Pr[M] = Pr[H] = .50, and in a cluster of size 2,

    Pr[HH] = Pr[HM] = Pr[MH] = Pr[MM] = .25

The measured hit/miss probabilities for cluster sizes 2 and 3, however, are:

    Cluster = 2:   HH .306   HM .175   MH .188   MM .331
    Cluster = 3:   HHH .208  HHM .105  HMH .090  HMM .098  MHH .103  MHM .094  MMH .083  MMM .219
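For contrast with the measured values above, the snippet below computes what the independence assumption would predict; the gap (e.g., MM at .331 measured versus .25 under independence) is what motivates the correlation parameter introduced next.

```python
# What independent hits/misses would predict for cluster patterns (p = 0.5).
from math import comb
from itertools import product

def binomial_pattern_prob(pattern, p_miss=0.5):
    """Probability of an exact H/M pattern if accesses were independent."""
    k = pattern.count("M")
    return (p_miss ** k) * ((1 - p_miss) ** (len(pattern) - k))

print({"".join(s): binomial_pattern_prob("".join(s))
       for s in product("HM", repeat=2)})     # all four patterns -> 0.25
print(comb(3, 2) * 0.5**2 * 0.5**1)           # P[exactly 2 misses in a cluster of 3]
```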
[Figure: individual miss spectrograms for cluster size = 2 (HH, HM, MH, MM) and cluster size = 3 (HHH, HMH, MHM, MMM shown), with L1 = 64KB, L2 = 256KB at 15-cycle latency, and L3 at 100-cycle latency, for an OLTP workload. Each pattern's histogram is annotated with its measured overlap/no-overlap split: for cluster size 2, MM = 35/65, MH = 62/38, HM = 34/66, HH = 45/55; for cluster size 3, HHH = 12/48/40, HMH = 21/46/33, MHM = 19/13/45/23, MMM = 15/42/43.]
What is a?

Let Xi represent the ith hit or miss, with 0 representing a hit and 1 a miss, so Xi = M or Xi = H is equivalent to Xi = 1 or Xi = 0. Let Pr[M] = P and Pr[H] = (1-P). If hit and miss probabilities were independent, then Prob[HH] = Prob[HM] = Prob[MH] = Prob[MM] = .25 (for P = .5).

We seek a function that increases Pr[MM] above P×P (and, by the same technique, Pr[HH] above (1-P)×(1-P)). To increase Pr[X2=M | X1=M], define a, 0 ≤ a ≤ 1, and form the convex combination between P and 1:

    Pr[X2=M | X1=M] = aP + (1-a) = 1 - a + aP        so   Pr[MM] = P (1 - a + aP)

To increase Pr[X2=H | X1=H], form the convex combination between (1-P) and 1:

    Pr[X2=H | X1=H] = (1-P) a + (1-a) = 1 - aP        so   Pr[HH] = (1-P)(1 - aP)

These functions have the properties we want: for 0 ≤ a ≤ 1 and P ≤ 1, (1 - a + aP) ≥ P and (1 - aP) ≥ (1 - P).

The parameter a is tied to the correlation coefficient. By definition,

    q = COV(X1, X2) / sqrt(Var(X1) Var(X2))

with

    COV(X1, X2) = E[X1 X2] − E[X1] E[X2]
    E[X1 X2] = Pr[X1 X2 = 1] = Pr[X1 = 1, X2 = 1] = p (1 - a + ap)
    E[X1] = E[X2] = p
    COV(X1, X2) = p (1 - a + ap) − p^2 = (1-a) p (1-p)
    Var(X1) = Var(X2) = E[X1^2] − p^2 = p (1-p)

so

    q = (1-a) p (1-p) / [p (1-p)] = (1-a),   i.e.,   a = 1 − q

As q → 1 (the correlation between Xi and Xi+1 becomes stronger), a → 0 and Pr[X2=M | X1=M] = (1 - a + ap) → 1 and Pr[X2=H | X1=H] = (1 - aP) → 1: perfect correlation. As q → 0 (the correlation becomes weaker), a → 1 and Pr[X2=M | X1=M] → P and Pr[X2=H | X1=H] → (1-P): as if they were independent.
For example, the H/M sequences HMM, MHHM, and HMMMH then have probabilities

    Pr[HMM]   = (1-p)(ap)(1-a+ap)
    Pr[MHHM]  = (p)(a-ap)(1-ap)(ap)
    Pr[HMMMH] = (1-p)(ap)(a-ap)(1-a+ap)^2

The probability state transition matrix for the HH, HM, MH, and MM events is

                 Xi+1 = Hit     Xi+1 = Miss
    Xi = Hit     1 - ap         ap
    Xi = Miss    a - ap         1 - a + ap

Using the transition matrix we can generate a hit/miss probability for any cluster size: the probability of an H/M sequence of any length can be calculated with the generating function

    (1-p) (1-ap)^w (ap)^x (a-ap)^y (1-a+ap)^z    if the sequence starts with a hit
    (p)   (1-ap)^w (ap)^x (a-ap)^y (1-a+ap)^z    if the sequence starts with a miss

where w, x, y, z count the HH, HM, MH, and MM transitions and w + x + y + z = C − 1 (C = cluster size).

Find the value of a that is the 'best fit' to the cluster = 2 data, then generate the hit/miss sequence probabilities for larger clusters:

    Cluster = 2   HH      HM      MH      MM
    Real          .306    .175    .188    .331
    Predicted     .308    .182    .182    .328

    Cluster = 3   HHH     HHM     HMH     HMM     MHH     MHM     MMH     MMM
    Real          .208    .105    .090    .098    .103    .094    .083    .219
    Predicted     .193    .115    .065    .117    .115    .068    .117    .210
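The sketch below implements this first-order Markov model: it evaluates the pattern probability from the transition matrix, crudely fits a to the measured cluster-size-2 data by a grid search (the slides do not say which fitting procedure was used, so this is an assumption), and then predicts the cluster-size-3 pattern probabilities.

```python
# Sketch: fit the correlation knob a to cluster = 2 data and predict cluster = 3.
from itertools import product

def pattern_prob(pattern, p, a):
    """Probability of an H/M pattern under the transition matrix above."""
    prob = p if pattern[0] == "M" else (1 - p)
    trans = {("H", "H"): 1 - a * p, ("H", "M"): a * p,
             ("M", "H"): a - a * p, ("M", "M"): 1 - a + a * p}
    for prev, nxt in zip(pattern, pattern[1:]):
        prob *= trans[(prev, nxt)]
    return prob

measured2 = {"HH": .306, "HM": .175, "MH": .188, "MM": .331}
p = measured2["MH"] + measured2["MM"]           # marginal miss probability
candidates = [i / 1000.0 for i in range(1001)]  # coarse grid search for a
a = min(candidates, key=lambda a: sum((pattern_prob(k, p, a) - v) ** 2
                                      for k, v in measured2.items()))
print(round(a, 3))
print({"".join(s): round(pattern_prob("".join(s), p, a), 3)
       for s in product("HM", repeat=3)})       # predicted cluster-3 pattern probs
```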

How do we determine overlap?

Measure the overlap/no-overlap split for each hit/miss pattern at cluster size = 2.

[Figure: individual miss spectrograms for cluster size = 2 (L1 = 64KB, L2 = 256KB at 15-cycle latency, L3 at 100-cycle latency, OLTP data), annotated with the measured overlap fractions: MM = 35/65, MH = 62/38, HM = 34/66, HH = 45/55.]

[Figure: Kolmogorov-Smirnov test for the predicted hit/miss probabilities for clusters = 2, 3, 4, and 5; histograms of the number of workloads versus the maximum difference between the predicted and measured distributions (x-axis roughly 0.01-0.19).]
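The validation above reports, per workload, the maximum difference between predicted and measured distributions, which is a Kolmogorov-Smirnov-style statistic. A minimal sketch of that comparison over binned spectrograms is below; the dictionary-of-bins input format is our assumption.

```python
# Sketch: maximum CDF difference between a measured and a predicted spectrogram.
def ks_statistic(hist_measured, hist_predicted):
    """Both inputs: {bin: probability mass}. Returns max |CDF_m - CDF_p|."""
    bins = sorted(set(hist_measured) | set(hist_predicted))
    cdf_m = cdf_p = 0.0
    worst = 0.0
    for b in bins:
        cdf_m += hist_measured.get(b, 0.0)
        cdf_p += hist_predicted.get(b, 0.0)
        worst = max(worst, abs(cdf_m - cdf_p))
    return worst

print(ks_statistic({15: .6, 30: .4}, {15: .55, 30: .45}))   # 0.05
```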
[Figure: constructing a predicted miss spectrogram from the cluster = 2 statistics for the OLTP workload. Each cluster-size-2 pattern probability (HH .306, HM .175, MH .188, MM .331) is split by its overlap/no-overlap fraction (HH 45/55, HM 34/66, MH 62/38, MM 35/65) and its mass is placed at the corresponding overlapped or serialized latency (e.g., 15, 30, 100, 115, or 200 cycles).]

[Figure: overlap/no-overlap probability trees for the cluster-size-3 patterns HHH and MMH (OLTP data). Starting from the pattern probability (.193 for HHH, .117 for MMH), each level of the tree splits on the overlap/no-overlap fractions of successive miss pairs; the resulting leaf probabilities (.039, .048, .048, .059 for HHH and .025, .016, .047, .049 for MMH) are assigned to the corresponding latency sums.]
[Figure: predicted spectrograms for cluster sizes 3 and 4 (L1 = 64KB, L2 = 256KB at 15-cycle latency, L3 at 100-cycle latency, OLTP data), plotting the percent of misses versus the miss penalty in cycles.]

[Figure: Kolmogorov-Smirnov test for the predicted hit/miss probabilities for clusters = 2, 3, 4, and 5; histograms of the number of workloads versus the maximum difference between the predicted and measured spectrograms.]
Conclusions

- Agreement between theory and experiment is very good.
- The model uses six parameters:
  - the L2 hit/miss rate
  - a correlation parameter
  - 4 overlap/non-overlap parameters gleaned from the four H/M patterns at cluster size = 2
- Pipeline spectroscopy allows us to apply statistical probabilities to analyze miss behavior in a multilevel memory hierarchy.
- Pipeline spectroscopy allows quantitative analysis of individual miss penalties to considerable precision, and provides a means to quantitatively compare different prefetching schemes and memory hierarchies.
- The same analysis can be used to determine branch wrong-guess penalties, AGI penalties, software analysis, or run-ahead effects.
- Further analysis is needed to gain insight into overlap/non-overlap behavior, programming structure, and hardware limitations.
- Future work is needed to study in-order versus out-of-order spectrogram differences, SMT, and MLP organizations.