Multi Level Hierarchy: Memory Hierarchy Analysis

(Figure: Processor → L1 Cache → L2 Cache → L3 Cache → Main memory → Hard disk drive)

Level:        L1 Cache   L2 Cache    L3 Cache    Memory        Hard disk drive
Capacity:     512KB      2MB         32MB        4GB           1TB
Access time:  1 cycle    10 cycles   80 cycles   1000 cycles   100,000 cycles
              1 ns       10 ns       80 ns       1 micro sec   1 milli sec

Memories M_i: M_1, M_2, …, M_n
• Capacity s_i: s_1 < s_2 < … < s_n
• Unit cost c_i: c_1 > c_2 > … > c_n
• Total cost: C_total = Σ_i c_i · s_i
• Access time t_i = τ_1 + τ_2 + … + τ_i (τ_i is the time spent at level i), with τ_1 < τ_2 < … < τ_n
• Hit ratios h_i(s_i): h_1 < h_2 < … < h_n = 1
• Effective access time: T_eff = Σ_i m_i · h_i · t_i = Σ_i m_i · τ_i,
  where m_i = (1 − h_1)(1 − h_2) … (1 − h_{i−1}) is the fraction of accesses that reach level i (m_1 = 1)

Average Memory Access Time

AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

The expression cascades one level at a time; each Miss rate_Li is a local miss rate (misses at level i per access that reaches level i).
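A minimal sketch of this cascade in C (not from the slides): the hit times are the cycle counts from the figure above, while the local miss rates are illustrative assumptions.

#include <stdio.h>

/* Fold AMAT_i = hit_time_i + miss_rate_i * AMAT_(i+1) from the last
 * level back to L1; the last level is assumed to always hit. */
double amat(const double *hit_time, const double *miss_rate, int levels)
{
    double t = hit_time[levels - 1];
    for (int i = levels - 2; i >= 0; i--)
        t = hit_time[i] + miss_rate[i] * t;
    return t;
}

int main(void)
{
    double hit_time[]  = { 1.0, 10.0, 80.0, 1000.0 }; /* L1, L2, L3, memory (cycles) */
    double miss_rate[] = { 0.05, 0.20, 0.30 };        /* assumed local miss rates */
    printf("AMAT = %.2f cycles\n", amat(hit_time, miss_rate, 4)); /* prints 5.30 */
    return 0;
}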
Instruction | Data | Unified | Split
• Split vs. Unified:
  – Split allows specializing each part
  – Unified allows best use of the capacity
• On-chip | Off-chip:
  – on-chip: fast but small
  – off-chip: large but slow
• Single level | Multi level

Cache Design Questions
• Placement: what gets placed where?
• Read: when? from where? load order of bytes/words?
• Fetch: when to fetch a new block?
• Replacement: which one to evict?
• Write: when? to where?
Eleven Advanced Optimizations for Cache Performance
• Reducing hit time
• Reducing miss penalty
• Reducing miss rate
• Reducing miss penalty × miss rate

Ref: Section 5.2, Computer Architecture: A Quantitative Approach, Hennessy & Patterson, 4th Edition
Reducing Hit Time
• Small and simple caches
• Pipelined cache access
• Avoid time loss in address translation
  – Virtually indexed, physically tagged cache
    • simple and effective approach
    • possible only if the cache is not too large
  – Virtually addressed cache
    • protection? multiple processes? aliasing? I/O?

Small and Simple Caches
• Small size => faster access
• Small size => fits on the chip, lower delay
• Simple (direct mapped) => lower delay
• Second level: tags may be kept on chip
Pipelined Cache Access
• Multi-cycle cache access, but pipelined
• Reduces cycle time, but hit time is more than one cycle
• Pentium 4 takes 4 cycles
• Greater penalty on branch misprediction
• More clock cycles between issue of a load and use of its data
  – IF IF IF in the pipeline (instruction fetch occupies several stages)

Reducing Miss Penalty
• Multi-level caches
• Critical word first and early restart
• Giving priority to read misses over writes
• Victim caches
Multi Level Caches

(Memory hierarchy figure as before: Processor → L1 (512KB, 1 cycle) → L2 (2MB, 10 cycles) → L3 (32MB, 80 cycles) → memory (4GB, 1000 cycles) → hard disk drive (1TB, 100,000 cycles).)

Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
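As a worked instance (with assumed local miss rates of 5% for L1 and 20% for L2, and treating an L3 hit at 80 cycles as the end of the cascade):

Miss penalty_L1 = 10 + 0.20 × 80 = 26 cycles
AMAT = 1 + 0.05 × 26 = 2.3 cycles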
Read Miss Priority Over Write
• Provide write buffers
• Processor writes into the buffer and proceeds (for write-through as well as write-back)
• On a read miss:
  – wait for the buffer to be empty, or
  – check addresses in the buffer for a conflict

Victim Cache: Recycle bin/Dust bin
• Evicted blocks are recycled to the processor
• Much faster than getting a block from the next level
• Size = 1 to 5 blocks
• A significant fraction of misses may be found in the victim cache
(Figure: cache backed by a small victim cache, which is filled from memory.)
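A minimal software model of the lookup path, assuming a direct-mapped L1 backed by a 4-block victim cache (sizes and names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS   64      /* direct mapped: one block per set */
#define VC_BLOCKS 4       /* victim cache of 1-5 blocks, per the slide */

static uint32_t l1_tag[L1_SETS];
static bool     l1_valid[L1_SETS];
static uint32_t vc_tag[VC_BLOCKS];
static bool     vc_valid[VC_BLOCKS];

/* Returns true on a hit, either in L1 or recycled from the victim cache.
 * For simplicity the full block address stands in for the tag. */
bool access_block(uint32_t block_addr)
{
    uint32_t set = block_addr % L1_SETS;

    if (l1_valid[set] && l1_tag[set] == block_addr)
        return true;                      /* L1 hit */

    for (int i = 0; i < VC_BLOCKS; i++) {
        if (vc_valid[i] && vc_tag[i] == block_addr) {
            /* Victim hit: swap the block back into L1 (recycled to proc). */
            uint32_t evicted = l1_tag[set];
            bool had_block   = l1_valid[set];
            l1_tag[set]   = block_addr;
            l1_valid[set] = true;
            vc_tag[i]     = evicted;
            vc_valid[i]   = had_block;
            return true;
        }
    }

    /* Miss in both: fetch from the next level; the evicted L1 block
     * goes into the victim cache (simple rotating replacement). */
    static int next = 0;
    if (l1_valid[set]) {
        vc_tag[next]   = l1_tag[set];
        vc_valid[next] = true;
        next = (next + 1) % VC_BLOCKS;
    }
    l1_tag[set]   = block_addr;
    l1_valid[set] = true;
    return false;
}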
Reducing Miss Rate
• Large block size
• Larger cache
• Higher associativity
• Way prediction and pseudo-associative cache
• Compiler optimizations

Large Block Size
• Takes benefit of spatial locality
• Reduces compulsory misses
• Too large a block size: misses increase
• Miss penalty increases
Large Cache
• Reduces capacity misses
• Hit time increases
• Keep a small L1 cache and a large L2 cache

Higher Associativity
• Reduces conflict misses
• 8-way is almost like fully associative
• Hit time increases: what to do?
  – Pseudo-associativity
Compiler Optimizations
• Loop interchange
  – Improve spatial locality by scanning arrays row-wise (see the sketch below)
• Blocking
  – Improve temporal and spatial locality (a blocked version appears after Code III)

Improving Locality
Matrix multiplication example:
[C] = [A] × [B]
(L×M) (L×N) (N×M)
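A minimal C sketch (not from the slides) of what loop interchange does for a row-major array: both functions compute the same sum, but the interchanged version scans row-wise, so with the slide's 4-element blocks roughly three of every four accesses hit.

#define N 1024
double a[N][N];

double sum_column_major(void)   /* before interchange: column-wise, poor locality */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];       /* consecutive accesses are N elements apart */
    return s;
}

double sum_row_major(void)      /* after interchange: row-wise scan */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];       /* consecutive accesses share cache blocks */
    return s;
}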
Cache Organization for the Example
• Cache line (or block) = 4 matrix elements.
• Matrices are stored row-wise.
• Cache can't accommodate a full row/column.
  – L, M and N are large w.r.t. the cache size.
  – After an iteration along any of the three indices, when an element is accessed again, it results in a miss.
• Ignore misses due to conflicts between matrices.
  – As if there were a separate cache for each matrix.

Matrix Multiplication: Code I (i-j-k order)

for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

            C       A       B
accesses    LM      LMN     LMN
misses      LM/4    LMN/4   LMN

Total misses = LM/4 + LMN/4 + LMN = LM(5N+1)/4
Matrix Multiplication: Code II (k-i-j order)

for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A     B
accesses    LMN     LN    LMN
misses      LMN/4   LN    LMN/4

Matrix Multiplication: Code III (i-k-j order)

for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A      B
accesses    LMN     LN     LMN
misses      LMN/4   LN/4   LMN/4
Code to Run on Your PC
• Download the program code from
  http://jatinga.iitg.ernet.in/~asahu/cs528/MatrixMulIJK-MissCall-Valgrind.tgz
  – Run: $ sh Run.sh
  – or: $ ./Run.sh (need to change permissions first: chmod +rx Run.sh)
• This reports the L1 data-cache misses (LD1miss) for both codes (MatrixIJK and MatrixIKJ).
• Requires valgrind to be installed on your system.
  – It is preinstalled in most Linux distributions.

Reducing Miss Penalty × Miss Rate
• Non-blocking cache
• Hardware prefetching
• Compiler-controlled prefetching
Non-blocking Cache
(figure only in the original slide)

Hardware Prefetching
(figure only in the original slide)
Compiler-Controlled Pre-fetching
• Semantically invisible (no change in registers or cache contents)
• Makes sense if the processor doesn't stall while prefetching (non-blocking cache)
• Overhead of the prefetch instruction should not exceed the benefit

Two variants from the slide, an explicit prefetch instruction and an ordinary load used as a prefetch:

PreFetch(A[i]);  // prefetch instruction
STMT;
STMT;
STMT;
Y += A[i];       // using the data
Z = Y + K;

X = A[i];        // ordinary load acting as the prefetch
STMT;
STMT;
STMT;
Y += A[i];       // using the data
Z = Y + K;
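A runnable version of the first variant, assuming GCC or Clang: __builtin_prefetch issues a non-binding prefetch (no architectural change, matching the "semantically invisible" point above); DIST is an assumed prefetch distance, tuned to hide the memory latency.

#define DIST 16  /* assumed prefetch distance, in elements */

double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]); /* hint only: no register/cache-content change visible to the program */
        s += a[i];   /* using the data prefetched DIST iterations ago */
    }
    return s;
}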