Multi Level Hierarchy: Memory Hierarchy Analysis

(Figure: Processor → L1 Cache → L2 Cache → L3 Cache → Main memory → Hard disk drive)

Level:        L1 Cache   L2 Cache    L3 Cache    Memory        Hard disk drive
Capacity:     512KB      2MB         32MB        4GB           1TB
Access time:  1 cycle    10 cycles   80 cycles   1000 cycles   100,000 cycles
              1 ns       10 ns       80 ns       1 micro sec   1 milli sec

Memories M_i: M_1, M_2, …, M_n
• Capacity s_i: s_1 < s_2 < … < s_n
• Unit cost c_i: c_1 > c_2 > … > c_n
• Total cost: C_total = Σ_i c_i · s_i
• Access time t_i = τ_1 + τ_2 + … + τ_i (τ_i is the time spent at level i), with τ_1 < τ_2 < … < τ_n
• Hit ratios h_i(s_i): h_1 < h_2 < … < h_n = 1
• Effective access time: T_eff = Σ_i m_i · h_i · t_i = Σ_i m_i · τ_i,
  where m_i = (1 − h_1)(1 − h_2) … (1 − h_{i−1}) is the fraction of accesses that reach level i (m_1 = 1)

Average Memory Access Time

AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

The expression cascades one level at a time; each Miss rate_Li is a local miss rate (misses at level i per access that reaches level i).
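A minimal sketch of this cascade in C (not from the slides): the hit times are the cycle counts from the figure above, while the local miss rates are illustrative assumptions.

#include <stdio.h>

/* Fold AMAT_i = hit_time_i + miss_rate_i * AMAT_(i+1) from the last
 * level back to L1; the last level is assumed to always hit. */
double amat(const double *hit_time, const double *miss_rate, int levels)
{
    double t = hit_time[levels - 1];
    for (int i = levels - 2; i >= 0; i--)
        t = hit_time[i] + miss_rate[i] * t;
    return t;
}

int main(void)
{
    double hit_time[]  = { 1.0, 10.0, 80.0, 1000.0 }; /* L1, L2, L3, memory (cycles) */
    double miss_rate[] = { 0.05, 0.20, 0.30 };        /* assumed local miss rates */
    printf("AMAT = %.2f cycles\n", amat(hit_time, miss_rate, 4)); /* prints 5.30 */
    return 0;
}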
Instruction | Data | Unified | Split
• Split vs. Unified:
  – Split allows specializing each part
  – Unified allows best use of the capacity
• On-chip | Off-chip:
  – on-chip: fast but small
  – off-chip: large but slow
• Single level | Multi level

Cache Design Questions
• Placement: what gets placed where?
• Read: when? from where? load order of bytes/words?
• Fetch: when to fetch a new block?
• Replacement: which one to evict?
• Write: when? to where?
Eleven Advanced Optimizations for Cache Performance
• Reducing hit time
• Reducing miss penalty
• Reducing miss rate
• Reducing miss penalty × miss rate

Ref: Section 5.2, Computer Architecture: A Quantitative Approach, Hennessy & Patterson, 4th Edition
Reducing Hit Time
• Small and simple caches
• Pipelined cache access
• Avoid time loss in address translation
  – Virtually indexed, physically tagged cache
    • simple and effective approach
    • possible only if the cache is not too large
  – Virtually addressed cache
    • protection? multiple processes? aliasing? I/O?

Small and Simple Caches
• Small size => faster access
• Small size => fits on the chip, lower delay
• Simple (direct mapped) => lower delay
• Second level: tags may be kept on chip
Pipelined Cache Access
• Multi-cycle cache access, but pipelined
• Reduces cycle time, but hit time is more than one cycle
• Pentium 4 takes 4 cycles
• Greater penalty on branch misprediction
• More clock cycles between issue of a load and use of its data
  – IF IF IF in the pipeline (instruction fetch occupies several stages)

Reducing Miss Penalty
• Multi-level caches
• Critical word first and early restart
• Giving priority to read misses over writes
• Victim caches
Multi Level Caches

(Memory hierarchy figure as before: Processor → L1 (512KB, 1 cycle) → L2 (2MB, 10 cycles) → L3 (32MB, 80 cycles) → memory (4GB, 1000 cycles) → hard disk drive (1TB, 100,000 cycles).)

Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
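As a worked instance (with assumed local miss rates of 5% for L1 and 20% for L2, and treating an L3 hit at 80 cycles as the end of the cascade):

Miss penalty_L1 = 10 + 0.20 × 80 = 26 cycles
AMAT = 1 + 0.05 × 26 = 2.3 cycles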
Read Miss Priority Over Write
• Provide write buffers
• Processor writes into the buffer and proceeds (for write-through as well as write-back)
• On a read miss:
  – wait for the buffer to be empty, or
  – check addresses in the buffer for a conflict

Victim Cache: Recycle bin/Dust bin
• Evicted blocks are recycled to the processor
• Much faster than getting a block from the next level
• Size = 1 to 5 blocks
• A significant fraction of misses may be found in the victim cache
(Figure: cache backed by a small victim cache, which is filled from memory.)
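A minimal software model of the lookup path, assuming a direct-mapped L1 backed by a 4-block victim cache (sizes and names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS   64      /* direct mapped: one block per set */
#define VC_BLOCKS 4       /* victim cache of 1-5 blocks, per the slide */

static uint32_t l1_tag[L1_SETS];
static bool     l1_valid[L1_SETS];
static uint32_t vc_tag[VC_BLOCKS];
static bool     vc_valid[VC_BLOCKS];

/* Returns true on a hit, either in L1 or recycled from the victim cache.
 * For simplicity the full block address stands in for the tag. */
bool access_block(uint32_t block_addr)
{
    uint32_t set = block_addr % L1_SETS;

    if (l1_valid[set] && l1_tag[set] == block_addr)
        return true;                      /* L1 hit */

    for (int i = 0; i < VC_BLOCKS; i++) {
        if (vc_valid[i] && vc_tag[i] == block_addr) {
            /* Victim hit: swap the block back into L1 (recycled to proc). */
            uint32_t evicted = l1_tag[set];
            bool had_block   = l1_valid[set];
            l1_tag[set]   = block_addr;
            l1_valid[set] = true;
            vc_tag[i]     = evicted;
            vc_valid[i]   = had_block;
            return true;
        }
    }

    /* Miss in both: fetch from the next level; the evicted L1 block
     * goes into the victim cache (simple rotating replacement). */
    static int next = 0;
    if (l1_valid[set]) {
        vc_tag[next]   = l1_tag[set];
        vc_valid[next] = true;
        next = (next + 1) % VC_BLOCKS;
    }
    l1_tag[set]   = block_addr;
    l1_valid[set] = true;
    return false;
}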
Reducing Miss Rate
• Large block size
• Larger cache
• Higher associativity
• Way prediction and pseudo-associative cache
• Compiler optimizations

Large Block Size
• Takes benefit of spatial locality
• Reduces compulsory misses
• Too large a block size: misses increase
• Miss penalty increases
Large Cache
• Reduces capacity misses
• Hit time increases
• Keep a small L1 cache and a large L2 cache

Higher Associativity
• Reduces conflict misses
• 8-way is almost like fully associative
• Hit time increases: what to do?
  – Pseudo-associativity
Compiler Optimizations
• Loop interchange
  – Improve spatial locality by scanning arrays row-wise (see the sketch below)
• Blocking
  – Improve temporal and spatial locality (a blocked version appears after Code III)

Improving Locality
Matrix multiplication example:
[C] = [A] × [B]
(L×M) (L×N) (N×M)
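A minimal C sketch (not from the slides) of what loop interchange does for a row-major array: both functions compute the same sum, but the interchanged version scans row-wise, so with the slide's 4-element blocks roughly three of every four accesses hit.

#define N 1024
double a[N][N];

double sum_column_major(void)   /* before interchange: column-wise, poor locality */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];       /* consecutive accesses are N elements apart */
    return s;
}

double sum_row_major(void)      /* after interchange: row-wise scan */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];       /* consecutive accesses share cache blocks */
    return s;
}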
Cache Organization for the Example
• Cache line (or block) = 4 matrix elements.
• Matrices are stored row-wise.
• Cache can't accommodate a full row/column.
  – L, M and N are large w.r.t. the cache size.
  – After an iteration along any of the three indices, when an element is accessed again, it results in a miss.
• Ignore misses due to conflicts between matrices.
  – As if there were a separate cache for each matrix.

Matrix Multiplication: Code I (i-j-k order)

for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

            C       A       B
accesses    LM      LMN     LMN
misses      LM/4    LMN/4   LMN

Total misses = LM/4 + LMN/4 + LMN = LM(5N+1)/4
Matrix Multiplication: Code II (k-i-j order)

for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A     B
accesses    LMN     LN    LMN
misses      LMN/4   LN    LMN/4

Matrix Multiplication: Code III (i-k-j order)

for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A      B
accesses    LMN     LN     LMN
misses      LMN/4   LN/4   LMN/4
Code to Run on Your PC
• Download the program code from
  http://jatinga.iitg.ernet.in/~asahu/cs528/MatrixMulIJK-MissCall-Valgrind.tgz
  – Run: $ sh Run.sh
  – or: $ ./Run.sh (need to change permissions first: chmod +rx Run.sh)
• This reports the L1 data-cache misses (LD1miss) for both codes (MatrixIJK and MatrixIKJ).
• Requires valgrind to be installed on your system.
  – It is preinstalled in most Linux distributions.

Reducing Miss Penalty × Miss Rate
• Non-blocking cache
• Hardware prefetching
• Compiler-controlled prefetching
Non-blocking Cache
(figure only in the original slide)

Hardware Prefetching
(figure only in the original slide)
Compiler-Controlled Pre-fetching
• Semantically invisible (no change in registers or cache contents)
• Makes sense if the processor doesn't stall while prefetching (non-blocking cache)
• Overhead of the prefetch instruction should not exceed the benefit

Two variants from the slide, an explicit prefetch instruction and an ordinary load used as a prefetch:

PreFetch(A[i]);  // prefetch instruction
STMT;
STMT;
STMT;
Y += A[i];       // using the data
Z = Y + K;

X = A[i];        // ordinary load acting as the prefetch
STMT;
STMT;
STMT;
Y += A[i];       // using the data
Z = Y + K;
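A runnable version of the first variant, assuming GCC or Clang: __builtin_prefetch issues a non-binding prefetch (no architectural change, matching the "semantically invisible" point above); DIST is an assumed prefetch distance, tuned to hide the memory latency.

#define DIST 16  /* assumed prefetch distance, in elements */

double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]); /* hint only: no register/cache-content change visible to the program */
        s += a[i];   /* using the data prefetched DIST iterations ago */
    }
    return s;
}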