
EEE 415 - Microprocessors and Embedded Systems
Micro Architecture: Advanced Architectures, Memory
Lecture 7.1, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology


Advanced Microarchitecture
• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• Multithreading
• Multiprocessors



Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
• Pipeline hazards
• Sequencing overhead
• Power
• Cost


Branch Prediction
• Guess whether branch will be taken
• Backward branches are usually taken (loops)
• Consider history to improve guess
• Good prediction reduces fraction of
branches requiring a flush


Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
• Check direction of branch (forward or backward)
• If backward, predict taken
• Else, predict not taken
• Dynamic branch prediction:
• Keep history of last several hundred (or thousand)
branches in branch target buffer, record:
– Branch destination
– Whether branch was taken


Branch Prediction Example


      MOV R1, #0        ; R1 = sum
      MOV R0, #0        ; R0 = i
FOR                     ; for (i=0; i<10; i=i+1)
      CMP R0, #10
      BGE DONE
      ADD R1, R1, R0    ; sum = sum + i
      ADD R0, R0, #1
      B FOR
DONE


1-Bit Branch Predictor


• Remembers whether branch was taken the
last time and does the same thing
• Mispredicts first and last branch of loop


2-Bit Branch Predictor

[State diagram: four states - strongly taken, weakly taken, weakly not taken, strongly not taken; the two "taken" states predict taken, the two "not taken" states predict not taken; each taken outcome moves one state toward strongly taken, each not-taken outcome moves one state toward strongly not taken]
Only mispredicts last branch of loop
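
As an illustration (a sketch, not code from the textbook), the 2-bit saturating counter can be written in a few lines of C; the branch being modeled is the loop-exit branch (BGE DONE) from the earlier example:

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken
   (illustrative sketch, not the textbook's code) */
typedef struct { unsigned state; } predictor2_t;

static bool predict(const predictor2_t *p) {
    return p->state >= 2;                        /* the two "taken" states predict taken */
}

static void update(predictor2_t *p, bool taken) {
    if (taken  && p->state < 3) p->state++;      /* strengthen toward strongly taken     */
    if (!taken && p->state > 0) p->state--;      /* weaken toward strongly not taken     */
}

int main(void) {
    predictor2_t bp = { .state = 2 };            /* arbitrary start: weakly taken        */
    for (int run = 1; run <= 2; run++) {
        int mispredicts = 0;
        for (int i = 0; i < 11; i++) {           /* BGE DONE: not taken 10x, taken once  */
            bool actual = (i == 10);
            if (predict(&bp) != actual) mispredicts++;
            update(&bp, actual);
        }
        printf("run %d: %d mispredictions\n", run, mispredicts);
    }
    return 0;
}

After the first pass warms the counter up, only the final (loop-exit) branch is mispredicted on each later pass, which is the behaviour stated above.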


Superscalar
• Multiple copies of datapath execute
multiple instructions at once
• Dependencies make it tricky to issue
multiple instructions at once
[Figure: two-way superscalar datapath - the instruction memory fetches two instructions per cycle, a multiported register file supplies operands to two ALUs, and the data memory serves two accesses per cycle]


Superscalar Example
Ideal IPC: 2
Actual IPC: 2

LDR R8,  [R0, #40]
ADD R9,  R1, R2
SUB R10, R1, R3
AND R11, R3, R4
ORR R12, R1, R5
STR R5,  [R0, #80]

[Pipeline diagram: the six independent instructions issue in pairs, two per cycle, with no stalls]


Superscalar with Dependencies

Ideal IPC: 2
Actual IPC: 6/5 = 1.2

LDR R8,  [R0, #40]
ADD R9,  R8, R1
SUB R8,  R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7,  [R11, #80]

[Pipeline diagram: the dependences on R8 and R11 force stalls, so the six instructions need five issue cycles]


Advanced Architecture Techniques


• Multithreading
• Word processor: thread for typing, spell checking, printing
• Multiprocessors
• Multiple processors (cores) on a single chip


Threading: Definitions
• Process: program running on a computer
• Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
• Thread: part of a program
• Each process has multiple threads: e.g., a
word processor may have threads for typing,
spell checking, printing
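
A software-level sketch of the same idea (illustrative only, not from the slides): one process creating several POSIX threads, one per activity of the word-processor example. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical workers: in a real word processor these would run typing,
   spell checking, and printing; here each thread just announces itself. */
static void *worker(void *arg) {
    printf("thread for %s running\n", (const char *)arg);
    return NULL;
}

int main(void) {
    const char *tasks[] = { "typing", "spell checking", "printing" };
    pthread_t tid[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, worker, (void *)tasks[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);   /* all three threads share this process's address space */
    return 0;
}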


Threads in Conventional Processor


• One thread runs at once
• When one thread stalls (for example,
waiting for memory):
• Architectural state of that thread stored
• Architectural state of waiting thread loaded into
processor and it runs
• Called context switching

• Appears to user like all threads running


simultaneously


Multithreading

• Multiple copies of architectural state


• Multiple threads active at once:
• When one thread stalls, another runs immediately
• If one thread can’t keep all execution units busy, another
thread can use them
• Does not increase instruction-level parallelism
(ILP) of single thread, but increases throughput
Intel calls this “hyperthreading”


Multiprocessors

• Multiple processors (cores) with a method of


communication between them
• Types:
• Homogeneous: multiple cores with shared main memory
• Heterogeneous: separate cores for different tasks (for
example, DSP and CPU in cell phone)
• Clusters: each core has own memory system


Topics Covered

• Sarah Harris Chapter 7:


• 7.1 Introduction
• 7.2 Performance Analysis
• 7.3 Single Cycle Processor
• 7.4 Multicycle Processor
• 7.5 Pipelining
• 7.7 – Selected Topics:
• 7.7.1 Deep Pipelining (Super pipeline)
• 7.7.3 Branch Prediction (Branch Prediction)
• 7.7.4 Super Scalar Processor (Super Scalar Processing)
• 7.7.7 Multi-threading (Multi core computing)
• 7.7.8 Multi Processor (Multi core computing)


EEE 415 - Microprocessors and Embedded Systems
Micro Architecture: Advanced Architectures, Memory
Lecture 7.1, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

Topics

• Introduction
• Memory System Performance Analysis
• Caches
• Virtual Memory
• Memory-Mapped I/O
• Summary


Reference
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)
• Paging - Harris 8.4.1-8.4.2
• Translation Lookaside Buffer- Harris 8.4.3
• Direct Memory Access - Zhu 19.1 (just the concept)


Introduction
Computer performance depends on:
– Processor performance
– Memory system performance
Memory Interface
[Figure: the processor sends Address, WriteData, and MemWrite (WE) to the memory and receives ReadData; both are clocked by CLK]


Processor-Memory Gap
In prior chapters, we assumed a memory access takes 1 clock cycle – but that hasn't been true since the 1980s


Memory System Challenge


• Make memory system appear as fast as processor
• Use hierarchy of memories
• Ideal memory:
– Fast
– Cheap (inexpensive)
– Large (capacity)

But can only choose two!


Memory Hierarchy

Technology   Price / GB   Access Time (ns)   Bandwidth (GB/s)
SRAM         $10,000      1                  25+
DRAM         $10          10 - 50            10
SSD          $1           100,000            0.5
HDD          $0.1         10,000,000         0.1

[Hierarchy: Cache (SRAM) -> Main Memory (DRAM) -> Virtual Memory (SSD/HDD); speed decreases and capacity increases moving down]


Locality
Exploit locality to make memory accesses fast
• Temporal Locality:
• Locality in time
• If data used recently, likely to use it again soon
• How to exploit: keep recently accessed data in higher levels of
memory hierarchy
• Spatial Locality:
• Locality in space
• If data used recently, likely to use nearby data soon
• How to exploit: when access data, bring nearby data into higher
levels of memory hierarchy too


Memory Performance
• Hit: data found in that level of memory hierarchy
• Miss: data not found (must go to next level)
Hit Rate = # hits / # memory accesses
= 1 – Miss Rate
Miss Rate = # misses / # memory accesses
= 1 – Hit Rate
• Average memory access time (AMAT): average time for
processor to access data
AMAT = t_cache + MR_cache × [t_MM + MR_MM × t_VM]


Memory Performance Example 1


• A program has 2,000 loads and stores
• 1,250 of these data values in cache
• Rest supplied by other levels of memory hierarchy
• What are the hit and miss rates for the cache?

Hit Rate = 1250/2000 = 0.625


Miss Rate = 750/2000 = 0.375 = 1 – Hit Rate


Memory Performance Example 2


• Suppose processor has 2 levels of hierarchy: cache
and main memory
• tcache = 1 cycle, tMM = 100 cycles
• What is the AMAT of the program from Example 1?

AMAT = t_cache + MR_cache × t_MM
     = [1 + 0.375(100)] cycles
     = 38.5 cycles
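
A quick arithmetic check of Examples 1 and 2 (an illustrative sketch, not part of the slides):

#include <stdio.h>

/* AMAT for a two-level hierarchy (cache backed by main memory) */
static double amat(double t_cache, double mr_cache, double t_mm) {
    return t_cache + mr_cache * t_mm;
}

int main(void) {
    double mr = 750.0 / 2000.0;                            /* miss rate from Example 1 = 0.375 */
    printf("AMAT = %.1f cycles\n", amat(1.0, mr, 100.0));  /* 1 + 0.375 * 100 = 38.5           */
    return 0;
}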


Amdahl's Law
• Amdahl's Law: the effort spent increasing the performance of a subsystem is wasted unless the subsystem affects a large percentage of overall performance

S_latency(s) = 1 / [(1 - p) + p/s]

• S_latency is the theoretical speedup of the execution of the whole task;
• s is the speedup of the part of the task that benefits from improved system resources;
• p is the proportion of execution time that the part benefiting from improved resources originally occupied.

[Photo: Gene Myron Amdahl (1922–2015)]
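
For example (illustrative numbers, not from the slides), speeding up a part that accounts for 50% of execution time by 10× gives an overall speedup of only about 1.8×:

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction p of the runtime is sped up by s */
static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* illustrative numbers: p = 0.5, s = 10 */
    printf("S_latency = %.2f\n", amdahl(0.5, 10.0));   /* about 1.82 */
    return 0;
}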

EEE 415 - Microprocessors and Embedded Systems
Memory: Cache
Lecture 7.2, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

Cache
• Highest level in memory hierarchy
• Fast (typically ~1 cycle access time)
• Ideally supplies most data to processor
• Usually holds most recently accessed data

[Figure: hierarchy pyramid - Cache, Main Memory, Virtual Memory; speed decreases and capacity increases moving down]


Cache Design Questions


• What data is held in the cache?
• How is data found?
• What data is replaced?

Focus on data loads, but stores follow same principles


What data is held in the cache?

• Ideally, cache anticipates needed data and puts it in cache


• But impossible to predict future
• Use past to predict future – temporal and spatial locality:
• Temporal locality: copy newly accessed data into cache
• Spatial locality: copy neighboring data into cache too


Cache Operation
• When the cache client (a CPU, web browser, operating system)
needs to access data presumed to exist in the backing store, it first
checks the cache. If an entry can be found with a tag matching that
of the desired data, the data in the entry is used instead. This
situation is known as a cache hit.
• When the cache is checked and found not to contain any entry with the desired tag, the result is known as a cache miss. This requires a more expensive access of data from the backing store. Once the requested data is retrieved, it is typically copied into the cache, ready for the next access.
• During a cache miss, some other previously existing cache entry is
removed in order to make room for the newly retrieved data.

Cache Policy
When a system writes data to cache, it must at some point write
that data to the backing store as well. The timing of this write is
controlled by what is known as the write policy.

There are two basic writing approaches:[3]


•Write-through: write is done synchronously both to the cache and
to the backing store.
•Write-back (also called write-behind): initially, writing is done only
to the cache. The write to the backing store is postponed until the
modified content is about to be replaced by another cache block.


Cache Policy: Write-miss Policy


Since no data is returned to the requester on write operations, a
decision needs to be made on write misses, whether or not data
would be loaded into the cache. This is defined by these two
approaches:
•Write allocate (also called fetch on write): data at the missed-write
location is loaded to cache, followed by a write-hit operation. In this
approach, write misses are similar to read misses.
•No-write allocate (also called write-no-allocate or write around):
data at the missed-write location is not loaded to cache, and is
written directly to the backing store. In this approach, data is
loaded into the cache on read misses only.

Both write-through and write-back policies can use either of write


allocate or no-write allocate write-miss policies, but usually they are
paired in this way:
• A write-back cache uses write allocate, hoping for subsequent writes
(or even reads) to the same location, which is now cached.
• A write-through cache uses no-write allocate. Here, subsequent
writes have no advantage, since they still need to be written directly
to the backing store.
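
A minimal sketch of the two store paths for a single hypothetical cache block (names and structure are assumptions, not the textbook's implementation):

#include <stdbool.h>

/* One cache block with a dirty bit; backing_store stands in for main memory. */
typedef struct { int data; bool valid; bool dirty; } block_t;
static int backing_store;

/* Write-through: update the cache and the backing store on every store */
static void store_write_through(block_t *blk, int value) {
    blk->data = value;
    blk->valid = true;
    backing_store = value;              /* synchronous write to memory          */
}

/* Write-back: update only the cache and mark the block dirty */
static void store_write_back(block_t *blk, int value) {
    blk->data = value;
    blk->valid = true;
    blk->dirty = true;                  /* memory write deferred until eviction */
}

static void evict(block_t *blk) {
    if (blk->valid && blk->dirty)
        backing_store = blk->data;      /* single deferred write on eviction    */
    blk->valid = blk->dirty = false;
}

int main(void) {
    block_t b = {0};
    store_write_back(&b, 1);
    store_write_back(&b, 2);
    store_write_back(&b, 3);            /* three stores, no memory traffic yet  */
    evict(&b);                          /* one write to backing_store           */
    store_write_through(&b, 4);         /* this one writes memory immediately   */
    return 0;
}

With write-back, the three stores above cause a single memory write at eviction; with write-through each store would have written memory.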


Write-Back vs. Write-Through

[Figure: flowcharts comparing the write-back and write-through policies]

Cache Terminology
• Capacity (C):
• number of data bytes in cache
• Block size (b):
• bytes of data brought into cache at once
• Number of blocks (B = C/b):
• number of blocks in cache: B = C/b
• Degree of associativity (N):
• number of blocks in a set
• Number of sets (S = B/N):
• each memory address maps to exactly one cache set


How is data found?


• Cache organized into S sets
• Each memory address maps to exactly one set
• Caches categorized by # of blocks in a set:
• Direct mapped: 1 block per set
• N-way set associative: N blocks per set
• Fully associative: all cache blocks in 1 set
• Examine each organization for a cache with:
• Capacity (C = 8 words)
• Block size (b = 1 word)
• So, number of blocks (B = 8)


Example Cache Parameters


• C = 8 words (capacity)
• b = 1 word (block size)
• So, B = 8 (# of blocks)

Ridiculously small, but will illustrate organizations


Direct Map
• 32-bit byte address; the 2 LSBs (byte offset) are the same (00) for word accesses
• 30-bit word address
• If the number of sets is S, log2(S) bits select the set
• Remaining bits form the Tag

• For the example: 27-bit Tag, 3-bit Set
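
For this example cache the field extraction is a couple of shifts and masks; an illustrative sketch (not from the textbook):

#include <stdio.h>

/* Direct mapped, C = 8 words, b = 1 word: 2-bit byte offset, 3 set bits, 27-bit tag */
int main(void) {
    unsigned addr = 0x14;                       /* example byte address              */
    unsigned byte_offset = addr & 0x3;          /* bits [1:0]                        */
    unsigned set         = (addr >> 2) & 0x7;   /* bits [4:2] select one of 8 sets   */
    unsigned tag         = addr >> 5;           /* remaining 27 bits                 */
    printf("byte offset = %u, set = %u, tag = 0x%x\n", byte_offset, set, tag);
    return 0;
}

Address 0x14 lands in set 5, which matches the worked problem later in the deck.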



Direct Mapped Cache


Direct Mapped Cache Hardware

[Figure: the memory address splits into a 27-bit Tag, a 3-bit Set, and a 2-bit Byte Offset (00); the set bits index an 8-entry x (1+27+32)-bit SRAM holding Valid, Tag, and Data; the stored tag is compared with the address tag to generate Hit, and the 32-bit Data is output]


Direct Mapped Cache Performance

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #4]
      LDR R3, [R1, #12]
      LDR R4, [R1, #8]
      SUB R0, R0, #1
      B LOOP
DONE

[Figure: address 0x...04 = Tag 00...00, Set 001, Byte Offset 00; after the loop, sets 1, 2, and 3 hold mem[0x00...04], mem[0x00...08], and mem[0x00...0C]; all other sets are invalid]

Miss Rate = ?


Direct Mapped Cache Performance

[Same code and cache contents as the previous slide]

Miss Rate = 3/15 = 20%
The three misses are compulsory misses; the repeated hits come from temporal locality.
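
The miss count can be reproduced with a small simulation; the sketch below (not from the textbook) models the 8-set, one-word-block cache and replays the loop's fifteen loads:

#include <stdio.h>
#include <stdbool.h>

#define SETS 8

typedef struct { bool valid; unsigned tag; } line_t;

int main(void) {
    line_t cache[SETS] = {0};
    unsigned loads[] = { 0x4, 0xC, 0x8 };       /* LDRs at [R1,#4], [R1,#12], [R1,#8] */
    int misses = 0, accesses = 0;

    for (int i = 0; i < 5; i++) {               /* the loop body runs 5 times         */
        for (int j = 0; j < 3; j++) {
            unsigned set = (loads[j] >> 2) % SETS;
            unsigned tag = loads[j] >> 5;
            accesses++;
            if (!cache[set].valid || cache[set].tag != tag) {
                misses++;                       /* compulsory misses on iteration 1   */
                cache[set].valid = true;
                cache[set].tag = tag;
            }
        }
    }
    printf("miss rate = %d/%d\n", misses, accesses);   /* prints 3/15 (20%) */
    return 0;
}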


Direct Mapped Cache: Conflict

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #0x4]
      LDR R3, [R1, #0x24]
      SUB R0, R0, #1
      B LOOP
DONE

[Figure: address 0x...24 = Tag 00...01, Set 001, Byte Offset 00; addresses 0x4 and 0x24 both map to set 1, so the single entry alternates between mem[0x00...04] and mem[0x00...24]]

Miss Rate = ?


Direct Mapped Cache: Conflict

[Same code and cache contents as the previous slide]

Miss Rate = 10/10 = 100%
Conflict Misses: 0x4 and 0x24 evict each other on every access


N-Way Set Associative Cache

[Figure: two-way set associative cache - the address splits into a 28-bit Tag, a 2-bit Set, and a Byte Offset (00); Way 1 and Way 0 each store Valid, Tag, and Data; both stored tags are compared with the address tag, Hit1 and Hit0 select the matching way's 32-bit data, and Hit = Hit1 OR Hit0]


N-Way Set Associative Performance

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #0x4]
      LDR R3, [R1, #0x24]
      SUB R0, R0, #1
      B LOOP
DONE

Miss Rate = ?

[Figure: two-way cache with 4 sets (Way 1 and Way 0, each with V, Tag, Data), initially all entries invalid]


N-Way Set Associative Performance

[Same code as the previous slide]

Miss Rate = 2/10 = 20%
Associativity reduces conflict misses

[Figure: set 1 holds mem[0x00...24] (Tag 00...10) in Way 1 and mem[0x00...04] (Tag 00...00) in Way 0; all other sets are invalid]


Fully Associative Cache

• Any memory address can map to any cache block

[Figure: a single set of B blocks, each with V, Tag, and Data fields]

• Reduces conflict misses
• Expensive to build (one tag comparator per block)


Spatial Locality?
• Increase block size:
  • Block size, b = 4 words
  • C = 8 words
  • Direct mapped (1 block per set)
  • Number of blocks, B = 2 (C/b = 8/4 = 2)

[Figure: the address splits into a 27-bit Tag, 1 Set bit, a 2-bit Block Offset, and a 2-bit Byte Offset (00); each of the 2 sets holds V, Tag, and four 32-bit words; the block offset selects one of the four words through a multiplexer, and tag comparison generates Hit]


Cache with Larger Block Size

[Figure: same hardware as the previous slide - 27-bit Tag, 1 Set bit, 2-bit Block Offset, 2-bit Byte Offset]


Direct Mapped Cache Performance

ARM assembly code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #4]
      LDR R3, [R1, #12]
      LDR R4, [R1, #8]
      SUB R0, R0, #1
      B LOOP
DONE

Miss Rate = ?


Direct Mapped Cache Performance

[Same code as the previous slide]

Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality

[Figure: address 0x...0C = Tag 00...00, Set 0, Block Offset 11, Byte Offset 00; set 0 holds mem[0x00...00] through mem[0x00...0C] in one block, set 1 is invalid]


Cache Organization Recap


• Capacity: C
• Block size: b
• Number of blocks in cache: B = C/b
• Number of blocks in a set: N
• Number of sets: S = B/N
Organization             Number of Ways (N)   Number of Sets (S = B/N)
Direct Mapped            1                    B
N-Way Set Associative    1 < N < B            B/N
Fully Associative        B                    1


Capacity Misses
• Cache is too small to hold all data of interest at once
• If cache full: program accesses data X & evicts data Y
• Capacity miss when access Y again
• How to choose Y to minimize chance of needing it again?
• Least recently used (LRU) replacement: the least
recently used block in a set evicted


Types of Misses

• Compulsory: first time data accessed


• Capacity: cache too small to hold all
data of interest
• Conflict: data of interest maps to same
location in cache

Miss penalty: time it takes to retrieve a block from


lower level of hierarchy


LRU Replacement
ARM Assembly Code
      MOV R0, #0
      LDR R1, [R0, #4]
      LDR R2, [R0, #0x24]
      LDR R3, [R0, #0x54]

[Figure: two-way cache with 4 sets; Way 1 has V, U, Tag, Data and Way 0 has V, Tag, Data; initially all entries invalid]

LRU = Least Recently Used



LRU Replacement
ARM Assembly Code
      MOV R0, #0
      LDR R1, [R0, #4]
      LDR R2, [R0, #0x24]
      LDR R3, [R0, #0x54]

(a) After the first two loads, set 1 holds:
    Way 1: V=1, U=0, Tag=00...010, mem[0x00...24]    Way 0: V=1, Tag=00...000, mem[0x00...04]

(b) After the load from 0x54 replaces the least recently used way (Way 0), set 1 holds:
    Way 1: V=1, U=1, Tag=00...010, mem[0x00...24]    Way 0: V=1, Tag=00...101, mem[0x00...54]

LRU = Least Recently Used
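
The use-bit bookkeeping can be sketched as follows; this is illustrative C, and the U-bit convention (pointing at the least recently used way) is an assumption taken from the figure:

#include <stdio.h>
#include <stdbool.h>

/* One set of a two-way cache; u names the least recently used way. */
typedef struct { bool valid[2]; unsigned tag[2]; int u; } set_t;

static void cache_access(set_t *s, unsigned tag) {
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) {  /* hit: the other way becomes LRU */
            s->u = 1 - w;
            return;
        }
    int victim = s->u;                          /* miss: replace the LRU way      */
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->u = 1 - victim;
}

int main(void) {
    set_t set1 = {0};                           /* 4-set cache, b = 1 word: tag = addr >> 4  */
    cache_access(&set1, 0x04 >> 4);             /* miss, fills Way 0                         */
    cache_access(&set1, 0x24 >> 4);             /* miss, fills Way 1; U = 0 as in (a)        */
    cache_access(&set1, 0x54 >> 4);             /* miss, evicts Way 0 (LRU); U = 1 as in (b) */
    printf("U = %d\n", set1.u);
    return 0;
}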



Cache Summary
• What data is held in the cache?
• Recently used data (temporal locality)
• Nearby data (spatial locality)
• How is data found?
• Set is determined by address of data
• Word within block also determined by address
• In associative caches, data could be in one of
several ways
• What data is replaced?
• Least-recently used way in the set


Miss Rate Trends


• Bigger caches reduce capacity misses
• Greater associativity reduces conflict misses

Adapted from Patterson & Hennessy, Computer Architecture: A Quantitative Approach, 2011

Miss Rate Trends

• Bigger blocks reduce compulsory misses


• Bigger blocks increase conflict misses


Multilevel Caches
• Larger caches have lower miss rates, longer
access times
• Expand memory hierarchy to multiple levels of
caches
• Level 1: small and fast (e.g. 16 KB, 1 cycle)
• Level 2: larger and slower (e.g. 256 KB, 2-6 cycles)
• Most modern PCs have L1, L2, and L3 cache
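
Extending the AMAT expression to two cache levels gives AMAT = t_L1 + MR_L1 × (t_L2 + MR_L2 × t_MM); a small sketch with illustrative numbers (not taken from this slide):

#include <stdio.h>

/* AMAT with an L1 and L2 cache in front of main memory */
static double amat2(double t_l1, double mr_l1,
                    double t_l2, double mr_l2, double t_mm) {
    return t_l1 + mr_l1 * (t_l2 + mr_l2 * t_mm);
}

int main(void) {
    /* assumed numbers: 1-cycle L1 missing 5% of the time,
       10-cycle L2 missing 20% of its accesses, 100-cycle main memory */
    printf("AMAT = %.1f cycles\n", amat2(1, 0.05, 10, 0.20, 100));   /* 2.5 cycles */
    return 0;
}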


Intel Pentium III Die


Multi-Core Intel – Shared Cache


EEE 415 - Microprocessors and Embedded Systems
Memory: Cache - Problems
Lecture 7.3, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

• AMAT = 1 + 0.1 (100) = 11


• Each memory access checks the L1 cache.


• When the L1 cache misses (5% of the time), the processor checks the L2 cache.
• When the L2 cache misses (20% of the time), the processor fetches the data
from MM

• AMAT = 1 + (5/100) × [10 + (20/100) × 100] = 2.5


AMAT = 1 + 100*m

→ m = (AMAT – 1)/100

→ m = (1.5-1)/100 = 0.5%


• 0x14 = 0b00010100

• Word aligned, so ignore 2 LSBs

• Word maps to set 5

• Any address with bits 5,4,3 as 101 will map to this set, e.g.: 0x34, 0x54, 0x74


• 1024 sets require log2(2^10) = 10 bits.

• Two LSBs are byte offset

• 32-10-2 = 20 bits form tag


3 × 5 = 15 memory accesses
0x4 = 00000100
0xC = 00001100
0x8 = 00001000

Final cache contents:
1  0...000  mem[0x00...0C]
1  0...000  mem[0x00...08]
1  0...000  mem[0x00...04]


• Memory addresses 0x4 and 0x24 both


map to set 1.

• During the initial execution of the loop,


data at address 0x4 is loaded into set 1 of
the cache. Then data at address 0x24 is
loaded into set 1, evicting the data from
address 0x4.

Upon the second execution of the loop, the pattern repeats and the cache must refetch data at address 0x4, evicting data from address 0x24. The two addresses conflict, so the miss rate is 100%.



• Memory addresses 0x4 and 0x24 both map to set 1. A two-way cache can accommodate data from both addresses.

  Way 1: 1  00...010  mem[0x00...24]    Way 0: 1  00...000  mem[0x00...04]

• During the first loop iteration, the empty cache misses both addresses and loads both words of data into the two ways of set 1.
• On the next four iterations, the cache hits. Hence, the miss rate is 2/10 = 20%


On the first loop iteration, the cache misses on the access to memory address 0x4. This access loads data at addresses 0x0 through 0xC into the cache block.

  1  00...00  mem[0x00...0C]  mem[0x00...08]  mem[0x00...04]  mem[0x00...00]

All subsequent accesses (as shown for address 0xC) hit in the cache. Hence, the miss rate is 1/15 = 6.67%.

• All four store instructions write to the same cache block.


• With a write-through cache, each store instruction writes a word to main memory, requiring four main memory writes.
• A write-back policy requires only one main memory access, when the dirty cache block
is evicted.

• (a) C = S × b × N × 4 bytes
• (b) bits = S × N × [A – log2(b) – log2(S) – 2]
• (c) S = 1, N = C/b
• (d) S = C/b


• 40 44 48 4C 70 74 78 7C 80 84 88 8C 90 94 98 9C 0 4 8 C 10 14 18 1C 20



• 74 A0 78 38C AC 84 88 8C 7C 34 38 13C 388 18C

(a) direct mapped cache, b = 1 word


(b) fully associative cache, b = 2 words
(c) two-way set associative cache, b = 2 words
(d) direct mapped cache, b = 4 words


• 74 A0 78 38C AC 84 88 8C 7C 34 38 13C 388 18C

(a) direct mapped cache, b = 1 word

Address (hex)   Binary          Set bits   Set (dec)
74              000001110100    1101       13
A0              000010100000    1000       8
78              000001111000    1110       14
38C             001110001100    0011       3
AC              000010101100    1011       11
84              000010000100    0001       1
88              000010001000    0010       2
8C              000010001100    0011       3
7C              000001111100    1111       15
34              000000110100    1101       13
38              000000111000    1110       14
13C             000100111100    1111       15
388             001110001000    0010       2
18C             000110001100    0011       3

Cache contents by set:
15(F): 7C, 13C     14(E): 78, 38      13(D): 74, 34     12(C): -
11(B): AC          10(A): -           9: -              8: A0
7-4: -             3: 38C, 8C, 18C    2: 88, 388        1: 84     0: -

For the repeating sequence: 14 reads, 11 misses -> Miss rate = 11/14


• (a) 1024/8 = 128

• (b) All accesses have unique tags:
  0    00000000
  8    00001000
  10   00010000
  18   00011000
  20   00100000
  28   00101000
  -> 100% miss rate

• (c) Increase block size to 4 words:
  0    00000000
  8    00001000
  10   00010000
  18   00011000
  20   00100000
  28   00101000
  -> Miss rate 50%

• (b) Each tag is 16 bits.


There are 32 Kwords / (2 words/block) = 16K blocks and each block needs a tag: 16 × 16K = 2^18 bits = 256 Kbits of tags

• (c) Each cache block requires:


• 2 status bits (V & D),
• 16 bits of tag,
• 2 words = 2*32 data bits,
• With associativity 2, each set requires: 2 × (2 + 16 + 64) = 2 × 82 = 164 bits


• The design must use enough RAM chips to handle both the total capacity
and the number of bits that must be read on each cycle. For the data, the SRAM
must provide a capacity of 128 KB and must read 64 bits per cycle (one 32-bit
word from each way)
• Thus the design needs at least 4 × 32 KB / (8 KB/RAM) = 16 RAM chips to hold the data and 64 bits / (4 pins/RAM) = 16 RAMs to supply the required bits per cycle. These are equal, so the design needs exactly 16 RAMs for the data

• For the tags, the total capacity is 32 KB, from which 32 bits (two 16-bit tags)
must be read each cycle.
• Therefore, only 4 RAMs are necessary to meet the capacity, but 8 RAMs are
needed to supply 32 bits per cycle. Therefore, the design will need 8 RAMs,
each of which is being used at half capacity
• With 8K sets, the status bits require another 8K × 4-bit RAM. We use a 16K ×
4-bit RAM, using only half of the entries


• (a) The word in memory might be found in two


locations, one in the on-chip cache, and one in the
off-chip cache

(b) For the first-level cache, the number of sets, S = 512 /


4 = 128 sets. Thus, 7 bits of the address are set bits. The
block size is 16 bytes / 4 bytes/word = 4 words, so there
are 2 block offset bits. Thus, the number of tag bits for
the first-level cache is 32 - (7+2+2) = 21 bits.

For the second-level cache, the number of sets is equal to


the number of blocks, S = 256 Ksets.
Thus, 18 bits of the address are set bits. The block size is
16 bytes / 4 bytes/word = 4 words, so there are 2 block
offset bits. Thus, the number of tag bits for the second-
level cache is 32 - (18+2+2) = 10 bits


AMAT = t_cache + MR_cache × (t_L2cache + MR_L2cache × t_MM)
     = t_a + (1 − A)[t_b + (1 − B)t_m]

When the first-level cache is enabled, the second-level


cache receives only the “hard” accesses, ones that don’t
show enough temporal and spatial locality to hit in the
first-level cache. The “easy” accesses (ones with good
temporal and spatial locality) hit in the first-level
cache, even though they would have also hit in the
second-level cache.

When the first-level cache is disabled, the hit rate goes up


because the second-level cache supplies both the “easy”
accesses and some of the “hard” accesses.


• AMAT = t_cache + MR_cache × t_MM

• With a cycle time of 1/1 GHz = 1 ns,
  AMAT = 1 ns + 0.15(200 ns) = 31 ns

CPI = 31 + 4 = 35 cycles (for a load)
CPI = 31 + 3 = 34 cycles (for a store)

• Average CPI = (0.11 + 0.02)(3) + (0.52)(4) + (0.1)(34) + (0.25)(35) = 14.6

• Average CPI = 14.6 + 0.1(200) = 34.6


Summary
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)

In the interest of brevity, the following topics will not be


covered: Multicore computing, message passing, shared
memory, cache-coherence protocol, memory consistency, paging


Summary: Microprocessor and Embedded System


Microprocessor Part:
• Patterson Ch-1, Lecture Slides: Fundamentals of microprocessor and computer design, Intro to
CISC and RISC, complexity, metrics, and benchmark;
• Harris Ch-6: processor data path, architecture, microarchitecture, Instruction Set Architecture,
Assembly language programming of Arm based microprocessors (jump, call-return, stack, push and
pop, shift, rotate, logic instructions),
• Harris Ch-7: Instruction-Level Parallelism, pipelining, pipelining hazards and data dependency, branch
prediction, exceptions and limits, super-pipelined vs superscalar processing; Multicore computing,
• Harris Ch-8, Lecture Slides: Memory hierarchy and management, cache, cache policies, multi-level
cache, cache performance;

• In the interest of time, the following topics could not be covered in class and (may be excluded from
syllabus): message passing, shared memory, cache-coherence protocol, memory consistency, virtual
memory, paging, Vector Processor, Graphics Processing Unit, IP Blocks, Single Instruction Multiple
Data and SoC with microprocessors.
