EEE415-Week07-Micro Architecture and Memory
415 Memory
Lecture 7.1
Week 7
DRAFT 7/16/2023
Advanced Microarchitecture
• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• Multithreading
• Multiprocessors
Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
• Pipeline hazards
• Sequencing overhead
• Power
• Cost
Branch Prediction
• Guess whether branch will be taken
• Backward branches are usually taken (loops)
• Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush
Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
• Check direction of branch (forward or backward)
• If backward, predict taken
• Else, predict not taken
• Dynamic branch prediction:
• Keep history of last several hundred (or thousand) branches in branch target buffer, record:
– Branch destination
– Whether branch was taken
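As an illustrative sketch (not any particular processor's design), dynamic prediction can be modeled with a table of 2-bit saturating counters indexed by the branch PC: states 0–1 predict not taken, 2–3 predict taken, so a single mispredict does not flip a stable pattern.

```python
# Sketch of a 2-bit saturating-counter dynamic branch predictor.
# Table size and initial state are illustrative assumptions.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [2] * entries   # start in "weakly taken"

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2   # 2,3 -> taken

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

# A backward loop branch taken 9 times, then not taken on loop exit:
p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x40) == taken:
        hits += 1
    p.update(0x40, taken)
# only the final not-taken outcome mispredicts
```

This is why loop-heavy code predicts so well: the counter saturates at "strongly taken" and only the exit branch flushes.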
Superscalar
• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
[Figure: 2-way superscalar datapath — instruction memory fetches two instructions per cycle; the register file gains extra read and write ports; two ALUs and two data-memory ports execute both instructions, giving 2 IPC at the cost of substantially more hardware.]
Superscalar Example
Ideal IPC: 2
Actual IPC: 2
[Figure: pipeline diagram over 8 cycles — the six independent instructions below issue in pairs, two per cycle, through the IM, RF, execute, DM, and writeback stages.]
LDR R8, [R0, #40]
ADD R9, R1, R2
SUB R10, R1, R3
AND R11, R3, R4
ORR R12, R1, R5
STR R5, [R0, #80]
[Figure: pipeline diagram with dependencies — ADD depends on the LDR result (R8), so its issue slot stalls until the loaded value is available; the remaining instructions issue around the dependency.]
LDR R8, [R0, #40]
ADD R9, R8, R1
SUB R8, R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7, [R11, #80]
Threading: Definitions
• Process: program running on a computer
• Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
• Thread: part of a program
• Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing
Multithreading
Multiprocessors
Topics Covered
415 Memory
Lecture 7.1
Week 7
Topics
• Introduction
• Memory System Performance Analysis
• Caches
• Virtual Memory
• Memory-Mapped I/O
• Summary
Reference
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)
• Paging - Harris 8.4.1-8.4.2
• Translation Lookaside Buffer - Harris 8.4.3
• Direct Memory Access - Zhu 19.1 (just the concept)
Introduction
Computer performance depends on:
– Processor performance
– Memory system performance
The processor has only a limited number of registers, so it must interact with memory, reading or writing data at a given address over the memory interface. Overall performance therefore depends on both the processor and the memory system.
[Figure: memory interface — the processor drives MemWrite, Address, and WriteData; memory returns ReadData; both are clocked by CLK.]
Processor-Memory Gap
In prior chapters, we assumed a memory access takes 1 clock cycle – but that hasn’t been true since the 1980s
Memory Hierarchy
The processor takes data directly from the cache. Main memory is RAM, and the hard drive (used as virtual memory) can be either an SSD or an HDD. When the CPU runs, data is brought from RAM into the cache first.
[Table: memory technologies with access time (ns), bandwidth (GB/s), and price/GB — as speed increases, price per GB increases.]
Dr. Sajid Muhaimin Choudhury — EEE 415, Department of EEE, BUET
Digital Design and Computer Architecture: ARM® Edition © 2015
Locality
Exploit locality to make memory accesses fast
• Temporal Locality:
• Locality in time
• If data used recently, likely to use it again soon
• How to exploit: keep recently accessed data in higher levels of memory hierarchy
• Spatial Locality:
• Locality in space
• If data used recently, likely to use nearby data soon
• How to exploit: when accessing data, bring nearby data into higher levels of memory hierarchy too
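A toy model (all sizes and addresses below are hypothetical) shows how both kinds of locality turn into cache hits:

```python
# Toy direct-mapped cache model illustrating locality.
# Block size and set count are illustrative assumptions.
BLOCK = 16   # bytes per block: a block fill captures spatial locality
SETS = 8

cache = {}   # set index -> tag currently held

def access(addr):
    """Return True on a hit; fill the block on a miss."""
    block = addr // BLOCK
    s, tag = block % SETS, block // SETS
    if cache.get(s) == tag:
        return True
    cache[s] = tag          # bring the whole block in
    return False

# Spatial locality: a sequential sweep misses once per block,
# then hits on the other 15 bytes of each 16-byte block.
seq_hits = sum(access(a) for a in range(0, 256))   # 240 of 256 hit

# Temporal locality: after the first access, re-reads of the
# same word all hit.
tmp_hits = sum(access(0) for _ in range(100))      # 99 of 100 hit
```

The sweep achieves a 15/16 hit ratio purely from block fills; the re-read loop misses only on its first touch.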
Note: RAM is treated as the main memory here; misses in main memory must also be considered.
AMAT = [1 + 0.375(100)] cycles = 38.5 cycles
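The arithmetic above, spelled out (assuming, per the numbers shown, a 1-cycle cache access, a 37.5% miss rate, and a 100-cycle main-memory access):

```python
# AMAT = t_cache + miss_rate * t_MM, with the slide's numbers.
t_cache, miss_rate, t_mm = 1, 0.375, 100
amat = t_cache + miss_rate * t_mm   # 1 + 37.5 = 38.5 cycles
```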
Amdahl’s Law
• Amdahl’s Law: the effort spent increasing the performance of a subsystem is wasted unless the subsystem affects a large percentage of overall performance
EEE Memory:
Cache
Cache
• Highest level in memory hierarchy
• Fast (typically ~1 cycle access time)
[Figure: hierarchy diagram — speed increases from main memory up to the cache.]
Cache Operation
• When the cache client (a CPU, web browser, operating system)
needs to access data presumed to exist in the backing store, it first
checks the cache. If an entry can be found with a tag matching that
of the desired data, the data in the entry is used instead. This
situation is known as a cache hit.
• When the cache is checked and found not to contain any entry with the desired tag, the result is known as a cache miss. This requires a more expensive access of data from the backing store. Once the requested data is retrieved, it is typically copied into the cache, ready for the next access.
• During a cache miss, some other previously existing cache entry is
removed in order to make room for the newly retrieved data.
Cache Policy
When a system writes data to cache, it must at some point write
that data to the backing store as well. The timing of this write is
controlled by what is known as the write policy.
Cache Terminology
• Capacity (C):
• number of data bytes in cache
• Block size (b):
• bytes of data brought into cache at once
• Number of blocks (B = C/b):
• number of blocks in cache
• Degree of associativity (N):
• number of blocks in a set
• Number of sets (S = B/N):
• each memory address maps to exactly one cache set
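These relationships can be checked with a small helper; the example cache below (32 KB, 32-byte blocks, 4-way) is a hypothetical configuration, not one from the slides.

```python
# Derive block count, set count, and set-index width from the
# cache terminology: B = C/b, S = B/N, set bits = log2(S).
from math import log2

def cache_organization(C, b, N):
    """C = capacity in bytes, b = block size in bytes, N = associativity."""
    B = C // b                    # number of blocks
    S = B // N                    # number of sets
    return B, S, int(log2(S))    # set-index bits

B, S, set_bits = cache_organization(C=32 * 1024, b=32, N=4)
# 32 KB, 32 B blocks, 4-way -> 1024 blocks, 256 sets, 8 set bits
```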
Direct Map
• 32-bit byte address; the 2 LSBs (byte offset) are the same (00) for all word accesses
• 30-bit word address
• If the number of sets is S, log2(S) bits select the set
• The remaining bits form the tag
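A sketch of this field split for an 8-set direct-mapped cache with one word per block (2 offset bits, 3 set bits, 27 tag bits); the helper function is illustrative:

```python
# Split a 32-bit byte address into tag / set index / byte offset
# for a direct-mapped cache of `sets` one-word blocks.

def split_address(addr, sets=8):
    set_bits = sets.bit_length() - 1      # log2(sets), sets a power of 2
    byte_offset = addr & 0b11             # 2 LSBs: byte within the word
    set_index = (addr >> 2) % sets        # next log2(sets) bits
    tag = addr >> (2 + set_bits)          # remaining 27 bits
    return tag, set_index, byte_offset

tag, s, off = split_address(0x14)   # 0x14 = 0b0001_0100
# word address 0b101 = 5 -> set 5, tag 0, byte offset 0
```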
Direct Mapped Cache Hardware
[Figure: the memory address splits into a 27-bit tag, 3-bit set index, and byte offset (00); an 8-entry × (1 + 27 + 32)-bit SRAM holds the valid bit, tag, and data for each set; a 27-bit tag comparator produces Hit and the 32-bit data is output.]
[Figure: set-associative cache hardware — two 28-bit tag comparators, one per way, select between two 32-bit data outputs to produce Hit and Data.]
[Figure: fully associative cache — eight ways, each with its own valid bit, tag, and data.]
Spatial Locality?
• Increase block size:
• Block size, b = 4 words
• C = 8 words
• Direct mapped (1 block per set)
• Number of blocks, B = 2 (C/b = 8/4 = 2)
[Figure: the address splits into a 27-bit tag, set bit, 2-bit block offset, and byte offset (00); each of the two sets holds a 4-word block, the block offset selects one of four 32-bit words through a multiplexer, and a tag comparator produces Hit.]
Capacity Misses
• Cache is too small to hold all data of interest at once
• If cache full: program accesses data X & evicts data Y
• Capacity miss when access Y again
• How to choose Y to minimize chance of needing it again?
• Least recently used (LRU) replacement: the least recently used block in a set is evicted
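A minimal sketch of LRU within a single set, using Python's OrderedDict to track recency order (an illustrative model, not a hardware description; the tags are made up):

```python
# LRU replacement for one set of a `ways`-way cache.
# The OrderedDict's insertion order doubles as the recency list:
# the first entry is always the least recently used.
from collections import OrderedDict

def touch(set_blocks, tag, ways=2):
    """Access `tag` in this set; evict the LRU block on overflow."""
    if tag in set_blocks:
        set_blocks.move_to_end(tag)        # hit: mark most recently used
        return True
    if len(set_blocks) == ways:
        set_blocks.popitem(last=False)     # miss: evict the LRU block
    set_blocks[tag] = True
    return False

s = OrderedDict()
hits = [touch(s, t) for t in ["A", "B", "A", "C", "B"]]
# A miss, B miss, A hit, C miss (evicts B, then the LRU), B miss
```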
Types of Misses
LRU Replacement
ARM Assembly Code
MOV R0, #0
LDR R1, [R0, #4]
LDR R2, [R0, #0x24]
LDR R3, [R0, #0x54]
[Figure: 2-way set-associative cache with 4 sets; each way holds V, U, Tag, and Data per set, all initially empty. (a) After the first two loads, Set 1 (01) holds mem[0x00...24] in Way 1 and mem[0x00...04] in Way 0 — addresses 0x04, 0x24, and 0x54 all map to Set 1, so the third load must evict the least recently used way.]
Cache Summary
• What data is held in the cache?
• Recently used data (temporal locality)
• Nearby data (spatial locality)
• How is data found?
• Set is determined by address of data
• Word within block also determined by address
• In associative caches, data could be in one of several ways
• What data is replaced?
• Least-recently used way in the set
Adapted from Patterson & Hennessy, Computer Architecture: A Quantitative Approach, 2011
Multilevel Caches
• Larger caches have lower miss rates, longer
access times
• Expand memory hierarchy to multiple levels of
caches
• Level 1: small and fast (e.g. 16 KB, 1 cycle)
• Level 2: larger and slower (e.g. 256 KB, 2-6 cycles)
• Most modern PCs have L1, L2, and L3 cache
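With two levels, access time compounds: an L1 miss pays the L2 access, and an L2 miss additionally pays main memory. A sketch with hypothetical hit rates and latencies (only the 1-cycle L1 figure comes from the slide):

```python
# Two-level AMAT: t_L1 + (1 - A) * [t_L2 + (1 - B) * t_MM],
# where A and B are the L1 and L2 hit rates.
def amat2(t_l1, A, t_l2, B, t_mm):
    return t_l1 + (1 - A) * (t_l2 + (1 - B) * t_mm)

cycles = amat2(t_l1=1, A=0.95, t_l2=4, B=0.90, t_mm=100)
# 1 + 0.05 * (4 + 0.1 * 100) ≈ 1.7 cycles
```

Even a small L2 hit rate improvement matters, because every L2 miss costs the full main-memory latency.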
EEE Memory:
Cache - Problems
AMAT = 1 + 100*m
→ m = (AMAT – 1)/100
→ m = (1.5-1)/100 = 0.5%
• 0x14 = 0b00010100
3 × 5 = 15 memory accesses
0x4 = 0b00000100, 0xC = 0b00001100, 0x8 = 0b00001000
[Figure: cache contents — three valid entries, each with tag 0: mem[0x00...0C], mem[0x00...08], mem[0x00...04].]
• (a) C = S ×b ×N ×4 bytes
• (b) bits = S×N×[A – log2b – log2S – 2]
• (c) S = 1, N = C/b
• (d) S = C/b
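The bracketed term in formula (b) is the tag width per block; it can be sanity-checked against the earlier 8-entry direct-mapped example (A = 32-bit addresses, S = 8 sets, b = 1 word per block):

```python
# Tag width per block: A - log2(b) - log2(S) - 2,
# i.e. address bits minus block-offset, set-index, and byte-offset bits.
from math import log2

def tag_bits(A, S, b):
    """A = address bits, S = sets, b = block size in words."""
    return A - int(log2(b)) - int(log2(S)) - 2

bits = tag_bits(A=32, S=8, b=1)   # 32 - 0 - 3 - 2 = 27 tag bits
```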
• 40 44 48 4C 70 74 78 7C 80 84 88 8C 90 94 98 9C 0 4 8 C 10 14 18 1C 20
• The design must use enough RAM chips to handle both the total capacity and the number of bits that must be read on each cycle. For the data, the SRAM must provide a capacity of 128 KB and must read 64 bits per cycle (one 32-bit word from each way).
• Thus the design needs at least 4 × 32 KB / (8 KB/RAM) = 16 RAM chips to hold the data, and 64 bits / (4 pins/RAM) = 16 RAMs to supply the required bits per cycle. These are equal, so the design needs exactly 16 RAMs for the data.
• For the tags, the total capacity is 32 KB, from which 32 bits (two 16-bit tags)
must be read each cycle.
• Therefore, only 4 RAMs are necessary to meet the capacity, but 8 RAMs are needed to supply 32 bits per cycle. The design therefore needs 8 RAMs, each used at half capacity.
• With 8K sets, the status bits require another 8K × 4-bit RAM. We use a 16K × 4-bit RAM, using only half of the entries.
AMAT = ta + (1 − A)[tb + (1 − B)tm]
Summary
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)
• In the interest of time, the following topics could not be covered in class and may be excluded from the syllabus: message passing, shared memory, cache-coherence protocols, memory consistency, virtual memory, paging, vector processors, graphics processing units, IP blocks, single instruction multiple data (SIMD), and SoC with microprocessors.