
EEE 415 - Microprocessors and Embedded Systems
Micro Architecture: Advanced Architectures, Memory
Lecture 7.1, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology


Advanced Microarchitecture
• Deep Pipelining
• Micro-operations
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• Multithreading
• Multiprocessors



Deep Pipelining
• 10-20 stages typical
• Number of stages limited by:
• Pipeline hazards
• Sequencing overhead
• Power
• Cost


Branch Prediction
• Guess whether branch will be taken
• Backward branches are usually taken (loops)
• Consider history to improve guess
• Good prediction reduces fraction of
branches requiring a flush


Branch Prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
• Check direction of branch (forward or backward)
• If backward, predict taken
• Else, predict not taken
• Dynamic branch prediction:
• Keep history of last several hundred (or thousand)
branches in branch target buffer, record:
– Branch destination
– Whether branch was taken


Branch Prediction Example


      MOV R1, #0        ; R1 = sum
      MOV R0, #0        ; R0 = i
FOR                     ; for (i=0; i<10; i=i+1)
      CMP R0, #10
      BGE DONE
      ADD R1, R1, R0    ; sum = sum + i
      ADD R0, R0, #1
      B FOR
DONE


1-Bit Branch Predictor


• Remembers whether branch was taken the
last time and does the same thing
• Mispredicts first and last branch of loop


2-Bit Branch Predictor

[State diagram: four states - strongly taken, weakly taken, weakly not taken, strongly not taken; the two "taken" states predict taken, the two "not taken" states predict not taken; each taken outcome moves one state toward strongly taken, each not-taken outcome moves one state toward strongly not taken]
Only mispredicts last branch of loop
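
As an illustration (a sketch, not code from the textbook), the 2-bit saturating counter can be written in a few lines of C; the branch being modeled is the loop-exit branch (BGE DONE) from the earlier example:

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken
   (illustrative sketch, not the textbook's code) */
typedef struct { unsigned state; } predictor2_t;

static bool predict(const predictor2_t *p) {
    return p->state >= 2;                        /* the two "taken" states predict taken */
}

static void update(predictor2_t *p, bool taken) {
    if (taken  && p->state < 3) p->state++;      /* strengthen toward strongly taken     */
    if (!taken && p->state > 0) p->state--;      /* weaken toward strongly not taken     */
}

int main(void) {
    predictor2_t bp = { .state = 2 };            /* arbitrary start: weakly taken        */
    for (int run = 1; run <= 2; run++) {
        int mispredicts = 0;
        for (int i = 0; i < 11; i++) {           /* BGE DONE: not taken 10x, taken once  */
            bool actual = (i == 10);
            if (predict(&bp) != actual) mispredicts++;
            update(&bp, actual);
        }
        printf("run %d: %d mispredictions\n", run, mispredicts);
    }
    return 0;
}

After the first pass warms the counter up, only the final (loop-exit) branch is mispredicted on each later pass, which is the behaviour stated above.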


Superscalar
• Multiple copies of datapath execute
multiple instructions at once
• Dependencies make it tricky to issue
multiple instructions at once
[Figure: two-way superscalar datapath - the instruction memory fetches two instructions per cycle, a multiported register file supplies operands to two ALUs, and the data memory serves two accesses per cycle]


Superscalar Example
Ideal IPC: 2
Actual IPC: 2

LDR R8,  [R0, #40]
ADD R9,  R1, R2
SUB R10, R1, R3
AND R11, R3, R4
ORR R12, R1, R5
STR R5,  [R0, #80]

[Pipeline diagram: the six independent instructions issue in pairs, two per cycle, with no stalls]


Superscalar with Dependencies

Ideal IPC: 2
Actual IPC: 6/5 = 1.2

LDR R8,  [R0, #40]
ADD R9,  R8, R1
SUB R8,  R2, R3
AND R10, R4, R8
ORR R11, R5, R6
STR R7,  [R11, #80]

[Pipeline diagram: the dependences on R8 and R11 force stalls, so the six instructions need five issue cycles]


Advanced Architecture Techniques


• Multithreading
• Word processor: thread for typing, spell checking, printing
• Multiprocessors
• Multiple processors (cores) on a single chip


Threading: Definitions
• Process: program running on a computer
• Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
• Thread: part of a program
• Each process has multiple threads: e.g., a
word processor may have threads for typing,
spell checking, printing
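
A software-level sketch of the same idea (illustrative only, not from the slides): one process creating several POSIX threads, one per activity of the word-processor example. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical workers: in a real word processor these would run typing,
   spell checking, and printing; here each thread just announces itself. */
static void *worker(void *arg) {
    printf("thread for %s running\n", (const char *)arg);
    return NULL;
}

int main(void) {
    const char *tasks[] = { "typing", "spell checking", "printing" };
    pthread_t tid[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, worker, (void *)tasks[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);   /* all three threads share this process's address space */
    return 0;
}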


Threads in Conventional Processor


• One thread runs at once
• When one thread stalls (for example,
waiting for memory):
• Architectural state of that thread stored
• Architectural state of waiting thread loaded into
processor and it runs
• Called context switching

• Appears to user like all threads running


simultaneously


Multithreading

• Multiple copies of architectural state


• Multiple threads active at once:
• When one thread stalls, another runs immediately
• If one thread can’t keep all execution units busy, another
thread can use them
• Does not increase instruction-level parallelism
(ILP) of single thread, but increases throughput
Intel calls this “hyperthreading”


Multiprocessors

• Multiple processors (cores) with a method of


communication between them
• Types:
• Homogeneous: multiple cores with shared main memory
• Heterogeneous: separate cores for different tasks (for
example, DSP and CPU in cell phone)
• Clusters: each core has own memory system


Topics Covered

• Sarah Harris Chapter 7:


• 7.1 Introduction
• 7.2 Performance Analysis
• 7.3 Single Cycle Processor
• 7.4 Multicycle Processor
• 7.5 Pipelining
• 7.7 – Selected Topics:
• 7.7.1 Deep Pipelining (Super pipeline)
• 7.7.3 Branch Prediction (Branch Prediction)
• 7.7.4 Super Scalar Processor (Super Scalar Processing)
• 7.7.7 Multi-threading (Multi core computing)
• 7.7.8 Multi Processor (Multi core computing)


EEE 415 - Microprocessors and Embedded Systems
Micro Architecture: Advanced Architectures, Memory
Lecture 7.1, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

Topics

• Introduction
• Memory System Performance Analysis
• Caches
• Virtual Memory
• Memory-Mapped I/O
• Summary


Reference
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)
• Paging - Harris 8.4.1-8.4.2
• Translation Lookaside Buffer- Harris 8.4.3
• Direct Memory Access - Zhu 19.1 (just the concept)


Introduction
Computer performance depends on:
– Processor performance
– Memory system performance
Memory Interface
[Figure: the processor sends Address, WriteData, and MemWrite (WE) to the memory and receives ReadData; both are clocked by CLK]


Processor-Memory Gap
In prior chapters, we assumed a memory access takes 1 clock cycle – but that hasn't been true since the 1980s


Memory System Challenge


• Make memory system appear as fast as processor
• Use hierarchy of memories
• Ideal memory:
– Fast
– Cheap (inexpensive)
– Large (capacity)

But can only choose two!


Memory Hierarchy

Technology   Price / GB   Access Time (ns)   Bandwidth (GB/s)
SRAM         $10,000      1                  25+
DRAM         $10          10 - 50            10
SSD          $1           100,000            0.5
HDD          $0.1         10,000,000         0.1

[Hierarchy: Cache (SRAM) -> Main Memory (DRAM) -> Virtual Memory (SSD/HDD); speed decreases and capacity increases moving down]


Locality
Exploit locality to make memory accesses fast
• Temporal Locality:
• Locality in time
• If data used recently, likely to use it again soon
• How to exploit: keep recently accessed data in higher levels of
memory hierarchy
• Spatial Locality:
• Locality in space
• If data used recently, likely to use nearby data soon
• How to exploit: when access data, bring nearby data into higher
levels of memory hierarchy too


Memory Performance
• Hit: data found in that level of memory hierarchy
• Miss: data not found (must go to next level)
Hit Rate = # hits / # memory accesses
= 1 – Miss Rate
Miss Rate = # misses / # memory accesses
= 1 – Hit Rate
• Average memory access time (AMAT): average time for
processor to access data
AMAT = t_cache + MR_cache × [t_MM + MR_MM × t_VM]


Memory Performance Example 1


• A program has 2,000 loads and stores
• 1,250 of these data values in cache
• Rest supplied by other levels of memory hierarchy
• What are the hit and miss rates for the cache?

Hit Rate = 1250/2000 = 0.625


Miss Rate = 750/2000 = 0.375 = 1 – Hit Rate


Memory Performance Example 2


• Suppose processor has 2 levels of hierarchy: cache
and main memory
• tcache = 1 cycle, tMM = 100 cycles
• What is the AMAT of the program from Example 1?

AMAT = t_cache + MR_cache × t_MM
     = [1 + 0.375(100)] cycles
     = 38.5 cycles
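
A quick arithmetic check of Examples 1 and 2 (an illustrative sketch, not part of the slides):

#include <stdio.h>

/* AMAT for a two-level hierarchy (cache backed by main memory) */
static double amat(double t_cache, double mr_cache, double t_mm) {
    return t_cache + mr_cache * t_mm;
}

int main(void) {
    double mr = 750.0 / 2000.0;                            /* miss rate from Example 1 = 0.375 */
    printf("AMAT = %.1f cycles\n", amat(1.0, mr, 100.0));  /* 1 + 0.375 * 100 = 38.5           */
    return 0;
}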


Amdahl's Law
• Amdahl's Law: the effort spent increasing the performance of a subsystem is wasted unless the subsystem affects a large percentage of overall performance

S_latency(s) = 1 / [(1 - p) + p/s]

• S_latency is the theoretical speedup of the execution of the whole task;
• s is the speedup of the part of the task that benefits from improved system resources;
• p is the proportion of execution time that the part benefiting from improved resources originally occupied.

[Photo: Gene Myron Amdahl (1922–2015)]
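
For example (illustrative numbers, not from the slides), speeding up a part that accounts for 50% of execution time by 10× gives an overall speedup of only about 1.8×:

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction p of the runtime is sped up by s */
static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* illustrative numbers: p = 0.5, s = 10 */
    printf("S_latency = %.2f\n", amdahl(0.5, 10.0));   /* about 1.82 */
    return 0;
}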

EEE 415 - Microprocessors and Embedded Systems
Memory: Cache
Lecture 7.2, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

Cache
• Highest level in memory hierarchy
• Fast (typically ~1 cycle access time)
• Ideally supplies most data to processor
• Usually holds most recently accessed data

[Figure: hierarchy pyramid - Cache, Main Memory, Virtual Memory; speed decreases and capacity increases moving down]


Cache Design Questions


• What data is held in the cache?
• How is data found?
• What data is replaced?

Focus on data loads, but stores follow same principles


What data is held in the cache?

• Ideally, cache anticipates needed data and puts it in cache


• But impossible to predict future
• Use past to predict future – temporal and spatial locality:
• Temporal locality: copy newly accessed data into cache
• Spatial locality: copy neighboring data into cache too


Cache Operation
• When the cache client (a CPU, web browser, operating system)
needs to access data presumed to exist in the backing store, it first
checks the cache. If an entry can be found with a tag matching that
of the desired data, the data in the entry is used instead. This
situation is known as a cache hit.
• When the cache is checked and found not to contain any entry with the desired tag, the result is known as a cache miss. This requires a more expensive access of data from the backing store. Once the requested data is retrieved, it is typically copied into the cache, ready for the next access.
• During a cache miss, some other previously existing cache entry is
removed in order to make room for the newly retrieved data.

Cache Policy
When a system writes data to cache, it must at some point write
that data to the backing store as well. The timing of this write is
controlled by what is known as the write policy.

There are two basic writing approaches:[3]


•Write-through: write is done synchronously both to the cache and
to the backing store.
•Write-back (also called write-behind): initially, writing is done only
to the cache. The write to the backing store is postponed until the
modified content is about to be replaced by another cache block.


Cache Policy: Write-miss Policy


Since no data is returned to the requester on write operations, a
decision needs to be made on write misses, whether or not data
would be loaded into the cache. This is defined by these two
approaches:
•Write allocate (also called fetch on write): data at the missed-write
location is loaded to cache, followed by a write-hit operation. In this
approach, write misses are similar to read misses.
•No-write allocate (also called write-no-allocate or write around):
data at the missed-write location is not loaded to cache, and is
written directly to the backing store. In this approach, data is
loaded into the cache on read misses only.

Both write-through and write-back policies can use either of write


allocate or no-write allocate write-miss policies, but usually they are
paired in this way:
• A write-back cache uses write allocate, hoping for subsequent writes
(or even reads) to the same location, which is now cached.
• A write-through cache uses no-write allocate. Here, subsequent
writes have no advantage, since they still need to be written directly
to the backing store.
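
A minimal sketch of the two store paths for a single hypothetical cache block (names and structure are assumptions, not the textbook's implementation):

#include <stdbool.h>

/* One cache block with a dirty bit; backing_store stands in for main memory. */
typedef struct { int data; bool valid; bool dirty; } block_t;
static int backing_store;

/* Write-through: update the cache and the backing store on every store */
static void store_write_through(block_t *blk, int value) {
    blk->data = value;
    blk->valid = true;
    backing_store = value;              /* synchronous write to memory          */
}

/* Write-back: update only the cache and mark the block dirty */
static void store_write_back(block_t *blk, int value) {
    blk->data = value;
    blk->valid = true;
    blk->dirty = true;                  /* memory write deferred until eviction */
}

static void evict(block_t *blk) {
    if (blk->valid && blk->dirty)
        backing_store = blk->data;      /* single deferred write on eviction    */
    blk->valid = blk->dirty = false;
}

int main(void) {
    block_t b = {0};
    store_write_back(&b, 1);
    store_write_back(&b, 2);
    store_write_back(&b, 3);            /* three stores, no memory traffic yet  */
    evict(&b);                          /* one write to backing_store           */
    store_write_through(&b, 4);         /* this one writes memory immediately   */
    return 0;
}

With write-back, the three stores above cause a single memory write at eviction; with write-through each store would have written memory.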


Write-Back vs. Write-Through

[Figure: flowcharts comparing the write-back and write-through policies]

Cache Terminology
• Capacity (C):
• number of data bytes in cache
• Block size (b):
• bytes of data brought into cache at once
• Number of blocks (B = C/b):
• number of blocks in cache: B = C/b
• Degree of associativity (N):
• number of blocks in a set
• Number of sets (S = B/N):
• each memory address maps to exactly one cache set


How is data found?


• Cache organized into S sets
• Each memory address maps to exactly one set
• Caches categorized by # of blocks in a set:
• Direct mapped: 1 block per set
• N-way set associative: N blocks per set
• Fully associative: all cache blocks in 1 set
• Examine each organization for a cache with:
• Capacity (C = 8 words)
• Block size (b = 1 word)
• So, number of blocks (B = 8)


Example Cache Parameters


• C = 8 words (capacity)
• b = 1 word (block size)
• So, B = 8 (# of blocks)

Ridiculously small, but will illustrate organizations


Direct Map
• 32-bit byte address; the 2 LSBs (byte offset) are the same (00) for word accesses
• 30-bit word address
• If the number of sets is S, log2(S) bits select the set
• Remaining bits form the Tag

• For the example: 27-bit Tag, 3-bit Set
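
For this example cache the field extraction is a couple of shifts and masks; an illustrative sketch (not from the textbook):

#include <stdio.h>

/* Direct mapped, C = 8 words, b = 1 word: 2-bit byte offset, 3 set bits, 27-bit tag */
int main(void) {
    unsigned addr = 0x14;                       /* example byte address              */
    unsigned byte_offset = addr & 0x3;          /* bits [1:0]                        */
    unsigned set         = (addr >> 2) & 0x7;   /* bits [4:2] select one of 8 sets   */
    unsigned tag         = addr >> 5;           /* remaining 27 bits                 */
    printf("byte offset = %u, set = %u, tag = 0x%x\n", byte_offset, set, tag);
    return 0;
}

Address 0x14 lands in set 5, which matches the worked problem later in the deck.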



Direct Mapped Cache


Direct Mapped Cache Hardware

[Figure: the memory address splits into a 27-bit Tag, a 3-bit Set, and a 2-bit Byte Offset (00); the set bits index an 8-entry x (1+27+32)-bit SRAM holding Valid, Tag, and Data; the stored tag is compared with the address tag to generate Hit, and the 32-bit Data is output]


Direct Mapped Cache Performance

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #4]
      LDR R3, [R1, #12]
      LDR R4, [R1, #8]
      SUB R0, R0, #1
      B LOOP
DONE

[Figure: address 0x...04 = Tag 00...00, Set 001, Byte Offset 00; after the loop, sets 1, 2, and 3 hold mem[0x00...04], mem[0x00...08], and mem[0x00...0C]; all other sets are invalid]

Miss Rate = ?


Direct Mapped Cache Performance

[Same code and cache contents as the previous slide]

Miss Rate = 3/15 = 20%
The three misses are compulsory misses; the repeated hits come from temporal locality.
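
The miss count can be reproduced with a small simulation; the sketch below (not from the textbook) models the 8-set, one-word-block cache and replays the loop's fifteen loads:

#include <stdio.h>
#include <stdbool.h>

#define SETS 8

typedef struct { bool valid; unsigned tag; } line_t;

int main(void) {
    line_t cache[SETS] = {0};
    unsigned loads[] = { 0x4, 0xC, 0x8 };       /* LDRs at [R1,#4], [R1,#12], [R1,#8] */
    int misses = 0, accesses = 0;

    for (int i = 0; i < 5; i++) {               /* the loop body runs 5 times         */
        for (int j = 0; j < 3; j++) {
            unsigned set = (loads[j] >> 2) % SETS;
            unsigned tag = loads[j] >> 5;
            accesses++;
            if (!cache[set].valid || cache[set].tag != tag) {
                misses++;                       /* compulsory misses on iteration 1   */
                cache[set].valid = true;
                cache[set].tag = tag;
            }
        }
    }
    printf("miss rate = %d/%d\n", misses, accesses);   /* prints 3/15 (20%) */
    return 0;
}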


Direct Mapped Cache: Conflict

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #0x4]
      LDR R3, [R1, #0x24]
      SUB R0, R0, #1
      B LOOP
DONE

[Figure: address 0x...24 = Tag 00...01, Set 001, Byte Offset 00; addresses 0x4 and 0x24 both map to set 1, so the single entry alternates between mem[0x00...04] and mem[0x00...24]]

Miss Rate = ?


Direct Mapped Cache: Conflict

[Same code and cache contents as the previous slide]

Miss Rate = 10/10 = 100%
Conflict Misses: 0x4 and 0x24 evict each other on every access


N-Way Set Associative Cache

[Figure: two-way set associative cache - the address splits into a 28-bit Tag, a 2-bit Set, and a Byte Offset (00); Way 1 and Way 0 each store Valid, Tag, and Data; both stored tags are compared with the address tag, Hit1 and Hit0 select the matching way's 32-bit data, and Hit = Hit1 OR Hit0]


N-Way Set Associative Performance

ARM Assembly Code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #0x4]
      LDR R3, [R1, #0x24]
      SUB R0, R0, #1
      B LOOP
DONE

Miss Rate = ?

[Figure: two-way cache with 4 sets (Way 1 and Way 0, each with V, Tag, Data), initially all entries invalid]


N-Way Set Associative Performance

[Same code as the previous slide]

Miss Rate = 2/10 = 20%
Associativity reduces conflict misses

[Figure: set 1 holds mem[0x00...24] (Tag 00...10) in Way 1 and mem[0x00...04] (Tag 00...00) in Way 0; all other sets are invalid]


Fully Associative Cache

• Any memory address can map to any cache block

[Figure: a single set of B blocks, each with V, Tag, and Data fields]

• Reduces conflict misses
• Expensive to build (one tag comparator per block)


Spatial Locality?
• Increase block size:
  • Block size, b = 4 words
  • C = 8 words
  • Direct mapped (1 block per set)
  • Number of blocks, B = 2 (C/b = 8/4 = 2)

[Figure: the address splits into a 27-bit Tag, 1 Set bit, a 2-bit Block Offset, and a 2-bit Byte Offset (00); each of the 2 sets holds V, Tag, and four 32-bit words; the block offset selects one of the four words through a multiplexer, and tag comparison generates Hit]


Cache with Larger Block Size

[Figure: same hardware as the previous slide - 27-bit Tag, 1 Set bit, 2-bit Block Offset, 2-bit Byte Offset]


Direct Mapped Cache Performance

ARM assembly code
      MOV R0, #5
      MOV R1, #0
LOOP  CMP R0, #0
      BEQ DONE
      LDR R2, [R1, #4]
      LDR R3, [R1, #12]
      LDR R4, [R1, #8]
      SUB R0, R0, #1
      B LOOP
DONE

Miss Rate = ?


Direct Mapped Cache Performance

[Same code as the previous slide]

Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality

[Figure: address 0x...0C = Tag 00...00, Set 0, Block Offset 11, Byte Offset 00; set 0 holds mem[0x00...00] through mem[0x00...0C] in one block, set 1 is invalid]


Cache Organization Recap


• Capacity: C
• Block size: b
• Number of blocks in cache: B = C/b
• Number of blocks in a set: N
• Number of sets: S = B/N
Organization             Number of Ways (N)   Number of Sets (S = B/N)
Direct Mapped            1                    B
N-Way Set Associative    1 < N < B            B/N
Fully Associative        B                    1


Capacity Misses
• Cache is too small to hold all data of interest at once
• If cache full: program accesses data X & evicts data Y
• Capacity miss when access Y again
• How to choose Y to minimize chance of needing it again?
• Least recently used (LRU) replacement: the least
recently used block in a set evicted


Types of Misses

• Compulsory: first time data accessed


• Capacity: cache too small to hold all
data of interest
• Conflict: data of interest maps to same
location in cache

Miss penalty: time it takes to retrieve a block from


lower level of hierarchy


LRU Replacement
ARM Assembly Code
      MOV R0, #0
      LDR R1, [R0, #4]
      LDR R2, [R0, #0x24]
      LDR R3, [R0, #0x54]

[Figure: two-way cache with 4 sets; Way 1 has V, U, Tag, Data and Way 0 has V, Tag, Data; initially all entries invalid]

LRU = Least Recently Used



LRU Replacement
ARM Assembly Code
      MOV R0, #0
      LDR R1, [R0, #4]
      LDR R2, [R0, #0x24]
      LDR R3, [R0, #0x54]

(a) After the first two loads, set 1 holds:
    Way 1: V=1, U=0, Tag=00...010, mem[0x00...24]    Way 0: V=1, Tag=00...000, mem[0x00...04]

(b) After the load from 0x54 replaces the least recently used way (Way 0), set 1 holds:
    Way 1: V=1, U=1, Tag=00...010, mem[0x00...24]    Way 0: V=1, Tag=00...101, mem[0x00...54]

LRU = Least Recently Used
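
The use-bit bookkeeping can be sketched as follows; this is illustrative C, and the U-bit convention (pointing at the least recently used way) is an assumption taken from the figure:

#include <stdio.h>
#include <stdbool.h>

/* One set of a two-way cache; u names the least recently used way. */
typedef struct { bool valid[2]; unsigned tag[2]; int u; } set_t;

static void cache_access(set_t *s, unsigned tag) {
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) {  /* hit: the other way becomes LRU */
            s->u = 1 - w;
            return;
        }
    int victim = s->u;                          /* miss: replace the LRU way      */
    s->valid[victim] = true;
    s->tag[victim] = tag;
    s->u = 1 - victim;
}

int main(void) {
    set_t set1 = {0};                           /* 4-set cache, b = 1 word: tag = addr >> 4  */
    cache_access(&set1, 0x04 >> 4);             /* miss, fills Way 0                         */
    cache_access(&set1, 0x24 >> 4);             /* miss, fills Way 1; U = 0 as in (a)        */
    cache_access(&set1, 0x54 >> 4);             /* miss, evicts Way 0 (LRU); U = 1 as in (b) */
    printf("U = %d\n", set1.u);
    return 0;
}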



Cache Summary
• What data is held in the cache?
• Recently used data (temporal locality)
• Nearby data (spatial locality)
• How is data found?
• Set is determined by address of data
• Word within block also determined by address
• In associative caches, data could be in one of
several ways
• What data is replaced?
• Least-recently used way in the set


Miss Rate Trends


• Bigger caches reduce capacity misses
• Greater associativity reduces conflict misses

Adapted from Patterson & Hennessy, Computer Architecture: A Quantitative Approach, 2011

Miss Rate Trends

• Bigger blocks reduce compulsory misses


• Bigger blocks increase conflict misses


Multilevel Caches
• Larger caches have lower miss rates, longer
access times
• Expand memory hierarchy to multiple levels of
caches
• Level 1: small and fast (e.g. 16 KB, 1 cycle)
• Level 2: larger and slower (e.g. 256 KB, 2-6 cycles)
• Most modern PCs have L1, L2, and L3 cache
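
Extending the AMAT expression to two cache levels gives AMAT = t_L1 + MR_L1 × (t_L2 + MR_L2 × t_MM); a small sketch with illustrative numbers (not taken from this slide):

#include <stdio.h>

/* AMAT with an L1 and L2 cache in front of main memory */
static double amat2(double t_l1, double mr_l1,
                    double t_l2, double mr_l2, double t_mm) {
    return t_l1 + mr_l1 * (t_l2 + mr_l2 * t_mm);
}

int main(void) {
    /* assumed numbers: 1-cycle L1 missing 5% of the time,
       10-cycle L2 missing 20% of its accesses, 100-cycle main memory */
    printf("AMAT = %.1f cycles\n", amat2(1, 0.05, 10, 0.20, 100));   /* 2.5 cycles */
    return 0;
}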


Intel Pentium III Die


Multi-Core Intel – Shared Cache


EEE 415 - Microprocessors and Embedded Systems
Memory: Cache - Problems
Lecture 7.3, Week 7

Dr. Sajid Muhaimin Choudhury, Assistant Professor
Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

• AMAT = 1 + 0.1 (100) = 11


• Each memory access checks the L1 cache.


• When the L1 cache misses (5% of the time), the processor checks the L2 cache.
• When the L2 cache misses (20% of the time), the processor fetches the data
from MM

• AMAT = 1 + (5/100) × [10 + (20/100) × 100] = 2.5


AMAT = 1 + 100*m

→ m = (AMAT – 1)/100

→ m = (1.5-1)/100 = 0.5%


• 0x14 = 0b00010100

• Word aligned, so ignore 2 LSBs

• Word maps to set 5

• Any address with bits 5,4,3 as 101 will map to this set, e.g.: 0x34, 0x54, 0x74


• 1024 sets require log2(2^10) = 10 bits.

• Two LSBs are byte offset

• 32-10-2 = 20 bits form tag


3 × 5 = 15 memory accesses
0x4 = 00000100
0xC = 00001100
0x8 = 00001000

Final cache contents:
1  0...000  mem[0x00...0C]
1  0...000  mem[0x00...08]
1  0...000  mem[0x00...04]


• Memory addresses 0x4 and 0x24 both


map to set 1.

• During the initial execution of the loop,


data at address 0x4 is loaded into set 1 of
the cache. Then data at address 0x24 is
loaded into set 1, evicting the data from
address 0x4.

Upon the second execution of the loop, the pattern repeats and the cache must refetch data at address 0x4, evicting data from address 0x24. The two addresses conflict, so the miss rate is 100%.



• Memory addresses 0x4 and 0x24 both map to set 1. A two-way cache can accommodate data from both addresses.

  Way 1: 1  00...010  mem[0x00...24]    Way 0: 1  00...000  mem[0x00...04]

• During the first loop iteration, the empty cache misses both addresses and loads both words of data into the two ways of set 1.
• On the next four iterations, the cache hits. Hence, the miss rate is 2/10 = 20%


On the first loop iteration, the cache misses on the access to memory address 0x4. This access loads data at addresses 0x0 through 0xC into the cache block.

  1  00...00  mem[0x00...0C]  mem[0x00...08]  mem[0x00...04]  mem[0x00...00]

All subsequent accesses (as shown for address 0xC) hit in the cache. Hence, the miss rate is 1/15 = 6.67%.

• All four store instructions write to the same cache block.


• With a write-through cache, each store instruction writes a word to main memory, requiring four main memory writes.
• A write-back policy requires only one main memory access, when the dirty cache block
is evicted.

• (a) C = S × b × N × 4 bytes
• (b) bits = S × N × [A – log2(b) – log2(S) – 2]
• (c) S = 1, N = C/b
• (d) S = C/b


• 40 44 48 4C 70 74 78 7C 80 84 88 8C 90 94 98 9C 0 4 8 C 10 14 18 1C 20



• 74 A0 78 38C AC 84 88 8C 7C 34 38 13C 388 18C

(a) direct mapped cache, b = 1 word


(b) fully associative cache, b = 2 words
(c) two-way set associative cache, b = 2 words
(d) direct mapped cache, b = 4 words


• 74 A0 78 38C AC 84 88 8C 7C 34 38 13C 388 18C

(a) direct mapped cache, b = 1 word

Address (hex)   Binary          Set bits   Set (dec)
74              000001110100    1101       13
A0              000010100000    1000       8
78              000001111000    1110       14
38C             001110001100    0011       3
AC              000010101100    1011       11
84              000010000100    0001       1
88              000010001000    0010       2
8C              000010001100    0011       3
7C              000001111100    1111       15
34              000000110100    1101       13
38              000000111000    1110       14
13C             000100111100    1111       15
388             001110001000    0010       2
18C             000110001100    0011       3

Cache contents by set:
15(F): 7C, 13C     14(E): 78, 38      13(D): 74, 34     12(C): -
11(B): AC          10(A): -           9: -              8: A0
7-4: -             3: 38C, 8C, 18C    2: 88, 388        1: 84     0: -

For the repeating sequence: 14 reads, 11 misses -> Miss rate = 11/14


• (a) 1024/8 = 128

• (b) All accesses have unique tags:
  0    00000000
  8    00001000
  10   00010000
  18   00011000
  20   00100000
  28   00101000
  -> 100% miss rate

• (c) Increase block size to 4 words:
  0    00000000
  8    00001000
  10   00010000
  18   00011000
  20   00100000
  28   00101000
  -> Miss rate 50%

• (b) Each tag is 16 bits.


There are 32 Kwords / (2 words/block) = 16K blocks and each block needs a tag: 16 × 16K = 2^18 bits = 256 Kbits of tags

• (c) Each cache block requires:


• 2 status bits (V & D),
• 16 bits of tag,
• 2 words = 2*32 data bits,
• With associativity 2, each set requires: 2 × (2 + 16 + 64) = 2 × 82 = 164 bits


• The design must use enough RAM chips to handle both the total capacity
and the number of bits that must be read on each cycle. For the data, the SRAM
must provide a capacity of 128 KB and must read 64 bits per cycle (one 32-bit
word from each way)
• Thus the design needs at least 4 × 32 KB / (8 KB/RAM) = 16 RAM chips to hold the data and 64 bits / (4 pins/RAM) = 16 RAMs to supply the required bits per cycle. These are equal, so the design needs exactly 16 RAMs for the data

• For the tags, the total capacity is 32 KB, from which 32 bits (two 16-bit tags)
must be read each cycle.
• Therefore, only 4 RAMs are necessary to meet the capacity, but 8 RAMs are
needed to supply 32 bits per cycle. Therefore, the design will need 8 RAMs,
each of which is being used at half capacity
• With 8K sets, the status bits require another 8K × 4-bit RAM. We use a 16K ×
4-bit RAM, using only half of the entries


• (a) The word in memory might be found in two


locations, one in the on-chip cache, and one in the
off-chip cache

(b) For the first-level cache, the number of sets, S = 512 /


4 = 128 sets. Thus, 7 bits of the address are set bits. The
block size is 16 bytes / 4 bytes/word = 4 words, so there
are 2 block offset bits. Thus, the number of tag bits for
the first-level cache is 32 - (7+2+2) = 21 bits.

For the second-level cache, the number of sets is equal to


the number of blocks, S = 256 Ksets.
Thus, 18 bits of the address are set bits. The block size is
16 bytes / 4 bytes/word = 4 words, so there are 2 block
offset bits. Thus, the number of tag bits for the second-
level cache is 32 - (18+2+2) = 10 bits


AMAT = t_cache + MR_cache × (t_L2cache + MR_L2cache × t_MM)
     = t_a + (1 − A)[t_b + (1 − B)t_m]

When the first-level cache is enabled, the second-level


cache receives only the “hard” accesses, ones that don’t
show enough temporal and spatial locality to hit in the
first-level cache. The “easy” accesses (ones with good
temporal and spatial locality) hit in the first-level
cache, even though they would have also hit in the
second-level cache.

When the first-level cache is disabled, the hit rate goes up


because the second-level cache supplies both the “easy”
accesses and some of the “hard” accesses.


• AMAT = t_cache + MR_cache × t_MM

• With a cycle time of 1/1 GHz = 1 ns,
  AMAT = 1 ns + 0.15(200 ns) = 31 ns

CPI = 31 + 4 = 35 cycles (for a load)
CPI = 31 + 3 = 34 cycles (for a store)

• Average CPI = (0.11 + 0.02)(3) + (0.52)(4) + (0.1)(34) + (0.25)(35) = 14.6

• Average CPI = 14.6 + 0.1(200) = 34.6


Summary
• Memory hierarchy and management - Harris 8.1
• Cache performance - Harris 8.2
• Cache, cache policies, multi-level cache - Harris 8.3 (8.3.1-8.3.4)

In the interest of brevity, the following topics will not be


covered: Multicore computing, message passing, shared
memory, cache-coherence protocol, memory consistency, paging


Summary: Microprocessor and Embedded System


Microprocessor Part:
• Patterson Ch-1, Lecture Slides: Fundamentals of microprocessor and computer design, Intro to
CISC and RISC, complexity, metrics, and benchmark;
• Harris Ch-6: processor data path, architecture, microarchitecture, Instruction Set Architecture,
Assembly language programming of Arm based microprocessors (jump, call-return, stack, push and
pop, shift, rotate, logic instructions),
• Harris Ch-7: Instruction-Level Parallelism, pipelining, pipelining hazards and data dependency, branch
prediction, exceptions and limits, super-pipelined vs superscalar processing; Multicore computing,
• Harris Ch-8, Lecture Slides: Memory hierarchy and management, cache, cache policies, multi-level
cache, cache performance;

• In the interest of time, the following topics could not be covered in class and (may be excluded from
syllabus): message passing, shared memory, cache-coherence protocol, memory consistency, virtual
memory, paging, Vector Processor, Graphics Processing Unit, IP Blocks, Single Instruction Multiple
Data and SoC with microprocessors.
