Professional Documents
Culture Documents
W3 081001 ARM Processor Architecture
W3 081001 ARM Processor Architecture
q T: Thumb
q D: On-chip debug support
q M: Enhanced multiplier
q I: Embedded ICE hardware
q T2: Thumb-2
q S: Synthesizable code
q E: Enhanced DSP instruction set
q J: JAVA support, Jazelle
q Z: Should be TrustZone?
q F: Floating point unit
q H: Handshake, clockless design for synchronous or
asynchronous design
qARM8 → ARM9
→ ARM10
qARM9
– 5-stage pipeline (130 MHz or 200MHz)
– Using separate instruction and data memory ports
qARM 10 (1998. Oct.)
– High performance, 300 MHz
– Multimedia digital consumer applications
– Optional vector floating-point unit
Core Architecture
ARM1 v1
ARM2 v2
ARM2as, ARM3 v2a
ARM6, ARM600, ARM610 v3
ARM7, ARM700, ARM710 v3
ARM7TDMI, ARM710T, ARM720T, ARM740T v4T
StrongARM, ARM8, ARM810 v4
ARM9TDMI, ARM920T, ARM940T V4T
ARM9E-S, ARM10TDMI, ARM1020E v5TE
ARM10TDMI, ARM1020E v5TE
ARM11 MPCore, ARM1136J(F)-S, ARM1176JZ(F)-S v6
Cortex-A/R/M v7
address register
– 2 read ports, 1 write ports, access
P
any register
C incrementer
– 1 additional read port, 1 additional
PC write port for r15 (PC)
register
bank
q Barrel Shifter
instruction
– Shift or rotate the operand by any
decode
A multiply &
number of bits
L
q ALU
register
U control
A B
b
u
s
b
u
s barrel
b
u
s
q Address register and
shifter
incrementer
ALU
q Data Registers
– Hold data passing to and from
memory
q Fetch
– The instruction is fetched from memory and placed in the instruction pipeline
q Decode
– The instruction is decoded and the datapath control signals prepared for the
next cycle
q Execute
– The register bank is read, an operand shifted, the ALU result generated and
written back into destination register
increment increment
Rd PC Rd PC
registers registers
Rn Rm Rn
mult mult
as ins. as ins.
as instruction as instruction
[7:0]
increment increment
PC Rn PC
registers registers
Rn Rd
mult mult
lsl #0 shifter
=A /A+ B / A- B = A + B /A - B
[11:0]
(a) 1st cycle - compute address (b) 2nd cycle - store data & auto-index
increment increment
R14
registers registers
PC PC
mult mult
lsl #2 shifter
=A+ B =A
[23:0]
(a) 1st cycle - compute branch target (b) 2nd cycle - save return address
q The third cycle, which is required to complete the pipeline refilling, is also
used to mark the small correction to the value stored in the link register
in order that is points directly at the instruction which follows the branch
SOC Consortium Course Material 23
Branch Pipeline Example
q Decode
instruction
decode
register read
immediate
fields – The instruction is decoded and
LDM/
mul register operands read from the
+4
STM post-
index reg
register files. There are 3 operand
shift
pre-index
shift
execute
read ports in the register file so most
mux
ALU forwarding
paths ARM instructions can source all their
B, BL operands in one cycle
MOV pc
q Execute
SUBS pc
byte repl.
q Write back
instruction
decode
register read
immediate
fields – The result generated by the
LDM/
mul instruction are written back to the
+4
STM post-
index reg
register file, including any data
shift
pre-index
shift
execute
loaded from memory
ALU forwarding
paths
mux
B, BL
MOV pc
SUBS pc
byte repl.
D-cache buffer/
load/store data
address
rot/sgn ex
LDR pc
q Forwarding works as
pc
+4
I-cache fetch
pc + 4
follows:
pc + 8 I decode – The ALU result from the
r15
instruction EX/MEM register is always fed
decode
register read back to the ALU input latches.
immediate
fields
– If the forwarding hardware
LDM/
mul
detects that the previous ALU
STM
operation has written the
post-
+4 index reg
shift shift
pre-index
execute register corresponding to the
ALU
source for the current ALU
forwarding
paths
mux
B, BL
MOV pc operation, control logic selects
SUBS pc
byte repl.
the forwarded result as the ALU
input rather than the value read
load/store
address
D-cache buffer/
data from the register file.
rot/sgn ex
LDR pc forwarding paths
register write write-back
1 2 3 4 5 6 7 8
LDR R1,@(R2) IF ID EX MEM WB
SUB R4,R1,R5 IF ID EXsub MEM WB
AND R6,R1,R7 IF ID EXand MEM WB
OR R8,R1,R9 IF ID EXE MEM WB
1 2 3 4 5 6 7 8 9
LDR R1,@(R2) IF ID EX MEM WB
SUB R4,R1,R5 IF ID stall EXsub MEM WB
AND R6,R1,R7 IF stall ID EX MEM WB
OR R8,R1,R9 stall IF ID EX MEM WB
q 8-stage pipeline
q Data forwarding and branch prediction
– Dynamic/static branch prediction
q Improved memory access
– Non-blocking
– Hit-under-miss
q Pipeline parallism
– ALU/MAC, LSU
– LS instruction won’t stall the pipeline
– Out-of-order completion
Pipeline Length 5 6 7 8
scan chain 2
extern0 Embedded scan chain 0
extern1
ICE
opc, r/w,
mreq, trans,
mas[1:0]
A[31:0] processor other
core signals
Din[31:0]
bus JTAG TAP
splitter controller
Dout[31:0]
q ARM710T q ARM720T
– 8K unified write through cache – As ARM 710T but with WinCE
support
– Full memory management unit
supporting virtual memory q ARM 740T
– 8K unified write through cache
– Write buffer
– Memory protection unit
– Write buffer
ARM10TDMI
q Core Organization PC instructions
memory integer
– The prefetch unit is responsible for (double-
bandwidth) read data unit
fetching instructions from memory and
CPinst. CPdata
buffering them (exploiting the double write data
bandwidth memory)
coprocessor(s)
– It is also responsible for branch prediction
and use static prediction based on the
branch prediction (backward: predicted
‘taken’; forward: predicted ‘not taken’)
SOC Consortium Course Material 58
Pipeline Organization
inst. decode
decode
register read
coproc
data multiplier
ALU/shifter execute
write
pipeline
+4 mux
write
data
address
memory
read
data
forwarding rot/sgn ex
paths
write
register write
copy-back tag
– Coprocessor
copy-back data
CP15
– Write buffer
physical address
address buffer
qHarvard architecture
– Increases available memory bandwidth
• Instruction memory interface
• Data memory interface
– Simultaneous accesses to instruction and data memory
can be achieved
q5-stage pipeline
qChanges implemented to
– Improve CPI to ~1.5
– Improve maximum clock frequency
pc + 8 I decode
r15
instruction
decode
register read
immediate
fields
mul
LDM/
STM post-
+4 index reg
shift shift
pre-index
execute
ALU forwarding
paths
mux
B, BL
MOV pc
SUBS pc
byte repl.
D-cache buffer/
load/store data
address
rot/sgn ex
LDR pc
ARM9TDMI:
instruction r. read data memory reg
fetch shift/ALU access write
decode
Not sufficient slack time to translate Thumb instructions into ARM instructions and
then decode, instead the hardware decode both ARM and Thumb instructions
directly
virtual DA
supporting virtual
virtual IA
CP15
addressing and
instruction data memory protection
MMU ARM9TDMI MMU
– Write buffer
physical DA
EmbeddedICE
& JTAG
physical
address tag
write
AMBA interface
buffer
copy-back DA
physical IA
AMBA AMBA
address data
q ARM 940T
external – 2 × 4K caches
coprocessor
interface – Memory protection
Unit
Protection Unit
instruction data – Write buffer
cache cache
ARM9TDMI
data address
EmbeddedICE
instructions
data
& JTAG
I address
write
AMBA interface
buffer
AMBA AMBA
address data
0.13um
Main memory
FF..FF16
registers
instructions
processor
instructions
address and data
data
copies of
instructions address
copies of
data
memory
cache
00..0016
instructions
and data
instructions
cache
address instructions
instructions
registers
processor
address
copies of
data
data memory
cache
00..00 16
19 9 4
q The 8Kbytes of data in
address: tag index line 16-byte lines. There
would therefore be 512
lines
tag RAM data RAM
q A 32-bit address:
– 4 bits to address bytes
512
within the line
lines – 9 bits to select the line
– 19-bit tag
compare mux
hit data
hit data
compare mux
compare mux
256
lines
tag RAM data RAM
SOC Consortium Course Material 88
Fully Associative Cache
hit data
address
line q The 8Kbytes of data in
16-byte lines. There
would therefore be 512
tag CAM data RAM lines
512 q A 32-bit address:
– 4 bits to address bytes
lines within the line
– 28-bit tag
mux
hit data
C compiler assembler
ARMsd
system model
development
ARMulator
board