4750 - 21 - ARM Processor Architecture
T: Thumb
D: On-chip Debug support
M: Enhanced Multiplier
I: Embedded ICE hardware
T2: Thumb-2
S: synthesizable code
E: Enhanced DSP instruction set
J: JAVA support, Jazelle
Z: TrustZone
F: Floating point unit
H: Handshake, clockless design for synchronous or asynchronous design
ARM8 → ARM9 → ARM10
ARM9
– 5-stage pipeline (130 MHz or 200 MHz)
– Using separate instruction and data memory ports
ARM10 (Oct. 1998)
– High performance, 300 MHz
– Multimedia digital consumer applications
– Optional vector floating-point unit
Core                                          Architecture
ARM1                                          v1
ARM2                                          v2
ARM2as, ARM3                                  v2a
ARM6, ARM600, ARM610                          v3
ARM7, ARM700, ARM710                          v3
ARM7TDMI, ARM710T, ARM720T, ARM740T           v4T
StrongARM, ARM8, ARM810                       v4
ARM9TDMI, ARM920T, ARM940T                    v4T
ARM9E-S, ARM10TDMI, ARM1020E                  v5TE
ARM11 MPCore, ARM1136J(F)-S, ARM1176JZ(F)-S   v6
Cortex-A/R/M                                  v7
[Figure: datapath organization: address register, incrementer, register bank, multiply register, barrel shifter, ALU, A bus, B bus, ALU bus, data out register, data in register, instruction decode & control, D[31:0]]
Register bank
– 2 read ports, 1 write port, access any register
– 1 additional read port, 1 additional write port for r15 (PC)
Barrel Shifter
– Shift or rotate the operand by any number of bits
ALU
Address register and incrementer
Data Registers
– Hold data passing to and from memory
Instruction Decoder and Control
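The barrel shifter bullet above can be illustrated with a small model. This is only a sketch: the function name `barrel_shift`, the operation names passed as strings, and the restriction to shift amounts 0..31 are assumptions made for illustration, not details taken from the slides.

```python
MASK32 = 0xFFFFFFFF  # operands are modelled as 32-bit values


def barrel_shift(value, op, amount):
    """Shift or rotate a 32-bit value by `amount` bits (0..31 in this sketch)."""
    if not 0 <= amount <= 31:
        raise ValueError("this sketch only models shift amounts 0..31")
    value &= MASK32
    if op == "LSL":  # logical shift left, zeros shifted in
        return (value << amount) & MASK32
    if op == "LSR":  # logical shift right, zeros shifted in
        return value >> amount
    if op == "ASR":  # arithmetic shift right, sign bit replicated
        if value & 0x80000000:
            return ((value >> amount) | (MASK32 << (32 - amount))) & MASK32
        return value >> amount
    if op == "ROR":  # rotate right, bits leaving bit 0 re-enter at bit 31
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError("unknown shift operation: " + op)
```

For example, `barrel_shift(1, "LSL", 4)` gives 16, and rotating `0x80000001` right by one bit gives `0xC0000000`.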
SOC Consortium Course Material 17
3-Stage Pipeline (1/2)
Fetch
– The instruction is fetched from memory and placed in the instruction pipeline
Decode
– The instruction is decoded and the datapath control signals prepared for the
next cycle
Execute
– The register bank is read, an operand shifted, the ALU result generated and
written back into destination register
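The overlap of the three stages can be sketched as a cycle-by-cycle table. The sketch assumes an ideal case in which every instruction spends exactly one cycle in each stage; the names `pipeline_schedule` and `total_cycles` are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_schedule(instructions):
    """Map each cycle number to the instruction occupying each stage.

    Assumes single-cycle instructions with no stalls or branches.
    """
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters Fetch in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(n):
    """Cycles needed to complete n instructions in the ideal pipeline."""
    return n + len(STAGES) - 1 if n else 0


if __name__ == "__main__":
    sched = pipeline_schedule(["ADD", "SUB", "MOV"])
    for cycle in sorted(sched):
        print(cycle, sched[cycle])
```

In cycle 3 all three stages are busy: ADD is executing, SUB is being decoded, and MOV is being fetched. Once the pipeline is full, one instruction completes per cycle.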
[Figure: datapath activity with register operands (Rn, Rm) and with an immediate operand (Rn, [7:0]); shifter and ALU operate as the instruction specifies, result written to Rd, PC incremented]
[Figure: datapath activity, (a) 1st cycle - compute address (Rn and offset [11:0], lsl #0, ALU = A / A + B / A - B), (b) 2nd cycle - store data & auto-index (Rd, Rn, ALU = A + B / A - B)]
[Figure: datapath activity, (a) 1st cycle - compute branch target (PC and offset [23:0], lsl #2, ALU = A + B), (b) 2nd cycle - save return address (PC to R14, ALU = A)]
The third cycle, which is required to complete the pipeline refilling, is also
used to make the small correction to the value stored in the link register
so that it points directly at the instruction which follows the branch
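The branch arithmetic can be sketched numerically. The sketch assumes the standard ARM behaviour that the PC reads as the branch address plus 8 during execution, and that the 24-bit offset field is sign-extended before the `lsl #2` shown in the figure labels. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def branch_target(branch_addr, offset24):
    """1st cycle: target = PC + (sign-extended 24-bit offset << 2)."""
    pc = branch_addr + 8  # assumption: PC is two instructions ahead
    return (pc + (sign_extend(offset24, 24) << 2)) & MASK32


def link_value(branch_addr):
    """Value left in r14 after the correction: the instruction after the branch."""
    return (branch_addr + 4) & MASK32
```

With these assumptions, a branch at `0x8000` with offset field `0xFFFFFE` (that is, -2 words) targets itself, and a BL at `0x8000` leaves `0x8004` in the link register.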
Branch Pipeline Example
[Figure: 5-stage pipeline organization: decode (register read, immediate fields), execute (mul, shift, ALU, LDM/STM, +4, pre-index, post-index, forwarding paths, mux), buffer/data (D-cache, load/store address, byte repl., rot/sgn ex), write-back (register write); PC paths for B, BL, MOV pc, SUBS pc, LDR pc]
Decode
– The instruction is decoded and register operands are read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
Execute
– An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU
Write back
– The results generated by the instruction are written back to the register file, including any data loaded from memory
[Figure: 5-stage pipeline with forwarding paths: fetch (pc + 4, I-cache), decode (pc + 8, r15, instruction decode, register read, immediate fields), execute (mul, shift, ALU, LDM/STM, +4, pre-index, post-index, mux), buffer/data (D-cache, load/store address, byte repl., rot/sgn ex), write-back (register write)]
Forwarding works as follows:
– The ALU result from the EX/MEM register is always fed back to the ALU input latches.
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
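The selection rule in the forwarding bullets can be sketched as a small function. The representation is an assumption made for illustration: the EX/MEM pipeline register is modelled as a dict with hypothetical keys `dest` and `result`, and the register file as a dict from register number to value.

```python
def select_alu_operand(src_reg, regfile, ex_mem):
    """Choose the ALU input for source register `src_reg`.

    ex_mem describes the previous ALU operation (or is None if there
    was none): {"dest": register written, "result": ALU result}.
    """
    if ex_mem is not None and ex_mem["dest"] == src_reg:
        # Previous ALU operation wrote this register: take the forwarded
        # result, because the register file copy is still stale.
        return ex_mem["result"]
    return regfile[src_reg]


if __name__ == "__main__":
    regfile = {1: 10, 2: 20}
    previous = {"dest": 1, "result": 99}  # e.g. an ADD that just wrote r1
    print(select_alu_operand(1, regfile, previous))  # forwarded value
    print(select_alu_operand(2, regfile, previous))  # register file value
```

The sketch covers only the single forwarding path described above, from the immediately preceding ALU operation.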
Cycle           1      2      3      4      5      6      7      8
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     EXsub  MEM    WB
AND R6,R1,R7                  IF     ID     EXand  MEM    WB
OR R8,R1,R9                          IF     ID     EX     MEM    WB

Cycle           1      2      3      4      5      6      7      8      9
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     stall  EXsub  MEM    WB
AND R6,R1,R7                  IF     stall  ID     EX     MEM    WB
OR R8,R1,R9                          stall  IF     ID     EX     MEM    WB
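The stall in the second diagram can be reproduced with a simple count. The sketch assumes a 5-stage pipeline in which a one-cycle stall is inserted whenever an instruction uses the register loaded by the instruction immediately before it. The `Instr` tuple and the function names are hypothetical.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")


def load_use_stalls(program):
    """Count one stall for each instruction that uses the result of the
    load immediately preceding it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LDR" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def cycles_with_interlock(program, depth=5):
    """Total cycles: pipeline fill plus one per instruction plus stalls."""
    if not program:
        return 0
    return len(program) + depth - 1 + load_use_stalls(program)


PROGRAM = [
    Instr("LDR", "R1", ("R2",)),
    Instr("SUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R7")),
    Instr("OR", "R8", ("R1", "R9")),
]
```

For the four-instruction sequence in the diagrams, only SUB directly follows the load, so the count is one stall and nine cycles in total, matching the second diagram. The later AND and OR are delayed by that same single stall rather than adding their own.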
8-stage pipeline
Data forwarding and branch prediction
– Dynamic/static branch prediction
Improved memory access
– Non-blocking
– Hit-under-miss
Pipeline parallelism
– ALU/MAC, LSU
– LS instruction won’t stall the pipeline
– Out-of-order completion
Pipeline Length 5 6 7 8
Instruction Issue Scalar, in-order Scalar, in-order Scalar, in-order Scalar, in-order
[Figure: debug organization: processor core, Embedded ICE, bus splitter and JTAG TAP controller, with scan chain 0 and scan chain 2; signals A[31:0], Din[31:0], Dout[31:0], opc, r/w, mreq, trans, mas[1:0], extern0, extern1, other signals]
ARM710T
– 8K unified write through cache
– Full memory management unit supporting virtual memory
– Write buffer
ARM720T
– As ARM 710T but with WinCE support
ARM 740T
– 8K unified write through cache
– Memory protection unit
– Write buffer
8 ARM10TDMI
Core Organization
[Figure: core organization: prefetch unit (PC, instructions) and integer unit connected to a double-bandwidth memory (addresses, read data, write data), with CPinst. and CPdata links to the coprocessor(s)]
– The prefetch unit is responsible for fetching instructions from memory and buffering them (exploiting the double-bandwidth memory)
– It is also responsible for branch prediction and uses static prediction based on the branch direction (backward: predicted ‘taken’; forward: predicted ‘not taken’)
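The static rule above (backward taken, forward not taken) can be written as a one-line predicate on the sign of the branch offset. The function name `predict_taken` is hypothetical, and treating a zero offset as a forward branch is an assumption of this sketch.

```python
def predict_taken(offset):
    """Static prediction: backward branches (negative offset) are
    predicted taken, forward branches are predicted not taken."""
    return offset < 0


if __name__ == "__main__":
    # A loop-closing branch jumps backward, so it is predicted taken.
    print(predict_taken(-16))
    # A branch that skips ahead over a block is predicted not taken.
    print(predict_taken(32))
```

The rule suits loops: a backward branch that closes a loop is taken on every iteration except the last, so predicting it taken is right most of the time.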
Pipeline Organization
[Figure: pipeline organization: inst. decode, register read, multiplier, ALU/shifter, execute, write pipeline, +4, mux, memory address, read data, write data, rot/sgn ex, forwarding paths, register write, coproc data]
– Coprocessor
– Write buffer
[Figure labels: CP15, copy-back tag, copy-back data, physical address, address buffer]
Harvard architecture
– Increases available memory bandwidth
• Instruction memory interface
• Data memory interface
– Simultaneous accesses to instruction and data memory
can be achieved
5-stage pipeline
Changes implemented to
– Improve CPI to ~1.5
– Improve maximum clock frequency
[Figure: 5-stage pipeline organization: decode (pc + 8, r15, instruction decode, register read, immediate fields), execute (mul, shift, ALU, LDM/STM, +4, pre-index, post-index, forwarding paths, mux), buffer/data (D-cache, load/store address, byte repl., rot/sgn ex); PC paths for B, BL, MOV pc, SUBS pc, LDR pc]
ARM9TDMI:
instruction fetch → decode, r. read → shift/ALU → data memory access → reg write
There is not sufficient slack time to translate Thumb instructions into ARM instructions and
then decode them; instead, the hardware decodes both ARM and Thumb instructions
directly
– Full memory management unit supporting virtual addressing and memory protection
– Write buffer
[Figure: ARM9TDMI core with instruction cache and instruction MMU, data cache and data MMU, CP15, coprocessor interface, EmbeddedICE & JTAG, write buffer, physical address tag, and AMBA interface (AMBA address, AMBA data); buses: virtual IA, virtual DA, physical IA, physical DA, copy-back DA]
ARM 940T
– 2 × 4K caches
– Memory protection unit
– Write buffer
[Figure: ARM9TDMI core with instruction cache, data cache, protection unit, external coprocessor interface, EmbeddedICE & JTAG, write buffer, and AMBA interface (AMBA address, AMBA data); buses: I address, instructions, data address, data]
0.13um
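[Figure: unified cache: the processor (registers) sends addresses and receives instructions and data from a single cache holding copies of instructions and copies of data; main memory (00..00 to FF..FF hexadecimal) holds instructions and data]
[Figure: separate caches: the processor (registers) uses an instruction cache holding copies of instructions and a data cache holding copies of data, each with its own address path to memory (from 00..00 hexadecimal)]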
[Figure: cache with tag RAM and data RAM of 512 lines; the address is split into tag (19 bits), index (9 bits) and line offset (4 bits); compare and mux produce hit and data]
The 8 Kbytes of data are held in 16-byte lines. There would therefore be 512 lines
A 32-bit address:
– 4 bits to address bytes within the line
– 9 bits to select the line
– 19-bit tag
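The 19/9/4 split can be checked with a short calculation. The sketch assumes a direct-mapped organization, as implied by the 9-bit index selecting among all 512 lines. The function name `split_address` and its default parameters are hypothetical.

```python
def split_address(addr, cache_bytes=8 * 1024, line_bytes=16):
    """Split a 32-bit address into (tag, index, offset) for a
    direct-mapped cache with power-of-two sizes."""
    lines = cache_bytes // line_bytes          # 8192 / 16 = 512 lines
    offset_bits = line_bytes.bit_length() - 1  # 4 bits for bytes in a line
    index_bits = lines.bit_length() - 1        # 9 bits to select the line
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (lines - 1)
    tag = addr >> (offset_bits + index_bits)   # remaining 19 bits
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0x00001234)
    print(hex(tag), hex(index), hex(offset))
```

For address `0x00001234` the offset is 4, the index is `0x123`, and the tag is 0. An all-ones address gives the largest 19-bit tag, `0x7FFFF`.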
[Figure: cache with two compare/mux paths, each producing hit and data; tag RAM and data RAM of 256 lines]
Fully Associative Cache
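[Figure: fully associative cache: the address is split into tag and line offset; tag CAM and data RAM (labelled 256 lines), with a mux producing hit and data]
The 8 Kbytes of data are held in 16-byte lines. There would therefore be 512 lines
A 32-bit address:
– 4 bits to address bytes within the line
– 28-bit tag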
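A fully associative lookup can be sketched with a dictionary standing in for the tag CAM, since any tag may match any line. The class name, the method names, and the first-in-first-out replacement policy are assumptions made for illustration; the slides do not specify a replacement policy.

```python
class FullyAssociativeCache:
    """Sketch of a fully associative cache: 512 lines of 16 bytes by default."""

    def __init__(self, num_lines=512, line_bytes=16):
        self.num_lines = num_lines
        self.offset_bits = line_bytes.bit_length() - 1  # 4 bits
        self.lines = {}  # tag -> line data; the dict plays the role of the CAM

    def split(self, addr):
        """Return (28-bit tag, 4-bit byte offset) for a 32-bit address."""
        return addr >> self.offset_bits, addr & ((1 << self.offset_bits) - 1)

    def lookup(self, addr):
        """Return (hit, byte); byte is None on a miss."""
        tag, offset = self.split(addr)
        line = self.lines.get(tag)
        if line is None:
            return False, None
        return True, line[offset]

    def fill(self, addr, line_data):
        """Install a line, evicting the oldest one if the cache is full
        (FIFO policy, assumed here only to keep the sketch small)."""
        tag, _ = self.split(addr)
        if tag not in self.lines and len(self.lines) >= self.num_lines:
            self.lines.pop(next(iter(self.lines)))
        self.lines[tag] = bytes(line_data)
```

With no index field, the whole address above the 4 offset bits is the tag, which is why the tag grows from 19 to 28 bits.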
[Figure: development tools: C compiler, assembler, ARMsd, system model, ARMulator, development board]