
ARM integer cores

2001 PEVEIT Unit - ARM System Design

Outline:
the ARM 3-stage pipeline
the ARM7TDMI core
the ARM 5-stage pipeline
the ARM9TDMI core
the ARM10TDMI core

The 3-stage ARM pipeline

fetch
the instruction is fetched from memory

decode
the instruction is decoded and the datapath control signals are prepared for the next cycle

execute
the operands are read from the register bank, shifted, combined in the ALU and the result written back

Single-cycle instructions complete at a rate of one per clock cycle.

[Figure: 3-stage pipeline timing for single-cycle instructions: instructions 1 to 3 each pass through fetch, decode and execute, overlapping in time]

More complex instructions:

[Figure: 3-stage pipeline timing with a multi-cycle instruction: ADD, STR, ADD, ADD, ADD (instructions 1 to 5) against time; the STR occupies the execute stage for two cycles (calc. addr., then data xfer), delaying the instructions behind it]
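The delay caused by the STR can be reproduced with a small occupancy model. This is a minimal sketch, assuming each instruction spends one cycle in fetch, one in decode and a given number of cycles in execute, and that an instruction can only move into a stage once its predecessor has left it; it ignores the shared memory port and branches, and the function names are illustrative only.

```python
def schedule(exec_cycles):
    """Return (fetch, decode, execute_start) cycle numbers, counted from 1,
    for each instruction in an in-order 3-stage pipeline."""
    times = []
    for i in range(len(exec_cycles)):
        if i == 0:
            times.append((1, 2, 3))
            continue
        _, prev_decode, prev_execute = times[-1]
        # The fetch stage frees up when the predecessor moves into decode.
        fetch = prev_decode
        # The decode stage frees up when the predecessor starts executing.
        decode = max(fetch + 1, prev_execute)
        # The execute stage frees up when the predecessor has finished.
        execute = max(decode + 1, prev_execute + exec_cycles[i - 1])
        times.append((fetch, decode, execute))
    return times


def total_cycles(exec_cycles):
    """Cycle in which the last instruction finishes executing."""
    _, _, execute = schedule(exec_cycles)[-1]
    return execute + exec_cycles[-1] - 1


def render(names, exec_cycles):
    """Text pipeline diagram: F = fetch, D = decode, E = execute, . = idle."""
    last = total_cycles(exec_cycles)
    rows = []
    for name, n, (f, d, e) in zip(names, exec_cycles, schedule(exec_cycles)):
        cells = []
        for cycle in range(1, last + 1):
            if f <= cycle < d:
                cells.append("F")
            elif d <= cycle < e:
                cells.append("D")
            elif e <= cycle < e + n:
                cells.append("E")
            else:
                cells.append(".")
        rows.append(f"{name:<4}" + " ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # ADD, STR, ADD, ADD, ADD: the STR needs two execute cycles.
    print(render(["ADD", "STR", "ADD", "ADD", "ADD"], [1, 2, 1, 1, 1]))
```

Five single-cycle instructions finish in 7 cycles (two to fill the pipeline, then one per cycle); replacing the second with a two-cycle STR stretches this to 8, and the held instructions show up as repeated D and F entries in the rows behind the STR.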

PC behaviour
r15 increments twice before an instruction executes, due to pipeline operation
therefore r15 = address of instruction + 8
(+12 if used after the first cycle, though this is architecturally undefined)
in Thumb code the offset is +4
normally the assembler makes the necessary adjustments, e.g. in branches

3-stage ARM organization

ARM components:
register bank: 2 read ports, 1 write port, plus additional read and write ports for r15
barrel shifter
ALU
address register and incrementer
memory data registers
instruction decoder and control
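Because r15 reads as the instruction address + 8 (+4 in Thumb code), a PC-relative branch has to encode its offset from that value rather than from the branch itself. The sketch below shows the adjustment an assembler makes for an ARM B/BL (signed 24-bit word offset) and a Thumb unconditional B (signed 11-bit halfword offset); the function and constant names are hypothetical.

```python
ARM_PC_OFFSET = 8    # r15 = address of instruction + 8 in ARM state
THUMB_PC_OFFSET = 4  # r15 = address of instruction + 4 in Thumb state


def arm_branch(addr, target, link=False, cond=0xE):
    """Encode ARM B/BL: cond[31:28] 101 L offset[23:0], offset in words."""
    delta = target - (addr + ARM_PC_OFFSET)
    if delta % 4:
        raise ValueError("ARM branch target must be word aligned")
    offset = delta >> 2
    if not -(1 << 23) <= offset < (1 << 23):
        raise ValueError("target out of range of a 24-bit word offset")
    return (cond << 28) | (0b101 << 25) | (int(link) << 24) | (offset & 0xFFFFFF)


def thumb_branch(addr, target):
    """Encode Thumb unconditional B: 11100 offset[10:0], offset in halfwords."""
    delta = target - (addr + THUMB_PC_OFFSET)
    if delta % 2:
        raise ValueError("Thumb branch target must be halfword aligned")
    offset = delta >> 1
    if not -(1 << 10) <= offset < (1 << 10):
        raise ValueError("target out of range of an 11-bit halfword offset")
    return 0xE000 | (offset & 0x7FF)


if __name__ == "__main__":
    # A branch to itself: the offset is -8 bytes (ARM) or -4 bytes (Thumb).
    print(hex(arm_branch(0x8000, 0x8000)))    # 0xeafffffe
    print(hex(thumb_branch(0x8000, 0x8000)))  # 0xe7fe
```

A branch to itself therefore assembles to 0xEAFFFFFE in ARM code: the encoded offset of -2 words is exactly the +8 pipeline adjustment.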

[Figure: 3-stage ARM organization block diagram: address register, incrementer, PC, register bank, multiply register, barrel shifter, ALU, data out register, data in register and instruction decode & control, connected by the A bus, B bus and ALU bus; external address bus A[31:0] and data bus D[31:0]]

The ARM7TDMI

The ARM7TDMI is an ARM7 3-stage pipeline core, with:
T - support for the Thumb instruction set
D - support for debug (the processor can stop on a debug event)
M - support for long multiplies
I - the EmbeddedICE macrocell, which provides breakpoint and watchpoint hardware (described later)

ARM7TDMI organization

[Figure: ARM7TDMI organization: processor core, EmbeddedICE, bus splitter and JTAG TAP controller, linked by scan chains 0, 1 and 2; signals shown include A[31:0], D[31:0], Din[31:0], Dout[31:0], opc, r/w, mreq, trans, mas[1:0], extern0, extern1, TCK, TMS, TRST, TDI, TDO and other signals]

The ARM7TDMI core interface signals

[Figure: ARM7TDMI core interface signals, grouped as clock control (mclk, wait, eclk), configuration (bigend), interrupts (irq, fiq, isync), initialization (reset), bus control (enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk), debug (dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx), coprocessor interface (opc, cpi, cpa, cpb), power (Vdd, Vss), memory and MMU interface (A[31:0], Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock, trans, mode[4:0], abort), state (Tbit), TAP information (tapsm[3:0], ir[3:0], tdoen, tck1, tck2, screg[3:0]), boundary scan extension (drivebs, ecapclkbs, icapclkbs, highz, pclkbs, rstclkbs, sdinbs, sdoutbs, shclkbs, shclk2bs) and JTAG controls (TRST, TCK, TMS, TDI, TDO)]

ARM7TDMI debug support

the EmbeddedICE module supports breakpoints and watchpoints controlled via the JTAG test access port
EmbeddedICE & JTAG are covered later

ARM7TDMI characteristics:

Process       0.35 µm   Transistors  74,209        MIPS     60
Metal layers  3         Core area    2.1 mm²       Power    87 mW
Vdd           3.3 V     Clock        0 to 66 MHz   MIPS/W   690


Getting higher performance

Increase the clock rate
the clock rate is limited by the slowest pipeline stage
decrease the logic complexity per stage
increase the pipeline depth (number of stages)

Improve the CPI (clocks per instruction)
fewer wasted cycles
better memory bandwidth

The 5-stage ARM pipeline

Fetch
instruction fetch

Decode
instruction decode and register read

Execute
shift and ALU

Memory
data memory access

Write-back
register write
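The two levers above can be put into numbers with the usual relation: execution time = instruction count × CPI / clock rate, where the clock period must cover the slowest stage. This is a minimal sketch; the stage delays and CPI values are hypothetical, and it ignores latch overhead and any extra stalls that a deeper pipeline can introduce.

```python
def max_clock_mhz(stage_delays_ns):
    """Highest clock rate: the period must cover the slowest stage."""
    return 1000.0 / max(stage_delays_ns)


def execution_time_us(instructions, cpi, clock_mhz):
    """Execution time = instruction count x CPI / clock rate (MHz = cycles/us)."""
    return instructions * cpi / clock_mhz


if __name__ == "__main__":
    n = 1_000_000
    # Hypothetical delays (ns): one long execute stage vs. the same work
    # spread over five shorter stages.
    three_stage = [10, 10, 20]
    five_stage = [10] * 5
    # Lever 1: a deeper pipeline raises the clock rate at the same CPI.
    print(execution_time_us(n, 1.5, max_clock_mhz(three_stage)))  # 30000.0
    print(execution_time_us(n, 1.5, max_clock_mhz(five_stage)))   # 15000.0
    # Lever 2: a lower CPI helps at the same clock rate.
    print(execution_time_us(n, 1.2, max_clock_mhz(five_stage)))   # 12000.0
```

Halving the slowest stage doubles the clock rate, but only if the CPI does not rise at the same time, which is why the deeper pipeline also needs the memory bandwidth discussed next.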

Reducing the CPI
ARM7 uses the memory on nearly every clock cycle, for either instruction fetch or data transfer
therefore a reduced CPI requires more than one memory access per clock cycle
Possible solutions are:
separate instruction and data memories
double-bandwidth memory (e.g. ARM8)

ARM9TDMI

The ARM9TDMI is a classic Harvard architecture 5-stage pipeline core:
separate instruction and data memory ports
full support for Thumb and EmbeddedICE debug
aimed at significantly higher performance than the ARM7TDMI
the enhanced pipeline operates at 100-200 MHz
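The memory bandwidth argument can be made concrete as a lower bound on CPI: every instruction needs one fetch, plus a data transfer if it is a load or store. A minimal sketch, assuming each load/store moves a single word (LDM/STM are ignored), that at most one instruction completes per cycle, and a hypothetical load/store fraction of 30%.

```python
def cpi_lower_bound(data_fraction, memory="unified"):
    """Lower bound on CPI set by memory bandwidth alone.

    data_fraction: share of instructions that transfer one data word (assumed).
    memory: "unified" (one access per cycle, ARM7-style),
            "harvard" (separate instruction and data ports, ARM9TDMI-style),
            "double"  (one memory, two accesses per cycle, e.g. ARM8).
    """
    fetches = 1.0  # every instruction is fetched once
    if memory == "unified":
        memory_cycles = fetches + data_fraction          # all share one port
    elif memory == "harvard":
        memory_cycles = max(fetches, data_fraction)      # ports work in parallel
    elif memory == "double":
        memory_cycles = (fetches + data_fraction) / 2    # two accesses per cycle
    else:
        raise ValueError(f"unknown memory organization: {memory}")
    # The pipeline cannot complete more than one instruction per cycle.
    return max(1.0, memory_cycles)


if __name__ == "__main__":
    for memory in ("unified", "harvard", "double"):
        print(memory, cpi_lower_bound(0.3, memory))  # hypothetical 30% loads/stores
```

With 30% loads and stores, a single memory port holds the CPI at 1.3 or above however the pipeline is arranged, while both of the listed solutions remove that bound.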

ARM9TDMI pipeline

[Figure: ARM9TDMI datapath across the Fetch, Decode, Execute, Memory and Write stages: next-pc selection (pc + 4, B, BL, MOV pc, SUBS pc, LDR pc), I-cache fetch, instruction decode with immediate fields and LDM/STM handling, register read (r15 = pc + 8), shift and ALU with multiplier and register-specified shift, pre-/post-index load/store address, forwarding paths, D-cache access with byte replication and rotate/sign extension, register write-back]

very similar to StrongARM (see CPU section)
no separate branch adder

ARM7TDMI:
Fetch: instruction fetch
Decode: Thumb decompress, ARM decode
Execute: reg read, shift/ALU, reg write

ARM9TDMI:
Fetch: instruction fetch
Decode: reg read, decode
Execute: shift/ALU
Memory: data memory access
Write: reg write

in the ARM9TDMI, Thumb instructions are decoded directly
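The forwarding paths let an instruction in Execute use a result that has not yet been written back to the register bank. The sketch below shows the selection logic for a generic 5-stage pipeline; it is an illustrative assumption, not the ARM9TDMI's documented control logic, and it further assumes that load data only exists after the Memory stage, so a load followed immediately by a dependent instruction must wait one cycle. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InFlight:
    """An instruction further down the pipeline (hypothetical model)."""
    dest: Optional[int]    # destination register number, or None
    is_load: bool = False  # result comes from memory rather than the ALU


def operand_source(reg, in_memory=None, in_write=None):
    """Where the Execute stage should take source register `reg` from.

    in_memory: the instruction one ahead, now in the Memory stage
    in_write:  the instruction two ahead, now in the Write stage
    """
    # Check the youngest producer first: it holds the newest value of reg.
    if in_memory is not None and in_memory.dest == reg:
        # Load data does not exist until the end of the Memory stage.
        return "stall" if in_memory.is_load else "forward_from_memory_stage"
    if in_write is not None and in_write.dest == reg:
        return "forward_from_write_stage"
    return "register_bank"


if __name__ == "__main__":
    # ADD r1, r2, r3 then SUB r4, r1, #1: the ALU result is forwarded.
    print(operand_source(1, in_memory=InFlight(dest=1)))
    # LDR r1, [r0] then SUB r4, r1, #1: one-cycle interlock.
    print(operand_source(1, in_memory=InFlight(dest=1, is_load=True)))
```

After the one-cycle stall the load has reached the Write stage, so the same function then selects the write-stage forwarding path.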

ARM9TDMI

EmbeddedICE
as ARM7TDMI, plus:
hardware single-stepping
breakpoints on exceptions

On-chip coprocessor support
for floating-point, DSP, and so on

ARM9TDMI characteristics:

Process       0.25 µm   Transistors  111,000       MIPS     220
Metal layers  3         Core area    2.1 mm²       Power    150 mW
Vdd           2.5 V     Clock        0 to 200 MHz  MIPS/W   1,500
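The ARM7TDMI and ARM9TDMI characteristics tables can be cross-checked and compared with a few lines of arithmetic. A minimal sketch using only the figures in the tables; it assumes the quoted MIPS are measured at the maximum clock rate, which the tables do not state.

```python
# Figures taken from the ARM7TDMI and ARM9TDMI characteristics tables.
CORES = {
    "ARM7TDMI": {"mips": 60, "power_mw": 87, "max_clock_mhz": 66},
    "ARM9TDMI": {"mips": 220, "power_mw": 150, "max_clock_mhz": 200},
}


def mips_per_watt(core):
    return core["mips"] / (core["power_mw"] / 1000)


def mips_per_mhz(core):
    # Assumes the MIPS figure is quoted at the maximum clock rate.
    return core["mips"] / core["max_clock_mhz"]


if __name__ == "__main__":
    for name, core in CORES.items():
        print(f"{name}: {mips_per_watt(core):.0f} MIPS/W, "
              f"{mips_per_mhz(core):.2f} MIPS/MHz")
    arm7, arm9 = CORES["ARM7TDMI"], CORES["ARM9TDMI"]
    print(f"speed-up {arm9['mips'] / arm7['mips']:.1f}x, "
          f"efficiency gain {mips_per_watt(arm9) / mips_per_watt(arm7):.1f}x")
```

This reproduces the ARM7TDMI's 690 MIPS/W; for the ARM9TDMI it gives about 1,467 MIPS/W, which the table appears to round to 1,500. Under the stated assumption, MIPS per MHz rises from about 0.91 to 1.10, consistent with the lower CPI expected from the Harvard 5-stage pipeline.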

ARM10TDMI

The ARM10TDMI is aimed at significantly higher performance than the ARM9TDMI, achieved through use of:
higher clock rate
64-bit I- and D-memory buses
branch prediction
hit-under-miss D-memory interface

ARM10TDMI pipeline

6-stage pipeline:
Fetch: branch prediction, instruction fetch
Issue: decode
Decode: r. read, decode
Execute: addr. calc., shift/ALU, multiply
Memory: data memory access, multiplier partials add
Write: data write, reg write

Additional time allowed for:
I- and D-memory accesses
instruction decode
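The slides do not say which branch prediction scheme the ARM10TDMI uses. As an illustration only, the sketch below assumes a simple static rule applied in the Fetch stage to ARM B instructions: predict backward branches (typically loop closers) taken and forward branches not taken. The target is decoded from the 24-bit word offset relative to r15 = address + 8; the function names are hypothetical.

```python
def branch_target(addr, word):
    """Target of an ARM B/BL instruction word located at addr."""
    offset = word & 0xFFFFFF
    if offset & 0x800000:             # sign-extend the 24-bit word offset
        offset -= 1 << 24
    return addr + 8 + (offset << 2)   # r15 reads as addr + 8


def predict_taken(addr, word):
    """Assumed static rule: backward (or self) branches taken, forward not."""
    return branch_target(addr, word) <= addr


def mispredictions(addr, word, outcomes):
    """Count wrong guesses over a sequence of actual taken/not-taken outcomes."""
    guess = predict_taken(addr, word)
    return sum(1 for taken in outcomes if taken != guess)


if __name__ == "__main__":
    # BNE at 0x8010 back to 0x8000, closing a loop that runs 10 times:
    # taken 9 times, then falls through once.
    bne_loop = 0x1AFFFFFA
    print(hex(branch_target(0x8010, bne_loop)))                    # 0x8000
    print(mispredictions(0x8010, bne_loop, [True] * 9 + [False]))  # 1
```

For a loop-closing branch a static backward-taken rule guesses wrong only on the final exit, which is why even a simple rule can remove most taken-branch refill cycles from a deep pipeline.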