
ARM Processor Architecture (II)

Speaker: Lung-Hao Chang 張龍豪
Advisor: Prof. Andy Wu 吳安宇教授

Graduate Institute of Electronics Engineering, National Taiwan University

Modified from National Chiao-Tung University IP Core Design course

Outline
 ARM processor core
 Memory hierarchy
 Software development
 Summary

ARM Platform Design SOC Consortium Course Material 09/21/2003 2


ARM Processor Core



ARM7TDMI Processor Core
 Current low-end ARM core for applications like
digital mobile phones
 TDMI
– T: Thumb, 16-bit compressed instruction set
– D: on-chip Debug support, enabling the processor to halt
in response to a debug request
– M: enhanced Multiplier, which yields a full 64-bit result with high
performance
– I: EmbeddedICE hardware
 Von Neumann architecture
 3-stage pipeline
 CPI ~ 1.9

ARM7TDMI Block Diagram

[Figure: ARM7TDMI block diagram. The processor core is surrounded by EmbeddedICE (inputs extern0, extern1), a bus splitter (D[31:0] split into Din[31:0] and Dout[31:0]) and a JTAG TAP controller (TCK, TMS, TRST, TDI, TDO), linked by scan chains 0, 1 and 2. Other core connections shown: A[31:0], opc, r/w, mreq, trans, mas[1:0] and other signals.]


ARM7TDMI Core Diagram



ARM7TDMI Interface Signals (1/4)
[Figure: ARM7TDMI core interface signals, listed by group]
clock control: mclk, wait, eclk
configuration: bigend
interrupts: irq, fiq, isync
initialization: reset
bus control: enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk
debug: dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx
coprocessor interface: opc, cpi, cpa, cpb
power: Vdd, Vss
memory interface: A[31:0], Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock
MMU interface: trans, mode[4:0], abort
state: Tbit
TAP information: tapsm[3:0], ir[3:0], tdoen, tck1, tck2, screg[3:0]
boundary scan extension: drivebs, ecapclkbs, icapclkbs, highz, pclkbs, rstclkbs, sdinbs, sdoutbs, shclkbs, shclk2bs
JTAG controls: TRST, TCK, TMS, TDI, TDO


ARM7TDMI Interface Signals (2/4)
 Clock control
– All state changes within the processor are controlled by mclk, the
memory clock
– Internal clock = mclk AND \wait
– eclk clock output reflects the clock used by the core
 Memory interface
– 32-bit address A[31:0], bidirectional data bus D[31:0], separate data
out Dout[31:0], data in Din[31:0]
– seq indicates that the memory address will be sequential to that used
in the previous cycle



ARM7TDMI Interface Signals (3/4)
– Lock indicates that the processor should keep the bus to ensure the
atomicity of the read and write phase of a SWAP instruction
– \r/w, read or write
– mas[1:0], encode memory access size – byte, half-word or word
– bl[3:0], externally controlled enables on latches on each of the 4
bytes on the data input bus
 MMU interface
– \trans (translation control), 0: user mode, 1: privileged mode
– \mode[4:0], bottom 5 bits of the CPSR (inverted)
– Abort, disallow access
 State
– T bit, whether the processor is currently executing ARM or Thumb
instructions
 Configuration
– Bigend, big-endian or little-endian
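
The effect of the Bigend configuration input can be illustrated with a short Python sketch. It only models the byte order of a 32-bit word in memory, using the standard `struct` module. The example value 0x11223344 and the helper name `store_word` are assumptions made for illustration and are not part of the ARM7TDMI interface.

```python
import struct

WORD = 0x11223344  # arbitrary example value (assumption for illustration)


def store_word(word, big_endian):
    """Return the 4 bytes of a word as they would sit in memory, lowest address first."""
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, word)


little = store_word(WORD, big_endian=False)
big = store_word(WORD, big_endian=True)

# In little-endian mode the least significant byte sits at the lowest address;
# in big-endian mode the most significant byte does.
print("little-endian:", little.hex())
print("big-endian:   ", big.hex())
```

A byte load from the word's base address would therefore see 0x44 in the little-endian case and 0x11 in the big-endian case.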



ARM7TDMI Interface Signals (4/4)
 Interrupt
– \fiq, fast interrupt request, higher priority
– \irq, normal interrupt request
– isync, allows the interrupt synchronizer to be bypassed
 Initialization
– \reset, starts the processor from a known state, executing from
address 0x00000000
 ARM7TDMI characteristics



Memory Access
 The ARM7 is a Von Neumann,
load/store architecture, i.e.,
– A single 32-bit data bus is used for both instructions and data.
– Only the load/store instructions (and SWP) access
memory.
 Memory is addressed as a 32 bit
address space
 Data type can be 8 bit bytes, 16 bit
half-words or 32 bit words, and may be
seen as a byte line folded into 4-byte
words
 Words must be aligned to 4 byte
boundaries, and half-words to 2 byte
boundaries.
 Always ensure that the memory controller
supports all three access sizes

ARM Memory Interface
 Sequential (S cycle)
– (nMREQ, SEQ) = (0, 1)
– The ARM core requests a transfer to or from an address which is either the
same, or one word or one-half-word greater than the preceding address.
 Non-sequential (N cycle)
– (nMREQ, SEQ) = (0, 0)
– The ARM core requests a transfer to or from an address which is unrelated to
the address used in the preceding cycle.
 Internal (I cycle)
– (nMREQ, SEQ) = (1, 0)
– The ARM core does not require a transfer, as it is performing an internal
function, and no useful prefetching can be performed at the same time.
 Coprocessor register transfer (C cycle)
– (nMREQ, SEQ) = (1, 1)
– The ARM core wishes to use the data bus to communicate with a
coprocessor, but does not require any action by the memory system.
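
The four cycle types above can be summarized as a lookup on the (nMREQ, SEQ) pair. The following Python sketch is only a software model of that table. The names `CYCLE_TYPES`, `cycle_type` and `may_be_sequential` are illustrative assumptions, and the sequential check simply restates the rule on this slide (same address, or one half-word or one word greater).

```python
# (nMREQ, SEQ) -> cycle type, as listed on this slide
CYCLE_TYPES = {
    (0, 0): "N",  # non-sequential memory transfer
    (0, 1): "S",  # sequential memory transfer
    (1, 0): "I",  # internal cycle, no transfer
    (1, 1): "C",  # coprocessor register transfer
}


def cycle_type(nmreq, seq):
    """Decode the bus cycle type from the nMREQ and SEQ signal levels."""
    return CYCLE_TYPES[(nmreq, seq)]


def may_be_sequential(previous_address, address):
    """True if the address is the same as, or 2 or 4 bytes above, the previous one."""
    return address - previous_address in (0, 2, 4)


for signals in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(signals, "->", cycle_type(*signals))
```

A memory controller model could use such a decode to decide, for example, whether a faster burst access is possible (S cycle) or a full access must be started (N cycle).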



Cached ARM7TDMI Macrocells
[Figure: cached ARM7TDMI macrocell. The ARM core (with EmbeddedICE & JTAG and CP15) issues virtual addresses to the MMU, which produces physical addresses; a unified instruction and data cache and a write buffer sit between the core and the AMBA interface (AMBA address, AMBA data).]

 ARM710T
– 8K unified write-through cache
– Full memory management unit supporting virtual memory
– Write buffer
 ARM720T
– As ARM710T but with WinCE support
 ARM740T
– 8K unified write-through cache
– Memory protection unit
– Write buffer

Processor Core Vs CPU Core
 Processor Core
– The engine that fetches instructions and executes them
– E.g.: ARM7TDMI, ARM9TDMI, ARM9E-S

 CPU Core
– Consists of the ARM processor core and some tightly coupled
function blocks
– Cache and memory management blocks
– E.g.: ARM710T, ARM720T, ARM740T, ARM920T, ARM922T, ARM940T,
ARM946E-S, and ARM966E-S

[Figure: ARM710T CPU core. The ARM7TDMI processor core (with EmbeddedICE & JTAG) sends virtual addresses to the MMU and to the instruction & data cache; physical addresses, instructions & data, the write buffer and CP15 connect to the AMBA interface (AMBA address, AMBA data).]


ARM8
 Higher performance than ARM7
– By increasing the clock rate
– By reducing the CPI
• Higher memory bandwidth, 64-bit wide memory
• Separate memories for instruction and data accesses

 ARM8, ARM9TDMI, ARM10TDMI

 Core Organization
– The prefetch unit is responsible for fetching instructions from memory
and buffering them (exploiting the double-bandwidth memory)
– It is also responsible for branch prediction and uses static prediction
based on the branch direction (backward: predicted ‘taken’;
forward: predicted ‘not taken’)

[Figure: ARM8 core organization. The prefetch unit sends addresses to the double-bandwidth memory and passes the PC and instructions to the integer unit; the integer unit exchanges read data and write data with memory, and CPinst. and CPdata with the coprocessor(s).]
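
The static prediction rule above (backward taken, forward not taken) can be sketched in a few lines of Python. This is a behavioural illustration only, not a model of the ARM8 prefetch unit. The function names and the two branch traces (a loop-closing branch and a forward skip) are hypothetical examples.

```python
def predict_taken(branch_offset):
    """Static prediction: backward branches (negative offset) are predicted taken."""
    return branch_offset < 0


def prediction_accuracy(outcomes):
    """Fraction of (offset, actually_taken) pairs that the static rule predicts correctly."""
    correct = sum(1 for offset, taken in outcomes if predict_taken(offset) == taken)
    return correct / len(outcomes)


# Hypothetical trace: a loop-closing backward branch, taken 9 times and then falling through.
loop_branch = [(-16, True)] * 9 + [(-16, False)]
# Hypothetical trace: a forward branch that is taken 3 times out of 10.
forward_skip = [(8, False)] * 7 + [(8, True)] * 3

print("loop branch accuracy:   ", prediction_accuracy(loop_branch))
print("forward branch accuracy:", prediction_accuracy(forward_skip))
```

The rule works well for loops because a loop-closing branch jumps backward and is taken on every iteration except the last.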
Pipeline Organization
 5-stage pipeline: the prefetch unit occupies the 1st stage, the
integer unit occupies the remainder

(1) Instruction prefetch (prefetch unit)
(2) Instruction decode and register read (integer unit)
(3) Execute (shift and ALU) (integer unit)
(4) Data memory access (integer unit)
(5) Write back results (integer unit)


Integer Unit Organization
[Figure: ARM8 integer unit organization. Instructions and PC+8 enter the instruction decode stage, followed by register read, execute (ALU/shifter and multiplier), memory (write pipeline, address and write data out, read data in, rotate/sign-extend) and register write; forwarding paths feed results back, and coprocessor instructions and coprocessor data are passed alongside.]


ARM8 Macrocell
 ARM810
– 8 Kbyte unified instruction and data cache
– Copy-back
– Double-bandwidth
– MMU
– Coprocessor
– Write buffer

[Figure: ARM810 organization. The prefetch unit and ARM8 integer unit access an 8 Kbyte double-bandwidth cache using virtual addresses; CP15, the MMU (physical address), the copy-back tag and data paths, the write buffer, the address buffer and JTAG connect the core to the external data in, data out and address signals.]


StrongARM
 The first ARM processor to use a modified-Harvard
(separate instruction and data cache) architecture
and now available from Intel
 Features
– A 5-stage pipeline with register forwarding
– Single-cycle execution of all common instructions except
64-bit multiplies
– Instruction cache/copy-back data cache
– Write buffer
– Pseudo-static operation with low power consumption



StrongARM core pipeline organization
[Figure: StrongARM core pipeline organization. Stages: fetch (I-cache, next pc, pc + 4), decode (instruction decode, register read, immediate fields, branch offset and B/BL target calculation), execute (shift, ALU & multiply, forwarding paths, LDM/STM address increment), buffer/data (D-cache, load/store address, rotate/sign-extend) and write-back (register write). The paths that modify the pc are marked: B, BL, MOV pc, SUBS pc and LDR pc.]


StrongARM Processor
 SA-1110/SA-1111
– Intel SA-1 core
– 16-Kbyte instruction and 8-Kbyte data cache
– MMU, read and write buffers
– 512-byte mini-data cache



ARM9TDMI
 Harvard architecture
– Increases available memory bandwidth
• Instruction memory interface
• Data memory interface
– Simultaneous accesses to instruction and data memory
can be achieved
 5-stage pipeline
 Changes implemented to
– Improve CPI to ~1.5
– Improve maximum clock frequency
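
To see what the CPI improvement alone is worth, the usual relation execution time = instruction count × CPI / clock frequency can be evaluated for the two CPI figures quoted in these slides (about 1.9 for the ARM7TDMI and about 1.5 for the ARM9TDMI). The instruction count and the common 100 MHz clock in this Python sketch are arbitrary assumptions chosen only to make the arithmetic concrete.

```python
def execution_time_s(instruction_count, cpi, clock_hz):
    """Execution time in seconds: instructions x cycles-per-instruction / clock rate."""
    return instruction_count * cpi / clock_hz


INSTRUCTIONS = 1_000_000  # hypothetical program length
CLOCK_HZ = 100e6          # hypothetical common clock, to isolate the CPI effect

arm7_time = execution_time_s(INSTRUCTIONS, 1.9, CLOCK_HZ)
arm9_time = execution_time_s(INSTRUCTIONS, 1.5, CLOCK_HZ)
speedup = arm7_time / arm9_time

print(f"ARM7TDMI-like core: {arm7_time * 1e3:.1f} ms")
print(f"ARM9TDMI-like core: {arm9_time * 1e3:.1f} ms")
print(f"speedup from CPI alone: {speedup:.2f}x")
```

Any gain from the higher maximum clock frequency multiplies with this CPI gain.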



ARM9TDMI Organization
[Figure: ARM9TDMI organization. Stages: fetch (I-cache, next pc, pc + 4), decode (instruction decode, register read, immediate fields, pc + 8 as r15), execute (shift, ALU, mul, forwarding paths, LDM/STM address increment), buffer/data (D-cache, load/store address, byte replication, rotate/sign-extend) and write-back (register write). The paths that modify the pc are marked: B, BL, MOV pc, SUBS pc and LDR pc.]


ARM9TDMI Pipeline Operations (1/2)
[Figure: pipeline comparison]
ARM7TDMI (Fetch, Decode, Execute): instruction fetch; Thumb decompress; ARM decode; reg read; shift/ALU; reg write
ARM9TDMI (Fetch, Decode, Execute, Memory, Write): instruction fetch; decode and r. read; shift/ALU; data memory access; reg write

 The ARM9TDMI pipeline is much tighter and does not have
sufficient slack time to allow Thumb instructions to be first
translated into ARM instructions and then decoded
 It has hardware to decode both ARM and Thumb instructions
directly

ARM9TDMI Pipeline Operations (2/2)
 Coprocessor support
– Coprocessors: floating-point, digital signal processing, special-
purpose hardware accelerator
 On-chip debugger
– Additional features compared to ARM7TDMI
• Hardware single stepping
• Breakpoint can be set on exceptions
 ARM9TDMI characteristics



Cached ARM9TDMI Macrocell
 ARM920T
– ARM9TDMI
– 16KB instruction cache, 16KB data cache
– Full Memory Management Unit, Write Buffer
 ARM922T
– ARM9TDMI
– 8KB instruction cache, 8KB data cache
– Full Memory Management Unit, Write Buffer
 ARM940T
– ARM9TDMI
– 4KB instruction cache, 4KB data cache
– Protection Unit

ARM920T CPU Core



ARM9E-S Family Overview
 ARM9E-S is based on an ARM9TDMI with the following
extensions:
– Single-cycle 32*16 multiplier implementation
– EmbeddedICE logic RT
– Improved ARM/Thumb interworking
– New 32*16 and 16*16 multiply instructions (architecture v5TE)
– New count leading zero instruction
– New saturated math instructions
 ARM946E-S
– ARM9E-S core
– Instruction and data caches, selectable sizes
– Instruction and data RAMs, selectable sizes
– Protection unit
– AHB bus interface

ARM10TDMI (1/2)
 High-end ARM processor core
 Performance on the same IC process: ARM10TDMI is about 2× the
ARM9TDMI, which is about 2× the ARM7TDMI
 300 MHz, 0.25 µm CMOS
 Increase clock rate

[Figure: ARM10TDMI pipeline with six stages: Fetch, Issue, Decode, Execute, Memory, Write. Operations shown: branch prediction, instruction fetch, decode, register read, address calculation, shift/ALU, multiply, data memory access, multiplier partials add, data write and register write.]


ARM10TDMI (2/2)
 Reduce CPI
– Branch prediction
– Non-blocking load and store execution
– 64-bit data memory → transfer 2 registers in each cycle



ARM1020T Overview
 Architecture v5T
– ARM1020E will be v5TE
 CPI ~ 1.3
 6-stage pipeline
 Static branch prediction
 32KB instruction and 32KB data caches
– ‘hit under miss’ support
 64 bits per cycle LDM/STM operations
 EmbeddedICE Logic RT-II
 Support for new VFPv1 architecture
 ARM10200 test chip
– ARM1020T
– VFP10
– SDRAM memory interface
– PLL

ARM1136J(F)-S
 First Implementations of ARMv6 Architecture
– ARM1136J-S integer-only core
– ARM1136JF-S with integrated floating point
 High speed pipeline microarchitecture
– 8 stages
 System level flexibility
 Low Power
– Microarchitecture designed for low power
– Power management modes
 Availability
– Delivering to first licensees in December 2002
 The ARM11 core has been developed and integrated in
parallel with the ARM11 PrimeXsys Platform to ensure a fully
compatible, high performance, extendable system solution

Memory Hierarchy



Memory Size and Speed

Memory level (top to bottom)   Capacity   Access time   Cost
Registers                      Small      Fast          Expensive
On-chip cache memory
2nd-level off-chip cache
Main memory
Hard disk                      Large      Slow          Cheap


Caches (1/2)
 A cache memory is a small, very fast memory that
retains copies of recently used memory values.
 It is usually implemented on the same chip as the
processor.
 Caches work because programs normally display
the property of locality, which means that at any
particular time they tend to execute the same
instructions many times on the same areas of data.
 An access to an item which is in the cache is called
a hit, and an access to an item which is not in the
cache is a miss.
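
Locality, hits and misses can be made concrete with a very small cache model. The Python sketch below assumes a direct-mapped cache with 4 lines of 16 bytes, which is an arbitrary choice for illustration and not the organization of any particular ARM cache. The class name and the address trace (an 8-instruction loop executed 10 times) are hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks tags and counts hits and misses."""

    def __init__(self, num_lines=4, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up an address; on a miss, load its line. Returns True on a hit."""
        line_number = address // self.line_bytes
        index = line_number % self.num_lines
        tag = line_number // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache()
# Hypothetical trace: a loop of 8 word-sized instructions at 0x00..0x1C, run 10 times.
trace = [address for _ in range(10) for address in range(0x00, 0x20, 4)]
for address in trace:
    cache.access(address)

print("hits:", cache.hits, "misses:", cache.misses)
```

Only the first pass through the loop misses (once per cache line); every later iteration hits, which is the locality effect described above.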



Caches (2/2)
 A processor can have one of the following two
organizations:
– A unified cache
• This is a single cache for both instructions and data
– Separate instruction and data caches
• This organization is sometimes called a modified Harvard
architecture



Unified instruction and data cache
[Figure: unified instruction and data cache. The processor (with its registers) sends addresses to a single cache holding copies of instructions and copies of data; the cache is backed by a memory holding instructions and data at hexadecimal addresses 00..00 to FF..FF.]


Separate data and instruction caches
[Figure: separate data and instruction caches. The processor (with its registers) sends instruction addresses to an instruction cache holding copies of instructions, and data addresses to a data cache holding copies of data; both caches are backed by the same memory at hexadecimal addresses 00..00 to FF..FF.]

Cache Write Strategies
 Write-through
– All write operations are passed to main memory
 Write-through with buffered write
– All write operations are still passed to main memory and
the cache updated as appropriate, but instead of slowing
the processor down to main memory speed the write
address and data are stored in a write buffer which can
accept the write information at high speed.
 Copy-back (write-back)
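– The cache is not kept coherent with main memory: writes update only the
cache, and modified lines are copied back to main memory later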
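
The difference in main-memory traffic between the strategies can be sketched with a simple count. The Python model below rests on strong simplifying assumptions: every written line stays in the cache until a single final write-back, and a line is 16 bytes. The function name, the policy strings and the write trace are hypothetical.

```python
def main_memory_writes(write_addresses, policy, line_bytes=16):
    """Count main-memory write transactions for a sequence of processor writes.

    write-through: every processor write is passed on to main memory.
    copy-back:     each modified (dirty) line is written back once, assuming it
                   stays cached until a single final write-back.
    """
    if policy == "write-through":
        return len(write_addresses)
    if policy == "copy-back":
        dirty_lines = {address // line_bytes for address in write_addresses}
        return len(dirty_lines)
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical trace: 25 passes that each update the same 4 words at 0x00..0x0C.
writes = [address for _ in range(25) for address in range(0x00, 0x10, 4)]

print("write-through:", main_memory_writes(writes, "write-through"))
print("copy-back:    ", main_memory_writes(writes, "copy-back"))
```

A write buffer does not change the write-through count; it only lets the processor continue while the writes drain to memory.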



Software Development



Main Components in ADS
 ANSI C compilers – armcc and tcc
 ISO/Embedded C++ compilers – armcpp and tcpp
 ARM/Thumb assembler – armasm
 Linker – armlink
 Project management tool for Windows – CodeWarrior
 Instruction set simulator – ARMulator
 Debuggers – AXD, ADW, ADU and armsd
 Format converter – fromelf
 Librarian – armar
 ARM profiler – armprof
 C and C++ libraries
 ROM-based debug tools (ARM Firmware Suite, AFS)
 Real Time Debug and Trace support
 Support for all ARM cores and processors including ARM9E,
ARM10, Jazelle, StrongARM and Intel XScale

ADS: ARM Developer Suite

ARM C Compiler
 The compiler is compliant with the ANSI standard for C
 Supported by the appropriate library of functions
 Uses the ARM Procedure Call Standard (APCS) for all
external functions
– For procedure entry and exit
 May produce assembly source output
– Can be inspected, hand optimized and then assembled
subsequently
 Can also produce Thumb code



ARM Linker
 Takes one or more object files and combines them
 Resolves symbolic references between the object
files and extracts the object modules from libraries
 Normally the linker includes debug tables in the
output file



ARM Symbolic Debugger
 A front-end interface for debugging a program running
either under emulation (on the ARMulator) or
remotely on an ARM development board (via a serial
line or through the JTAG test interface)
 ARMsd allows an executable program to be loaded
into the ARMulator or a development board and run.
It allows the setting of
– Breakpoints: addresses in the code
– Watchpoints: memory addresses, triggered if accessed as a data
address
• Both cause execution to halt so that the processor state can be
examined



ARM Emulator: ARMulator
 A suite of programs that models the behavior of various ARM
processor cores and system architecture in software on a
host system
 Can operate at various levels of accuracy
– Instruction accurate
– Cycle accurate
– Timing accurate
 Benchmarking before hardware is available
– Instruction count or number of cycles can be measured for a
program.
– Performance analysis.
 Run software on ARMulator
– Through ARMsd or ARM GUI debuggers, e.g., AXD
– The processor core model incorporates the remote debug interface,
so the processor and the system state are visible from the ARM
symbolic debugger
– Supports a C library to allow complete C programs to run on the
simulated system



ARM Development Board
 A circuit board including an ARM core (e.g.
ARM9TDMI), memory components, I/O and
electrically programmable devices
 It can support both hardware and software
development before the final application-specific
hardware is available



ARM Integrator
 A motherboard with some extensions to support the
development of applications
 Provides core modules, logic modules (Xilinx Virtex FPGA,
Altera APEX FPGA), OS, input/output resources, bus
arbitration, interrupt handling



Summary (1/2)
 ARM Processor Family

Processor family   # of pipeline stages   Memory organization   Clock rate   MIPS/MHz
ARM6               3                      Von Neumann           25 MHz
ARM7               3                      Von Neumann           66 MHz       0.9
ARM8               5                      Von Neumann           72 MHz       1.2
ARM9               5                      Harvard               200 MHz      1.1
ARM10              6                      Harvard               400 MHz      1.25
StrongARM          5                      Harvard               233 MHz      1.15
ARM11              8                      Von Neumann/Harvard   550 MHz      1.2
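
The last two columns of the table can be combined into a rough peak throughput estimate, MIPS ≈ clock rate (MHz) × MIPS/MHz. The short Python sketch below just performs that multiplication on the table values; the dictionary layout and the function name `estimated_mips` are assumptions for illustration, and ARM6 is left out because the table gives no MIPS/MHz figure for it.

```python
# (clock rate in MHz, MIPS per MHz), copied from the summary table
CORES = {
    "ARM7": (66, 0.9),
    "ARM8": (72, 1.2),
    "ARM9": (200, 1.1),
    "ARM10": (400, 1.25),
    "StrongARM": (233, 1.15),
    "ARM11": (550, 1.2),
}


def estimated_mips(clock_mhz, mips_per_mhz):
    """Rough throughput estimate: clock rate times MIPS delivered per MHz."""
    return clock_mhz * mips_per_mhz


for name, (clock_mhz, mips_per_mhz) in CORES.items():
    print(f"{name:10s} {estimated_mips(clock_mhz, mips_per_mhz):7.1f} MIPS")
```

These are only back-of-the-envelope figures; real throughput depends on the memory system and the workload.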



Summary (2/2)
 Memory hierarchy
– Unified cache/Separate instruction and data cache
– Write-through with buffered write
 Software Development
– CodeWarrior IDE
• armcc/tcc/armcpp/tcpp
• armasm
• armlink
• armprof
– AXD (ARM eXtended Debugger)
• armsd
– ARMulator
 ARM Integrator

References
[1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[2] ARM System-on-Chip Architecture, Second Edition, by S. Furber,
Addison Wesley Longman: ISBN 0-201-67519-6.
[3] Architecture Reference Manual, Second Edition, edited by
D. Seal, Addison Wesley Longman: ISBN 0-201-73719-1.
[4] www.arm.com
