


Advanced RISC Machines

Introduction and Architecture

ARM Applications

Why ARM here?

ARM is one of the most licensed, and thus most widespread, processor cores in the world
Used especially in portable devices due to low power consumption and reasonable performance
Several interesting extensions are available, like the Thumb instruction set and the Jazelle Java machine

RISC vs. CISC Architecture

RISC                            CISC
Fixed width instructions        Variable length instructions
Few formats of instructions     Several formats of instructions
Load/Store Architecture         Memory values can be used as operands in instructions
Large Register bank             Small Register bank
Instructions are pipelinable    Pipelining is complex

RISC vs. CISC Organization

RISC                                          CISC
Hardwired instruction decode                  Microcode ROMs in instruction decoder
Single cycle execution of all instructions    Multi cycle execution of instructions

RISC Advantages
A Smaller Die Size
A Shorter Development Time
Higher Performance (a bit tricky)

Smaller things have higher natural frequencies

RISC Disadvantages
Generally poor code density (fixed-length instructions)

ARM History
ARM: Acorn RISC Machine (1983-1985)

Acorn Computers Limited, Cambridge, England

ARM: Advanced RISC Machine (1990)

ARM Limited, 1990
ARM has been licensed to many semiconductor manufacturers

Semiconductor Partners

Features used from RISC

A Load/Store Architecture
Fixed Length 32-bit Instructions
3-Address Instruction Formats

Load Store Architecture

Memory can be accessed only through two dedicated instructions


LDR ; move word from memory to register
STR ; move word from register to memory

All other instructions have to work on registers only.

3 Address Instruction Format

Example: ADD d, s1, s2 ; d := s1 + s2

Instruction format: function (f bits) | op 1 addr. (n bits) | op 2 addr. (n bits) | dest. addr. (n bits)

Features Rejected from Berkeley RISC

Single Cycle Execution of ALL Instructions

Single memory for instructions and data
Even a simple load/store will require at least two cycles
Separate data and instruction memories were the solution, but were too costly at the time
The ARM designers used the extra cycle for something useful, such as supporting auto-indexing

ARM Design Policy

ARM core uses RISC Architecture

Reduced Instruction Set
Load Store Architecture
Large number of General Purpose Registers
Parallel execution with Pipelines

But some differences from RISC

Enhanced instructions for
  THUMB state
  DSP Instructions
  Conditional Execution Instructions
  32 bit Barrel Shifter

ARM has Load Store Architecture
General Purpose Registers can hold data or address
Total of 37 Registers, each of 32 bits
There are 17 or 18 active Registers
  16 data registers
  1 or 2 status registers (CPSR, plus SPSR in the exception modes)

Registers R0 - R12 are General Purpose Registers
R13 is used as Stack Pointer (sp)
R14 is used as Link register (lr)
R15 is used as Program Counter (pc)
CPSR is Current Program Status Register
SPSR is Saved Program Status Register

Registers (3)

[Figure: program status register layout. N Z C V in bits 31-28, J in bit 24, I in bit 7, F in bit 6, T in bit 5, mode in bits 4-0.]

The program status registers
  hold information about the most recently performed ALU operation
  set the processor operating mode

Condition code flags
N = Negative result from ALU
Z = Zero result from ALU
C = ALU operation Carried out
V = ALU operation oVerflowed

Interrupt Disable bits
I = 1: Disables the IRQ.
F = 1: Disables the FIQ.

T Bit
Architecture xT only
T = 0: Processor in ARM state
T = 1: Processor in Thumb state

J bit
Architecture 5TEJ only
J = 1: Processor in Jazelle state

Mode bits
Specify the processor mode

Operation Modes
Mode                     Registers    CPSR[4:0]
User                     User         10000
FIQ                      _fiq         10001
IRQ                      _irq         10010
Supervisor Mode          _svc         10011
Abort                    _abt         10111
Undefined Instruction    _und         11011
System                   User         11111
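The mode encodings above can be checked with a small decoder. This is a minimal Python sketch, not ARM code: `decode_cpsr` and `MODES` are hypothetical names, and the bit positions assume the standard layout (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0).

```python
# Mode encodings copied from the CPSR[4:0] column of the table above.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into named fields (hypothetical helper)."""
    return {
        "N": (cpsr >> 31) & 1,  # negative
        "Z": (cpsr >> 30) & 1,  # zero
        "C": (cpsr >> 29) & 1,  # carry
        "V": (cpsr >> 28) & 1,  # overflow
        "I": (cpsr >> 7) & 1,   # IRQ disabled when 1
        "F": (cpsr >> 6) & 1,   # FIQ disabled when 1
        "T": (cpsr >> 5) & 1,   # Thumb state when 1
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


# Example value chosen for illustration: Z and C set, IRQ and FIQ disabled.
fields = decode_cpsr(0x600000D3)
print(fields)
```

With the example value, the low five bits are 10011, so the decoder reports Supervisor mode.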

Processor Modes

Banked Registers
General registers and Program Counter

User32 / System : r0-r12, r13 (sp), r14 (lr), r15 (pc)
FIQ32           : r0-r7, r8_fiq-r12_fiq, r13_fiq, r14_fiq, r15 (pc)
Supervisor32    : r0-r12, r13_svc, r14_svc, r15 (pc)
Abort32         : r0-r12, r13_abt, r14_abt, r15 (pc)
IRQ32           : r0-r12, r13_irq, r14_irq, r15 (pc)
Undefined32     : r0-r12, r13_undef, r14_undef, r15 (pc)

Program Status Registers

User32 / System : cpsr
FIQ32           : cpsr, spsr_fiq
Supervisor32    : cpsr, spsr_svc
Abort32         : cpsr, spsr_abt
IRQ32           : cpsr, spsr_irq
Undefined32     : cpsr, spsr_undef

The Memory System

Memory may be viewed as a linear array of bytes numbered from 0 to 2^32 - 1
Data items may be 8-bit bytes (B), 16-bit half-words (HW), or 32-bit words (W)
Words are always aligned at 4-byte boundaries, i.e. the two least significant address bits are zero
Half-words are aligned on even byte boundaries
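The alignment rules amount to a test on the low address bits. A minimal Python sketch; the helper name `is_aligned` and the sample addresses are illustrative only.

```python
def is_aligned(address, size):
    """True if `address` is a multiple of `size` bytes (size must be a power of two)."""
    return address & (size - 1) == 0


WORD, HALFWORD, BYTE = 4, 2, 1

for addr in (0x1000, 0x1002, 0x1003):
    print(hex(addr),
          "word" if is_aligned(addr, WORD) else "-",
          "half-word" if is_aligned(addr, HALFWORD) else "-",
          "byte" if is_aligned(addr, BYTE) else "-")
```

A word address passes only when its two low bits are zero, and a half-word address only when its lowest bit is zero.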

ARM Memory Organization

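[Figure: memory organization. Each row is one 32-bit word, bit 31 on the left and bit 0 on the right, with byte addresses 0 to 23 increasing upwards in rows of four. Labelled items include byte0, byte1, half-word4, word8 and half-word12.]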

Memory Formats
Little Endian (Default)
Big Endian
ARM supports Both

                 bit 31                      bit 0
little-Endian    byte 3  byte 2  byte 1  byte 0
big-Endian       byte 0  byte 1  byte 2  byte 3
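The two byte orders can be seen by laying out one 32-bit word both ways. A minimal Python sketch using only the built-in `int.to_bytes` and `int.from_bytes`; the value 0x11223344 is an arbitrary example.

```python
value = 0x11223344

# Index 0 of each bytes object is the byte stored at the lowest address.
little = value.to_bytes(4, "little")
big = value.to_bytes(4, "big")

print("little-endian:", little.hex())  # least significant byte first
print("big-endian:   ", big.hex())     # most significant byte first

# Reading the bytes back with the matching order recovers the same word.
assert int.from_bytes(little, "little") == value
assert int.from_bytes(big, "big") == value
```

In the little-endian layout the least significant byte (0x44) sits at the lowest address; in the big-endian layout the most significant byte (0x11) does.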

ARM Exceptions
ARM supports a range of Interrupts, Traps and Supervisor Calls, all grouped under the general heading of Exceptions

Generated by internal and external events
Supports 7 types of exceptions

Reset : only in Supervisor mode
Software Interrupt : in Supervisor mode
IRQ : on IRQ interrupt
FIQ : on FIQ interrupt
Data Abort : in Abort mode
Undefined Instruction : in Undefined mode
Prefetch Abort : in Abort mode

Vector Addresses
0x00000000    Reset
0x00000004    Undefined Instruction
0x00000008    Software Interrupt
0x0000000C    Prefetch Abort
0x00000010    Data Abort
0x00000014    Reserved
0x00000018    IRQ
0x0000001C    FIQ

Exception Priorities
1. Reset (Highest Priority)
2. Data Abort
3. FIQ
4. IRQ
5. Prefetch Abort
6. SWI, Undefined
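The vector addresses and the priority order combine into a simple lookup. A minimal Python sketch of that idea, not of the hardware itself: `VECTORS`, `PRIORITY` and `next_exception` are hypothetical names. SWI and Undefined share the lowest priority, so their relative order in the list is arbitrary.

```python
# Vector addresses from the Vector Addresses slide.
VECTORS = {
    "Reset": 0x00,
    "Undefined Instruction": 0x04,
    "Software Interrupt": 0x08,
    "Prefetch Abort": 0x0C,
    "Data Abort": 0x10,
    "IRQ": 0x18,
    "FIQ": 0x1C,
}

# Highest priority first, following the Exception Priorities slide.
PRIORITY = [
    "Reset",
    "Data Abort",
    "FIQ",
    "IRQ",
    "Prefetch Abort",
    "Software Interrupt",      # shares priority 6 with Undefined Instruction
    "Undefined Instruction",
]


def next_exception(pending):
    """Return (name, vector address) of the highest-priority pending exception."""
    for name in PRIORITY:
        if name in pending:
            return name, VECTORS[name]
    return None


print(next_exception({"IRQ", "Data Abort"}))
print(next_exception({"IRQ", "FIQ"}))
```

When an IRQ and a Data Abort are pending together, the Data Abort is selected and its vector is 0x10.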

ARM RoadMap

ARM Processor Families

Version 4 Supports

Thumb : 16 bit compressed instruction set
Debug : On chip debug support
Enhanced Multiply : higher performance, long multiply
Embedded ICE hardware
32 bit data bus
Data size can be byte, half word or word
Words : 4 byte aligned
Half word : 2 byte aligned

Von Neumann Architecture

ARM core dataflow model

[Figure: ARM core dataflow. Labelled blocks: Data in, Instruction Decoder, Sign Extend, Register File (r0-r15, with r15 as pc), buses A (Rn), B (Rm) and Acc, Barrel Shifter (output N), MAC, ALU, Result bus to Rd, Address Register, Incrementer, Address out.]


Operating States
Supports 2 instruction sets

ARM : 32 bit instruction set
Thumb : 16 bit instruction set

Thumb State
Subset of the ARM instructions
Higher code density (35% reduction)
Better performance than 16 bit processors
Suitable for use with 16 bit memory devices (160% better performance)
Transparently decompressed to 32 bit instructions

ARM State
Able to access large memories more efficiently
32 bit integer arithmetic in a single cycle
Larger number of instructions
Better performance

Switching States
ARM to Thumb

Execute the BX instruction with state bit = 1

Thumb to ARM

Execute the BX instruction with state bit = 0
An interrupt or exception occurs
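The state bit used by BX is bit 0 of the register operand; the remaining bits give the branch target. A minimal Python sketch of that rule, where `bx` is a hypothetical helper and the addresses are made up for illustration.

```python
def bx(rm):
    """Model BX Rm: bit 0 selects the state, the other bits form the target address."""
    state = "Thumb" if rm & 1 else "ARM"
    target = rm & 0xFFFFFFFE  # bit 0 is not part of the address
    return state, target


print(bx(0x00008001))  # odd operand: continue in Thumb state
print(bx(0x00008000))  # even operand: continue in ARM state
```

Both calls branch to address 0x8000; only the resulting state differs.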

Which State to use

Low memory system : use Thumb
16 bit memory : use Thumb
Performance is critical : use ARM
  Example : in execution of interrupt routines
Performance is critical AND memory is low : use both ARM and Thumb
  Example : in interrupt routines

Advanced RISC Machines

ARM Instruction Set

ARM Features
Load-Store Architecture
3-Address Data Processing Instructions
Conditional Execution of Instructions
Powerful Load/Store Multiple Register Instructions
General Shift Operation (Single Cycle)
Extension of Instruction Set through Co-processors
16-bit Compressed Instruction Set

Types of Instructions
Data Processing Instructions
Data Transfer Instructions
Control Flow Instructions

Conditional Execution
Mnemonic    Name                               Condition Flags
EQ          equal                              Z
NE          not equal                          z
CS HS       carry set/unsigned higher or same  C
CC LO       carry clear/unsigned lower         c
MI          minus/negative                     N
PL          plus/positive or zero              n
VS          overflow                           V
VC          no overflow                        v
HI          unsigned higher                    zC
LS          unsigned lower or same             Z or c
GE          signed greater than or equal       NV or nv
LT          signed less than                   Nv or nV
GT          signed greater than                NzV or nzv
LE          signed less than or equal          Z or Nv or nV
AL          always (unconditional)             ignored

Upper case means the flag is set; lower case means the flag is clear.

C-DAC, Hyderabad, 8/17/2011
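The condition table can be expressed as predicates over the N, Z, C and V flags. A minimal Python sketch: `condition_passed` and `cmp_flags` are hypothetical helpers, and `cmp_flags` assumes the usual ARM convention that C is set after a subtraction when no borrow occurs.

```python
# One predicate per mnemonic; each takes the flags as booleans.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,
    "NE": lambda n, z, c, v: not z,
    "CS": lambda n, z, c, v: c,
    "CC": lambda n, z, c, v: not c,
    "MI": lambda n, z, c, v: n,
    "PL": lambda n, z, c, v: not n,
    "VS": lambda n, z, c, v: v,
    "VC": lambda n, z, c, v: not v,
    "HI": lambda n, z, c, v: c and not z,
    "LS": lambda n, z, c, v: z or not c,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: (not z) and n == v,
    "LE": lambda n, z, c, v: z or n != v,
    "AL": lambda n, z, c, v: True,
}
CONDITIONS["HS"] = CONDITIONS["CS"]  # alternative names for the same tests
CONDITIONS["LO"] = CONDITIONS["CC"]


def condition_passed(cond, n, z, c, v):
    """True if an instruction with condition `cond` would execute."""
    return bool(CONDITIONS[cond](bool(n), bool(z), bool(c), bool(v)))


def cmp_flags(a, b):
    """Flags (N, Z, C, V) after comparing two 32-bit values, i.e. computing a - b."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                       # carry set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31    # signed overflow on subtraction
    return n, z, c, v


flags = cmp_flags(5, 7)
print(flags, condition_passed("LT", *flags), condition_passed("LO", *flags))
```

Comparing 0xFFFFFFFF with 1 shows why signed and unsigned conditions differ: the first operand is lower as a signed value (LT passes) but higher as an unsigned value (HI passes).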

Data Processing Operations

Arithmetic Operations
Bit-wise Operations
Register Movement Operations
Comparison Operations

Arithmetic Operations
ADD r0, r1, r2 ; r0 := r1 + r2
ADC r0, r1, r2 ; r0 := r1 + r2 + C
SUB r0, r1, r2 ; r0 := r1 - r2
SBC r0, r1, r2 ; r0 := r1 - r2 + C - 1
RSB r0, r1, r2 ; r0 := r2 - r1
RSC r0, r1, r2 ; r0 := r2 - r1 + C - 1

Bit-wise Logical Operations

AND r0, r1, r2 ; r0 := r1 and r2
ORR r0, r1, r2 ; r0 := r1 or r2
EOR r0, r1, r2 ; r0 := r1 xor r2
BIC r0, r1, r2 ; r0 := r1 and (not) r2

Register Movement Operations

MOV r0, r2 ; r0 := r2
MVN r0, r2 ; r0 := not r2

Comparison Operations
CMP r1, r2 ; set cc on r1 - r2
CMN r1, r2 ; set cc on r1 + r2
TST r1, r2 ; set cc on r1 and r2
TEQ r1, r2 ; set cc on r1 xor r2

Immediate Operands
If we need to add a constant

ADD r3, r3, #1 ; r3 := r3 + 1
AND r8, r7, #&ff ; r8 := r7[7:0]

Shift Register Operands

The second register operand can be shifted before it is combined with the first operand

ADD r3, r2, r1, LSL #3 ; r3 := r2 + (r1 * 8)

ARM Shift Operations

LSL - Logical Shift Left
LSR - Logical Shift Right
ASL - Arithmetic Shift Left
ASR - Arithmetic Shift Right
ROR - Rotate Right
RRX - Rotate Right Extended


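[Figure: shift operations on a 32-bit word, bit 31 to bit 0. Panels: LSL #5 (low five bits filled with 00000), LSR #5 (high five bits filled with 00000), ASR #5 with a positive operand (high bits filled with 0), ASR #5 with a negative operand (high bits filled with 1), ROR #5, and RRX (rotate through the C flag).]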
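The shift operations can be modelled on 32-bit values as follows. A minimal Python sketch: the function names mirror the mnemonics but are plain illustrative helpers, and shift amounts are assumed to lie in 0 to 31. ASL is omitted because it produces the same result as LSL.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits


def lsl(x, n):
    """Logical shift left: zeros enter at bit 0."""
    return (x << n) & MASK


def lsr(x, n):
    """Logical shift right: zeros enter at bit 31."""
    return (x & MASK) >> n


def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter at bit 31."""
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n


def ror(x, n):
    """Rotate right: bits leaving bit 0 re-enter at bit 31."""
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK


def rrx(x, carry_in):
    """Rotate right by one through carry; returns (result, carry_out)."""
    x &= MASK
    return (carry_in << 31) | (x >> 1), x & 1


print(hex(lsl(1, 5)), hex(lsr(0x80000000, 5)), hex(asr(0x80000000, 5)))
print(hex(ror(0x0000001F, 5)), rrx(0x3, 1))
```

The difference between LSR and ASR shows only for negative operands, where ASR fills the vacated high bits with ones.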


Shift Value in Register

It is also possible to use a register value to specify the number of bits the second operand should be shifted by:

ADD r5, r5, r3, LSL r2 ; r5 := r5 + r3 * 2^r2

Setting the Condition Codes

All data processing instructions (DPI) can affect the condition codes
For all DPI except comparisons, a special request needs to be made
At assembly level the request is made by adding an S to the opcode

E.g.: ADDS r0, r0, r1 ; add the low words and set C on carry out
      ADC r3, r3, r2 ; add the high words plus the carry
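An ADDS followed by ADC is the usual way to add values wider than 32 bits: the carry out of the low words feeds the high words. A minimal Python sketch of that pairing, with `adds`, `adc` and `add64` as hypothetical helper names.

```python
MASK = 0xFFFFFFFF


def adds(a, b):
    """32-bit add that also returns the carry flag, like ADDS."""
    total = (a & MASK) + (b & MASK)
    return total & MASK, total >> 32


def adc(a, b, carry):
    """32-bit add with carry in, like ADC."""
    return ((a & MASK) + (b & MASK) + carry) & MASK


def add64(lo_a, hi_a, lo_b, hi_b):
    """Add two 64-bit values held as (low word, high word) pairs."""
    lo, carry = adds(lo_a, lo_b)   # low words; carry out goes to C
    hi = adc(hi_a, hi_b, carry)    # high words plus C
    return lo, hi


print([hex(w) for w in add64(0xFFFFFFFF, 0x0, 0x1, 0x0)])
```

Adding 1 to 0x00000000FFFFFFFF overflows the low word, and the carry shows up as 1 in the high word.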

MUL r4, r3, r2 ; r4 := r3 * r2 (low 32 bits)

Some Rules

Immediate second operand not supported
The result register must not be the same as the first source register

The basic ARM provides two multiplication instructions.

MUL{<cond>}{S} Rd, Rm, Rs ; Rd = Rm * Rs

Multiply Accumulate - does addition for free

MLA{<cond>}{S} Rd, Rm, Rs,Rn ; Rd = (Rm * Rs) + Rn

Multiply-Long and Multiply-Accumulate Long

Instructions are

MULL which gives RdHi,RdLo := Rm * Rs
MLAL which gives RdHi,RdLo := (Rm * Rs) + RdHi,RdLo

However, the full 64 bits of the result now matter (the lower-precision multiply instructions simply throw the top 32 bits away)

Need to specify whether operands are signed or unsigned
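Signedness matters for the long multiply because it changes the high word of the 64-bit product, while the low word is the same either way. A minimal Python sketch: `umull`, `smull` and `mul` are illustrative helpers named after the unsigned, signed and 32-bit multiply forms, not a model of instruction timing or flags.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def mul(rm, rs):
    """32-bit multiply: only the low 32 bits of the product are kept."""
    return ((rm & MASK) * (rs & MASK)) & MASK


def umull(rm, rs):
    """Unsigned long multiply: returns (RdHi, RdLo)."""
    product = (rm & MASK) * (rs & MASK)
    return (product >> 32) & MASK, product & MASK


def smull(rm, rs):
    """Signed long multiply: returns (RdHi, RdLo)."""
    product = (to_signed(rm) * to_signed(rs)) & ((1 << 64) - 1)
    return (product >> 32) & MASK, product & MASK


print([hex(w) for w in umull(0xFFFFFFFF, 2)])
print([hex(w) for w in smull(0xFFFFFFFF, 2)])
```

With 0xFFFFFFFF as an operand, the unsigned form treats it as 4294967295 and the signed form as -1, so the high words differ.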

Loading full 32 bit constants

Although the MOV/MVN mechanism will load a large range of constants into a register, sometimes this mechanism will not generate the required constant. Therefore, the assembler also provides a method which will load ANY 32 bit constant:

LDR rd,=numeric constant

If the constant can be constructed using either a MOV or MVN then this will be the instruction actually generated. Otherwise, the assembler will produce an LDR instruction with a PC-relative address to read the constant from a literal pool.

LDR r0,=0x42 ; generates MOV r0,#0x42
LDR r0,=0x55555555 ; generates LDR r0,[pc, offset to lit pool]

As this mechanism will always generate the best instruction for a given case, it is the recommended way of loading constants.
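The choice the assembler makes can be sketched by testing whether a constant fits the data-processing immediate format. The slides do not spell that format out; the sketch assumes the standard ARM encoding, an 8-bit value rotated right by an even number of bits. `is_arm_immediate` and `choose_load` are hypothetical helpers, not assembler functions.

```python
MASK = 0xFFFFFFFF


def rol(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK


def is_arm_immediate(value):
    """True if value equals an 8-bit constant rotated right by an even amount."""
    value &= MASK
    # Undo each possible even rotation and see whether 8 bits remain.
    return any(rol(value, rot) <= 0xFF for rot in range(0, 32, 2))


def choose_load(value):
    """Name the instruction an LDR rd,=value pseudo-instruction could become."""
    value &= MASK
    if is_arm_immediate(value):
        return "MOV"
    if is_arm_immediate(~value & MASK):
        return "MVN"
    return "LDR (literal pool)"


for constant in (0x42, 0xFF000000, 0xFFFFFFFF, 0x55555555):
    print(hex(constant), "->", choose_load(constant))
```

This agrees with the two cases on the slide: 0x42 fits a MOV, while 0x55555555 fits neither MOV nor MVN and falls back to the literal pool.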

Branch Instructions
Branch                       : B{<cond>} label
Branch with Link             : BL{<cond>} label
Branch Exchange              : BX{<cond>} Rm
Branch Exchange with Link    : BLX{<cond>} Rm

Load / Store Instructions

Single Register Load & Store

transfer of a data item (byte, half-word, word) between ARM registers and memory

Multiple Register Load & Store

enable transfer of large quantities of data
used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory

Single register data transfer

The basic load and store instructions are:

Load and Store Word or Byte

LDR / STR / LDRB / STRB

ARM Architecture Version 4 also adds support for halfwords and signed data.

Load and Store Halfword

LDRH / STRH

Load Signed Byte or Halfword - load value and sign extend it to 32 bits.

LDRSB / LDRSH

All of these instructions can be conditionally executed by inserting the appropriate condition code after STR / LDR.

e.g. LDREQB

Syntax: <LDR|STR>{<cond>}{<size>} Rd, <address>


Register-Indirect Addressing

LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0

Note: r1 keeps a word address (2 LSBs are 0)

Pre Indexed Addressing

LDR r0, [r1, #4] ; r0 := mem32[r1 + 4]

Post Indexed Addressing

LDR r0, [r1], #4 ; r0 := mem32[r1]
                 ; r1 := r1 + 4

Auto Indexing Addressing

LDR r0, [r1, #4]! ; r0 := mem32[r1 + 4]
                  ; r1 := r1 + 4
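The three indexed forms differ in which address is used for the access and whether the base register is updated. A minimal Python sketch over a little-endian byte array: `load_word`, `ldr_pre` and `ldr_post` are hypothetical helpers, and the memory contents are made up for the example.

```python
def load_word(mem, addr):
    """Read a 32-bit little-endian word from a bytearray."""
    return int.from_bytes(mem[addr:addr + 4], "little")


def ldr_pre(mem, regs, rd, rn, offset, writeback=False):
    """LDR rd, [rn, #offset] and, with writeback, LDR rd, [rn, #offset]!"""
    addr = regs[rn] + offset
    regs[rd] = load_word(mem, addr)
    if writeback:
        regs[rn] = addr  # auto-indexing: base register keeps the new address


def ldr_post(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: access at the base, then update the base."""
    regs[rd] = load_word(mem, regs[rn])
    regs[rn] += offset


# Two example words at addresses 0 and 4.
mem = bytearray((0x11111111).to_bytes(4, "little") + (0x22222222).to_bytes(4, "little"))

regs = {"r0": 0, "r1": 0}
ldr_pre(mem, regs, "r0", "r1", 4)
print("pre-indexed: ", hex(regs["r0"]), regs["r1"])

regs = {"r0": 0, "r1": 0}
ldr_post(mem, regs, "r0", "r1", 4)
print("post-indexed:", hex(regs["r0"]), regs["r1"])

regs = {"r0": 0, "r1": 0}
ldr_pre(mem, regs, "r0", "r1", 4, writeback=True)
print("auto-indexed:", hex(regs["r0"]), regs["r1"])
```

Only the post-indexed form reads from the original base address; only the post-indexed and auto-indexed forms leave r1 changed.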

Direct functionality of Block Data Transfer

When LDM / STM are not being used to implement stacks, it is clearer to specify exactly what the functionality of the instruction is:

i.e. specify whether to increment / decrement the base pointer, before or after the memory access.

In order to do this, LDM / STM support a further syntax in addition to the stack one:

STMIA / LDMIA : Increment After
STMIB / LDMIB : Increment Before
STMDA / LDMDA : Decrement After
STMDB / LDMDB : Decrement Before

Stack Operation
Traditionally, a stack grows down in memory, with the last pushed value at the lowest address. The ARM also supports ascending stacks, where the stack structure grows up through memory. The value of the stack pointer can either:

Point to the last occupied address (Full stack)

and so needs pre-decrementing (i.e. before the push)

Point to the next free address, the one the next push will occupy (Empty stack)

and so needs post-decrementing (i.e. after the push)

The stack type to be used is given by the postfix to the instruction:

STMFD / LDMFD : Full Descending stack
STMFA / LDMFA : Full Ascending stack
STMED / LDMED : Empty Descending stack
STMEA / LDMEA : Empty Ascending stack

Note: ARM Compiler will always use a Full descending stack.

Stack Examples
STMFD sp!, {r0,r1,r3-r5}
STMED sp!, {r0,r1,r3-r5}
STMFA sp!, {r0,r1,r3-r5}
STMEA sp!, {r0,r1,r3-r5}

[Figure: memory layout after each of the four instructions, showing the old SP, the new SP, and the stored registers r5, r4, r3, r1, r0 from higher to lower addresses.]
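A full descending push and pop can be modelled with a dictionary standing in for memory. A minimal Python sketch: `stmfd` and `ldmfd` are hypothetical helpers, the stack pointer value 0x1000 is made up, and the sketch relies on the standard rule that the lowest-numbered register goes to the lowest address.

```python
def stmfd(mem, sp, values):
    """Push register values (ascending register order) onto a full descending stack."""
    sp -= 4 * len(values)            # pre-decrement: make room first
    for i, value in enumerate(values):
        mem[sp + 4 * i] = value      # lowest register ends up at the lowest address
    return sp                        # sp points at the last occupied word


def ldmfd(mem, sp, count):
    """Pop `count` words from a full descending stack; returns (values, new sp)."""
    values = [mem[sp + 4 * i] for i in range(count)]
    return values, sp + 4 * count


mem = {}
# Stand-in values for r0, r1, r3, r4, r5.
sp = stmfd(mem, 0x1000, [10, 11, 13, 14, 15])
print(hex(sp), {hex(a): v for a, v in sorted(mem.items())})

values, sp = ldmfd(mem, sp, 5)
print(values, hex(sp))
```

After the push the stack pointer has moved down by five words, and the matching pop restores both the values and the original stack pointer.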




PSR Transfer Instructions

MRS and MSR allow the contents of the CPSR/SPSR to be transferred between the appropriate status register and a general purpose register.

All of the status register, or just the flags, can be transferred.

MRS{<cond>} Rd,<psr> ; Rd = <psr>
MSR{<cond>} <psr>,Rm ; <psr> = Rm
MSR{<cond>} <psrf>,Rm ; <psrf> = Rm

<psr> = CPSR, CPSR_all, SPSR or SPSR_all
<psrf> = CPSR_flg or SPSR_flg

Also an immediate form

MSR{<cond>} <psrf>,#Immediate

This immediate must be a 32-bit immediate, of which the 4 most significant bits are written to the flag bits.

Write a short code segment that performs a mode change by modifying the contents of the CPSR

The mode you should change to is user mode, which has the value 0x10. This assumes that the current mode is a privileged mode such as supervisor mode. This would happen, for instance, when the processor is reset: reset code would be run in supervisor mode, which would then need to switch to user mode before calling the main routine in your application. You will need to use MSR and MRS, plus 2 logical operations.




Set up useful constants:

mmask EQU 0x1f ; mask to clear mode bits
userm EQU 0x10 ; user mode value

Start off here in supervisor mode.

MRS r0, cpsr ; take a copy of the CPSR
BIC r0, r0, #mmask ; clear the mode bits
ORR r0, r0, #userm ; select new mode
MSR cpsr, r0 ; write back the modified CPSR

End up here in user mode.
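The two logical operations in the exercise can be checked on plain integers. A minimal Python sketch of the BIC and ORR steps: `switch_to_user` is a hypothetical helper, and the starting CPSR value is made up for illustration.

```python
MODE_MASK = 0x1F  # same value as mmask
USER_MODE = 0x10  # same value as userm


def switch_to_user(cpsr):
    """Return the CPSR value with the mode field replaced by user mode."""
    value = cpsr & ~MODE_MASK & 0xFFFFFFFF  # BIC: clear the mode bits
    value |= USER_MODE                       # ORR: select the new mode
    return value


before = 0x600000D3  # example: supervisor mode, some flags set
after = switch_to_user(before)
print(hex(before), "->", hex(after))
```

Only the low five bits change; the flag and interrupt-disable bits are carried over untouched.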

Main features of the ARM Instruction Set

All instructions are 32 bits long.
Most instructions execute in a single cycle.
Every instruction can be conditionally executed.
A load/store architecture
  Data processing instructions act only on registers
    Three operand format
    Combined ALU and shifter for high speed bit manipulation
  Specific memory access instructions with powerful auto-indexing addressing modes.
    32 bit and 8 bit data types and also 16 bit data types on ARM Architecture v4.
    Flexible multiple register load and store instructions
Instruction set extension via coprocessors

In today's systems, the key is not raw processor speed but total effective system performance and power consumption

Done with Day 1