Lecture6 ARM

ARM ORGANISATION
• Computer architecture is an abstract model: it
comprises those attributes that are visible to the
programmer, such as the instruction set, the number
of bits used for data, and the addressing techniques.
• A computer's organization expresses the
realization of the architecture, that is, how features
are implemented: these registers, those data paths,
or this connection to memory. Computer organization
(CO) covers the ALU, the CPU, and the memory and its
organization.
• Computer architecture refers to those
attributes of a system that are visible to a
programmer and have a direct impact on the logical
execution of a program.
• Computer organisation refers to the operational
units and their interconnections that realize the
architectural specification.
• EXAMPLE 1:
• Suppose you work at a company that manufactures cars. The design and
overall details of the car correspond to computer architecture (the
abstract, programmer's view), while making its parts piece by piece and
connecting the different components together, keeping the basic design in
mind, corresponds to computer organization (physical and visible).
• EXAMPLE 2:
• Both Intel and AMD processors implement the same x86 architecture, but
how the two companies implement that architecture (their computer
organizations) is usually very different. The same programs run correctly
on both, because the architecture is the same, but they may run at
different speeds, because the organizations are different.
Pipeline stages (for different families of ARM
processors)
3-stage pipeline ARM organization
• The register bank, which stores the processor state.
• The barrel shifter, which can shift or rotate one
operand by any number of bits.
• The ALU, which performs the arithmetic and logic
functions required by the instruction set.
• The address register and incrementer, which select
and hold all memory addresses and generate sequential
addresses when required.
• The data registers, which hold data passing to and
from memory.
1. In a single-cycle data processing instruction, two
register operands are accessed, the value on the B
bus is shifted and combined with the value on the A
bus in the ALU, and the result is then written back
into the register bank.
2. The program counter value is in the address register,
from where it is fed into the incrementer. The
incremented value is copied back into r15 in the
register bank and also into the address register, to be
used as the address for the next instruction fetch if
needed.
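The two steps above can be modelled in a few lines of Python. This is only an illustrative sketch: the function names, the restriction to a logical-left shift, and the small ALU table are assumptions made for the example, not a description of the real ARM datapath.

```python
MASK32 = 0xFFFFFFFF  # ARM registers are 32 bits wide


def data_processing(regs, rd, rn, rm, shift_left=0, op="add"):
    """Model one single-cycle data-processing instruction.

    regs is a list of 16 register values (r15 is the program counter).
    rn drives the A bus; rm drives the B bus and passes through the shifter.
    """
    a_bus = regs[rn]
    b_bus = (regs[rm] << shift_left) & MASK32   # barrel shifter (LSL only here)
    alu = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "and": lambda a, b: a & b,
        "orr": lambda a, b: a | b,
    }[op]
    regs[rd] = alu(a_bus, b_bus) & MASK32       # write back to the register bank


def next_fetch_address(regs, address_register):
    """Increment the address, copy it into r15, and return the new address."""
    incremented = (address_register + 4) & MASK32   # ARM instructions are 4 bytes
    regs[15] = incremented
    return incremented


regs = [0] * 16
regs[1], regs[2] = 10, 3
data_processing(regs, rd=0, rn=1, rm=2, shift_left=2)   # like ADD r0, r1, r2, LSL #2
addr = next_fetch_address(regs, 0x8000)
print(regs[0], hex(addr))   # prints: 22 0x8004
```

The shift is applied only to the B-bus operand, which mirrors the position of the barrel shifter in the description above.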
The 3-stage pipeline
ARM processors up to the ARM7 employ a simple 3-stage
pipeline with the following stages:
1. Fetch
2. Decode
3. Execute
ARM single-cycle instruction 3-stage pipeline operation
1. When the processor is executing simple data
processing instructions, the pipeline enables one
instruction to be completed every clock cycle.
2. An individual instruction takes three clock
cycles to complete, so it has a three-cycle
latency, but the throughput is one instruction
per cycle.
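The latency and throughput figures above can be seen by listing which instruction occupies each stage in each cycle. The sketch below assumes an ideal pipeline with no stalls; `pipeline_chart` and `total_cycles` are hypothetical helper names used only for this illustration.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_chart(n_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction number}} for a stall-free pipeline."""
    chart = {}
    for i in range(n_instructions):          # instruction i+1 is fetched in cycle i+1
        for s, stage in enumerate(stages):
            chart.setdefault(i + 1 + s, {})[stage] = i + 1
    return chart


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return n_stages + n_instructions - 1


chart = pipeline_chart(4)
for cycle in sorted(chart):
    cells = [f"{st}=I{chart[cycle][st]}" for st in STAGES if st in chart[cycle]]
    print(f"cycle {cycle}:", ", ".join(cells))
print("total cycles:", total_cycles(4))   # prints: total cycles: 6
```

From cycle 3 onward, one instruction leaves the Execute stage every cycle, even though each individual instruction still spends three cycles in the pipeline.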
3-stage pipeline operation
ARM Multi Cycle instruction
Multiple register data transfer instructions
Example of ldmia (load multiple, increment after):
ldmia r9, {r0-r3}    @ register 9 holds the base address
This has the same effect as four separate ldr instructions:
ldr r0, [r9]
ldr r1, [r9, #4]
ldr r2, [r9, #8]
ldr r3, [r9, #12]
Note: at the end of the ldmia instruction, register r9 has not
been changed. To update r9 with the final address, add the
write-back suffix "!" to the base register, for example:
ldmia r9!, {r0,r2,r5}
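The behaviour of ldmia with and without write-back can be sketched in Python. This is a simplified model under stated assumptions: memory is a dictionary of word-aligned addresses, the `ldmia` function and its `writeback` flag are hypothetical names, and corner cases such as the base register appearing in the register list are ignored.

```python
def ldmia(regs, memory, base, reg_list, writeback=False):
    """Model LDMIA: load consecutive words starting at the address in regs[base].

    memory maps word-aligned byte addresses to 32-bit values.
    Registers are loaded in ascending register-number order.
    """
    address = regs[base]
    for r in sorted(reg_list):
        regs[r] = memory[address]
        address += 4              # increment after each transfer
    if writeback:                 # corresponds to the "!" suffix in assembly
        regs[base] = address


memory = {0x1000: 11, 0x1004: 22, 0x1008: 33, 0x100C: 44}
regs = [0] * 16
regs[9] = 0x1000

ldmia(regs, memory, base=9, reg_list=[0, 1, 2, 3])         # ldmia r9, {r0-r3}
print(regs[0:4], hex(regs[9]))    # prints: [11, 22, 33, 44] 0x1000

ldmia(regs, memory, base=9, reg_list=[0, 2, 5], writeback=True)   # ldmia r9!, {r0,r2,r5}
print(regs[0], regs[2], regs[5], hex(regs[9]))   # prints: 11 22 33 0x100c
```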
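Multiple register data transfer instructions
ldmia – Example
ldmia r9, {r0-r3, r12}
• Load the words addressed by r9 into r0, r1, r2, r3, and r12.
• The transfer address is incremented after each load; r9 itself
is unchanged because there is no "!" suffix.
Example 3
ldmia r9, {r5, r3, r0-r2, r14}
• Load the words addressed by r9 into registers r0, r1, r2, r3,
r5, and r14. The order in which the registers are written in the
list does not matter: the lowest-numbered register is always
loaded from the lowest address.
• The transfer address is incremented after each load.
• ldmib, ldmda, and ldmdb work similarly to ldmia.
• Stores work in an analogous manner to loads.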
Store Multiples
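Load and Store Multiples
LDMxx r10, {r0,r1,r4}
STMxx r10, {r0,r1,r4}
Memory locations used by each addressing mode (addresses increase
upward; the base register Rb is r10):

Address     IA    IB    DA    DB
r10 + 12          r4
r10 + 8     r4    r1
r10 + 4     r1    r0
r10 (Rb)    r0          r4
r10 - 4                 r1    r4
r10 - 8                 r0    r1
r10 - 12                      r0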
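The four addressing modes differ only in where the lowest transfer address lies relative to the base register. The following sketch computes the address used for each register; `transfer_addresses` is a hypothetical helper written for this illustration, and the base value 0x2000 is arbitrary.

```python
def transfer_addresses(base, reg_list, mode):
    """Return {register: address} for LDM/STM with mode 'IA', 'IB', 'DA' or 'DB'."""
    regs = sorted(reg_list)          # lowest register always uses the lowest address
    n = len(regs)
    if mode == "IA":                 # increment after
        lowest = base
    elif mode == "IB":               # increment before
        lowest = base + 4
    elif mode == "DA":               # decrement after
        lowest = base - 4 * (n - 1)
    elif mode == "DB":               # decrement before
        lowest = base - 4 * n
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {r: lowest + 4 * i for i, r in enumerate(regs)}


base = 0x2000                        # arbitrary value standing in for r10
for mode in ("IA", "IB", "DA", "DB"):
    layout = transfer_addresses(base, [0, 1, 4], mode)
    print(mode, {f"r{r}": hex(a) for r, a in layout.items()})
```

For the register list {r0,r1,r4}, the IA mode places r0 at the base address, while the DB mode places r4 just below it.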
The mapping between the stack and block copy views
of the load and store multiple instructions
LDMFD == restore from stack (same operation as LDMIA)
STMFD == save registers onto stack (same operation as STMDB)
1. To overcome the performance limits of the 3-stage pipeline,
higher-performance ARM cores employ a 5-stage pipeline and have
separate instruction and data memories.
2. Breaking instruction execution down into five components
rather than three reduces the maximum work which must be
completed in a clock cycle, and hence allows a higher clock
frequency to be used.
3. The separate instruction and data memories allow a
significant reduction in the core's CPI.
Recall - ARM family 7 and 9
5-stage pipeline ARM organization
The time $T_{prog}$ required to execute a given program is given by:
$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$
where
$N_{inst}$ - number of ARM instructions executed in the course of the program
$CPI$ - average number of clock cycles per instruction
$f_{clk}$ - processor's clock frequency
Since $N_{inst}$ is constant for a given program (compiled with a
given compiler using a given set of optimizations, and so on),
there are only two ways to increase performance:
1. Increase the clock rate, $f_{clk}$.
• This requires the logic in each pipeline stage to be
simplified and, therefore, the number of pipeline stages
to be increased.
2. Reduce the average number of clock cycles per instruction,
CPI.
• This requires either that instructions which occupy more
than one pipeline slot in a 3-stage pipeline ARM are
re-implemented to occupy fewer slots, or that pipeline
stalls caused by dependencies between instructions are
reduced, or a combination of both.
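The effect of the two options can be checked numerically with the execution-time formula. The instruction count, CPI values, and clock frequencies below are hypothetical figures chosen only for illustration; they are not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """T_prog = (N_inst * CPI) / f_clk, in seconds."""
    return n_inst * cpi / f_clk_hz


# Hypothetical figures, for illustration only.
N_INST = 1_000_000

baseline = execution_time(N_INST, cpi=2.0, f_clk_hz=50e6)
faster_clock = execution_time(N_INST, cpi=2.0, f_clk_hz=100e6)   # option 1: raise f_clk
lower_cpi = execution_time(N_INST, cpi=1.5, f_clk_hz=50e6)       # option 2: reduce CPI

for name, t in [("baseline", baseline),
                ("faster clock", faster_clock),
                ("lower CPI", lower_cpi)]:
    print(f"{name}: {t * 1e3:.1f} ms")
```

With these numbers, doubling the clock halves the run time, while cutting the CPI from 2.0 to 1.5 reduces it by a quarter.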
Instruction Execution
Store Instruction
Branch Instruction
Write the instructions required, and show the pipeline stages for each instruction, to perform the following operation:
a = b + c
• Running this code segment will need some forwarding.
• However, a load (LW) followed immediately by an ALU instruction
(Add or Sub) that uses the loaded value creates a hazard that cannot
be resolved by forwarding alone.
• So the pipeline will stall. Observe that in time steps 4, 5, and 6
there are two forwards from the data memory unit to the ALU in
the EX stage of the Add instruction.
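The stall described above can be counted with a small model. The sketch assumes a textbook 5-stage pipeline with full forwarding, in which the only remaining penalty is one stall cycle when an instruction uses the result of the load immediately before it. The `Instr` tuple, the helper names, and the register assignments are hypothetical.

```python
from collections import namedtuple

# op is "lw", "sw" or an ALU mnemonic; dest is None for stores.
Instr = namedtuple("Instr", "op dest sources")


def load_use_stalls(program):
    """Count stalls, assuming full forwarding and a one-cycle load-use penalty."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.sources:
            stalls += 1
    return stalls


def cycles_with_stalls(program, n_stages=5):
    """Ideal pipelined cycle count plus the load-use stall cycles."""
    return n_stages + len(program) - 1 + load_use_stalls(program)


# a = b + c, with hypothetical register assignments
program = [
    Instr("lw", "r1", ["rb"]),          # r1 <- b
    Instr("lw", "r2", ["rc"]),          # r2 <- c
    Instr("add", "r3", ["r1", "r2"]),   # needs r2 straight after its load
    Instr("sw", None, ["r3", "ra"]),    # a <- r3
]

print("stalls:", load_use_stalls(program))        # prints: stalls: 1
print("cycles:", cycles_with_stalls(program))     # prints: cycles: 9
```

Placing an independent instruction between the second load and the add would remove the stall in this model.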
• Write a program to add 32-bit numbers.
• Find the one’s complement of a given number.
[Use the MVN instruction, which acts as a NOT instruction.]
• Swapping:
if the value is 4E (only 8 bits; the remaining bits are 0),
the result should be E4.
• Sum of n numbers
• Find the smallest/ largest of 2 numbers
• Find the smallest of n numbers
Eg1
• Consider that there are 3 stages in an
instruction and each stage takes 1 minute.
• What is the time taken to finish 3 instructions
in a non-pipelined processor?
• What is the average time taken per
instruction in a non-pipelined processor?
• Answer the same questions for a pipelined processor.
ANS
• Non-pipelined processor = 9 mins
• Average time per instruction, non-pipelined = 3 mins
• Pipelined processor = 5 mins
• Average time per instruction, pipelined = 5/3 ≈ 1.67 mins
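These totals follow from the usual timing formulas for n instructions on a k-stage machine with stage time t. The derivation below is a sketch that assumes an ideal pipeline with no stalls; the symbols n, k, and t are introduced here for the worked example.

```latex
\begin{aligned}
T_{\text{non-pipelined}} &= n \cdot k \cdot t = 3 \cdot 3 \cdot 1\ \text{min} = 9\ \text{min} \\
T_{\text{pipelined}} &= (k + n - 1) \cdot t = (3 + 3 - 1) \cdot 1\ \text{min} = 5\ \text{min}
\end{aligned}
```

The first instruction needs k stage times to pass through the pipeline, and each of the remaining n - 1 instructions completes one stage time later.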

Eg.
• A 5-stage pipelined processor has Instruction
Fetch (IF), Instruction Decode (ID), Execute (EX),
MEM, and Write Operand (WO) stages.
• The IF, ID, MEM, and WO stages take 1 clock
cycle each for any instruction.
• The EX stage takes 1 clock cycle for ADD and
SUB instructions, 3 clock cycles for MUL
instructions, and 6 clock cycles for DIV
instructions.
For the instruction sequence below:
• What is the number of clock cycles required if
1. it is a non-pipelined processor?
2. it is a pipelined processor without forwarding?
3. it is a pipelined processor with forwarding?
Instruction sequence
I1: MUL R2, R0, R1
I2: DIV R5, R3, R4
I3: ADD R2, R5, R2
I4: SUB R5, R2, R6
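The non-pipelined case can be worked out directly, since each instruction simply passes through all five stages before the next one starts. The sketch below covers only that case; the two pipelined cases depend on the hazard and forwarding assumptions and are not modelled here. The names `EX_CYCLES` and `non_pipelined_cycles` are hypothetical.

```python
EX_CYCLES = {"ADD": 1, "SUB": 1, "MUL": 3, "DIV": 6}   # from the problem statement
OTHER_STAGES = 4                                        # IF, ID, MEM, WO: 1 cycle each


def non_pipelined_cycles(opcodes):
    """Sum the stage cycles of every instruction, executed one after another."""
    return sum(OTHER_STAGES + EX_CYCLES[op] for op in opcodes)


sequence = ["MUL", "DIV", "ADD", "SUB"]    # I1 to I4
print(non_pipelined_cycles(sequence))      # prints: 27
```

The per-instruction costs are 7, 10, 5, and 5 cycles, which gives 27 cycles in total.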
