
ARM ORGANISATION

• Computer architecture is an abstract model: the attributes that are
visible to the programmer, such as the instruction set, the number of bits
used for data, and the addressing techniques.
• A computer's organization expresses the realization of the architecture,
that is, how features are implemented: these registers, those data paths,
or this connection to memory. Computer organization covers the ALU, the
CPU, memory, and memory organization.
• Computer architecture refers to those attributes of a system that are
visible to a programmer and that have a direct impact on the logical
execution of a program.
• Computer organisation refers to the operational units and their
interconnections that realize the architectural specifications.
• EXAMPLE 1:
• Suppose you work at a company that manufactures cars. The design and
overall details of the car correspond to computer architecture (the
abstract, programmer's view), while making its parts piece by piece and
connecting the different components together, keeping the basic design in
mind, corresponds to computer organization (physical and visible).
• EXAMPLE 2:
• Both Intel and AMD processors implement the same x86 architecture, but
how the two companies implement that architecture (their computer
organizations) is usually very different. The same programs run correctly
on both because the architecture is the same, but they may run at
different speeds because the organizations are different.
Pipeline stages (for different families of ARM processors)
3-stage pipeline ARM organization

• The register bank, which stores the processor state.
• The barrel shifter, which can shift or rotate one operand by any number
of bits.
• The ALU, which performs the arithmetic and logic functions required by
the instruction set.
3-stage pipeline ARM organization

• The address register and incrementer, which select and hold all memory
addresses and generate sequential addresses when required.
• The data registers, which hold data passing to and from memory.
1. In a single-cycle data processing instruction, two register operands
are accessed, the value on the B bus is shifted and combined with the
value on the A bus in the ALU, and the result is then written back into
the register bank.
2. The program counter value is held in the address register, from where
it is fed into the incrementer. The incremented value is copied back into
r15 in the register bank and also into the address register, to be used as
the address for the next instruction fetch if needed.
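The datapath steps above can be sketched in a few lines of Python. This is only an illustrative model, not the actual hardware: the names `barrel_shift`, `data_processing` and `next_fetch_address` are hypothetical, shift amounts are assumed to lie in 0..31, and the special cases of the real ARM shifter (such as shifts by 32 or RRX) are left out.

```python
MASK32 = 0xFFFFFFFF  # registers and buses are modelled as 32 bits wide


def barrel_shift(value, kind="LSL", amount=0):
    """Shift or rotate a 32-bit value by 0..31 bits (simplified model)."""
    value &= MASK32
    if amount == 0:
        return value
    if kind == "LSL":
        return (value << amount) & MASK32
    if kind == "LSR":
        return value >> amount
    if kind == "ROR":
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError(f"unsupported shift kind: {kind}")


# A small subset of ALU functions, chosen for illustration only.
ALU_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "ORR": lambda a, b: a | b,
}


def data_processing(regs, op, rd, rn, rm, kind="LSL", amount=0):
    """Model one single-cycle data processing instruction."""
    a = regs[rn]                                # operand on the A bus
    b = barrel_shift(regs[rm], kind, amount)    # B bus passes through the shifter
    regs[rd] = ALU_OPS[op](a, b) & MASK32       # result written back to the bank


def next_fetch_address(pc):
    """Model the incrementer: sequential 32-bit ARM instructions are 4 bytes apart."""
    return (pc + 4) & MASK32
```

For example, with r1 = 1 and r2 = 3, `data_processing(regs, "ADD", 0, 1, 2, "LSL", 2)` models `ADD r0, r1, r2, LSL #2` and leaves 13 in r0.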
The 3-stage pipeline
ARM processors up to the ARM7 employ a simple 3-stage pipeline with the
following stages:
1. Fetch
2. Decode
3. Execute
ARM single-cycle instruction 3-stage pipeline operation

1. When the processor is executing simple data processing instructions,
the pipeline enables one instruction to be completed every clock cycle.
2. An individual instruction takes three clock cycles to complete, so it
has a three-cycle latency, but the throughput is one instruction per
cycle.
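The latency and throughput figures above can be checked with a small Python sketch of an ideal 3-stage pipeline. It assumes single-cycle instructions only, with no stalls; the function names are hypothetical and chosen for illustration.

```python
STAGES = ("Fetch", "Decode", "Execute")


def stage_occupancy(n_instructions):
    """Map each clock cycle (1-based) to {stage: instruction number}."""
    table = {}
    for instr in range(1, n_instructions + 1):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset          # instruction i enters Fetch in cycle i
            table.setdefault(cycle, {})[stage] = instr
    return table


def completion_cycle(instr):
    """Cycle in which instruction number `instr` finishes Execute."""
    return instr + len(STAGES) - 1


def total_cycles(n_instructions):
    """Cycles needed to complete n back-to-back single-cycle instructions."""
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, stages in sorted(stage_occupancy(4).items()):
        row = "  ".join(
            f"{s}: I{stages[s]}" if s in stages else f"{s}: --" for s in STAGES
        )
        print(f"cycle {cycle}: {row}")
```

Each instruction still spends three cycles in the pipeline, but once the pipeline is full one instruction completes in every cycle.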
3-stage pipeline operation
ARM Multi Cycle instruction
Multiple register data transfer instructions
Example of ldmia (load multiple, increment after)
ldmia r9, {r0-r3}    @ register 9 holds the base address

This has the same effect as four separate ldr instructions:

ldr r0, [r9]
ldr r1, [r9, #4]
ldr r2, [r9, #8]
ldr r3, [r9, #12]

Note: at the end of the ldmia instruction, register r9 has not been
changed. If you want r9 to be updated, add the writeback suffix (!) to the
base register:
ldmia r9!, {r0,r2,r5}

Multiple register data transfer instructions
ldmia – Example
ldmia r9, {r0-r3, r12}
• Load the words addressed by r9 into r0, r1, r2, r3, and r12.
• The transfer address is incremented after each load; r9 itself is not
changed because there is no writeback (!).

Example 3
ldmia r9, {r5, r3, r0-r2, r14}
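• Load the words addressed by r9 into registers r0, r1, r2, r3, r5, and
r14. The order in which registers are written in the list does not matter:
the lowest-numbered register is always loaded from the lowest address.
• The transfer address is incremented after each load; r9 itself is not
changed because there is no writeback (!).
• ldmib, ldmda, and ldmdb work similarly to ldmia.
• Stores work in a manner analogous to the load instructions.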
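The behaviour of ldmia described above can be sketched as a small Python model. This is an illustration under simple assumptions: memory is a dictionary from byte address to 32-bit word, registers are a dictionary keyed by register number, and the function name `ldmia` and its `writeback` parameter are hypothetical.

```python
def ldmia(memory, regs, base, reg_list, writeback=False):
    """Model LDMIA: load consecutive words starting at the address in `base`.

    Registers are filled in ascending register-number order, whatever order
    they were written in the list. The base register is only updated when
    writeback is requested (the '!' suffix in assembly).
    """
    address = regs[base]
    for reg in sorted(reg_list):
        regs[reg] = memory[address]
        address += 4                    # increment after each transfer
    if writeback:
        regs[base] = address


# Six words of made-up data starting at a made-up address.
memory = {0x1000 + 4 * i: 10 + i for i in range(6)}
regs = {n: 0 for n in range(16)}
regs[9] = 0x1000

# ldmia r9, {r5, r3, r0-r2, r14}
ldmia(memory, regs, base=9, reg_list=[5, 3, 0, 1, 2, 14])
```

After the call, r0 to r3 hold the first four words, r5 the fifth and r14 the sixth, and r9 still holds 0x1000. Passing `writeback=True` models `ldmia r9!, {...}` and leaves r9 pointing just past the last word loaded.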

Store Multiples
Load and Store Multiples

LDMxx r10, {r0,r1,r4}
STMxx r10, {r0,r1,r4}
Base register (Rb) = r10; addresses increase up the table.

Address     IA    IB    DA    DB
r10+12            r4
r10+8       r4    r1
r10+4       r1    r0
r10         r0          r4
r10-4                   r1    r4
r10-8                   r0    r1
r10-12                        r0
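The four addressing modes differ only in where the block of words sits relative to the base register. A minimal Python sketch of that address calculation follows; the function name `block_addresses` is hypothetical, and the base value used in the usage note is made up.

```python
def block_addresses(base, reg_list, mode):
    """Return {register number: address} for LDMxx/STMxx base, {reg_list}.

    mode is one of "IA", "IB", "DA", "DB". In every mode the lowest-numbered
    register uses the lowest address; only the position of the block relative
    to the base address changes.
    """
    regs = sorted(reg_list)
    n = len(regs)
    if mode == "IA":            # increment after: block starts at base
        lowest = base
    elif mode == "IB":          # increment before: block starts one word above
        lowest = base + 4
    elif mode == "DA":          # decrement after: block ends at base
        lowest = base - 4 * (n - 1)
    elif mode == "DB":          # decrement before: block ends one word below
        lowest = base - 4 * n
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {reg: lowest + 4 * k for k, reg in enumerate(regs)}
```

With r10 = 0x100 and the register list {r0, r1, r4}, the IA mode places r0 at 0x100 and r4 at 0x108, while the DB mode places r0 at 0xF4 and r4 at 0xFC.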
The mapping between the stack and block copy views of the load and store
multiple instructions

LDMFD == restore from stack
STMFD == save registers onto stack
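The correspondence between the two views can be written out as a lookup table, together with a small model of a full descending stack. This is a sketch: the dictionary name and the helper functions `stmfd` and `ldmfd` are hypothetical, and memory is modelled as a Python dictionary from byte address to word.

```python
# Stack-view mnemonic -> equivalent block-copy mnemonic.
# F/E = full/empty stack, D/A = descending/ascending stack.
STACK_TO_BLOCK = {
    "STMFD": "STMDB", "LDMFD": "LDMIA",
    "STMFA": "STMIB", "LDMFA": "LDMDA",
    "STMED": "STMDA", "LDMED": "LDMIB",
    "STMEA": "STMIA", "LDMEA": "LDMDB",
}


def stmfd(memory, sp, values):
    """Push `values` (given in ascending register order); return the new sp."""
    sp -= 4 * len(values)               # decrement before: make room first
    for k, value in enumerate(values):
        memory[sp + 4 * k] = value      # lowest register at the lowest address
    return sp


def ldmfd(memory, sp, count):
    """Pop `count` words; return (values, new sp)."""
    values = [memory[sp + 4 * k] for k in range(count)]
    return values, sp + 4 * count       # increment after: release the space
```

Saving registers with `stmfd` and restoring them with `ldmfd` returns the same values and leaves the stack pointer where it started, which is why the pair is the usual choice for saving and restoring registers around a subroutine call.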
1. To overcome the limitations of the 3-stage pipeline, higher-performance
ARM cores employ a 5-stage pipeline and have separate instruction and
data memories.
2. Breaking instruction execution down into five components rather than
three reduces the maximum work which must be completed in a clock cycle,
and hence allows a higher clock frequency to be used.
3. The separate instruction and data memories allow a significant
reduction in the core's CPI.
Recall - ARM family 7 and 9
5-stage pipeline ARM organization
The time $T_{prog}$ required to execute a given program is given by:

$$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$$

where
$N_{inst}$ - number of ARM instructions executed in the course of the program
$CPI$ - average number of clock cycles per instruction
$f_{clk}$ - the processor's clock frequency

Since $N_{inst}$ is constant for a given program (compiled with a given
compiler using a given set of optimizations, and so on), there are only
two ways to increase performance.
1. Increase the clock rate, $f_{clk}$.
• This requires the logic in each pipeline stage to be simplified and,
therefore, the number of pipeline stages to be increased.
2. Reduce the average number of clock cycles per instruction, CPI.
• This requires either that instructions which occupy more than one
pipeline slot in a 3-stage pipeline ARM are re-implemented to occupy
fewer slots, or that pipeline stalls caused by dependencies between
instructions are reduced, or a combination of both.
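The effect of the two options can be tried out numerically with a short Python sketch of the execution-time formula. The function name `execution_time` is hypothetical, and all the figures in the example are invented for illustration; they are not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """Program execution time in seconds: N_inst * CPI / f_clk."""
    return n_inst * cpi / f_clk_hz


# Invented baseline: one million instructions, CPI of 2.0, 50 MHz clock.
baseline = execution_time(1_000_000, 2.0, 50e6)

# Option 1: raise the clock rate (here doubled) with CPI unchanged.
faster_clock = execution_time(1_000_000, 2.0, 100e6)

# Option 2: lower the CPI (here to 1.5) with the clock unchanged.
lower_cpi = execution_time(1_000_000, 1.5, 50e6)
```

With these made-up numbers, doubling the clock halves the run time, while reducing the CPI from 2.0 to 1.5 cuts it by a quarter; a deeper pipeline with separate memories aims to gain on both fronts.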
Instruction Execution
Store Instruction
Branch Instruction
Write the instructions required, and the pipeline stages for those
instructions, to perform the following operation:

a = b + c

• Running this code segment requires some forwarding.
• However, a load (LW) followed immediately by an ALU instruction (Add or
Sub) that uses the loaded value creates a hazard that cannot be resolved
by forwarding alone.
• So the pipeline must stall. Observe that in time steps 4, 5, and 6,
there are two forwards from the data memory unit to the ALU in the EX
stage of the Add instruction.
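The load-use hazard described above can be detected with a simple check over adjacent instructions. The Python sketch below assumes a classic 5-stage pipeline with full forwarding, in which loaded data only becomes available at the end of the memory stage, so each load-use pair costs one stall cycle. The instruction encoding, register choices and function names are hypothetical.

```python
# a = b + c as a load/load/add/store sequence (register choices are arbitrary).
PROGRAM = [
    {"op": "LOAD",  "dest": "r1", "srcs": ()},            # r1 <- b
    {"op": "LOAD",  "dest": "r2", "srcs": ()},            # r2 <- c
    {"op": "ADD",   "dest": "r3", "srcs": ("r1", "r2")},  # r3 <- r1 + r2
    {"op": "STORE", "dest": None, "srcs": ("r3",)},       # a  <- r3
]


def count_load_use_stalls(program):
    """Count one stall for every load whose result is used by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev["op"] == "LOAD" and prev["dest"] in cur["srcs"]:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Cycles to run the program: pipeline fill + one per instruction + stalls."""
    return n_stages + len(program) - 1 + count_load_use_stalls(program)
```

For this sequence the only unavoidable stall is between the second load and the add. The add-to-store dependency on r3 is covered by forwarding, so under these assumptions the four instructions take 9 cycles.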
• Write a program to add 32-bit numbers.
• Find the one's complement of a given number.
[Use the MVN instruction, which acts as a NOT instruction.]
• Swapping: if the value is 4E (only 8 bits; the remaining bits are 0),
the result should be E4.
• Find the sum of n numbers.
• Find the smaller/larger of 2 numbers.
• Find the smallest of n numbers.
Eg. 1
• Consider that there are 3 stages in an instruction and each stage takes
1 minute.
• What is the time taken to finish 3 instructions in a non-pipelined
processor?
• What is the average time taken per instruction in a non-pipelined
processor?
• Answer the same questions for a pipelined processor.
ANS
• Non-pipelined = 9 mins
• Average time per instruction, non-pipelined = 3 mins

• Pipelined = 5 mins
• Average time per instruction, pipelined = 5/3 ≈ 1.67 mins
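These answers follow from two standard expressions. In the working below, the symbols $n$ (number of instructions), $k$ (number of stages) and $t$ (time per stage) are introduced here for convenience, and the pipeline is assumed to be ideal, with no stalls.

```latex
\begin{align*}
T_{\text{non-pipelined}} &= n \cdot k \cdot t = 3 \cdot 3 \cdot 1~\text{min} = 9~\text{min} \\
T_{\text{pipelined}}     &= (k + n - 1) \cdot t = (3 + 3 - 1) \cdot 1~\text{min} = 5~\text{min} \\
\text{average, non-pipelined} &= \frac{9~\text{min}}{3} = 3~\text{min per instruction} \\
\text{average, pipelined}     &= \frac{5~\text{min}}{3} \approx 1.67~\text{min per instruction}
\end{align*}
```

The first instruction needs $k$ stage times to pass through the pipeline; each of the remaining $n - 1$ instructions then completes one stage time later.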


Eg. 2
• A 5-stage pipelined processor has Instruction Fetch (IF), Instruction
Decode (ID), Execute (EX), MEM, and Write Operand (WO) stages.
• The IF, ID, MEM, and WO stages take 1 clock cycle each for any
instruction.
• The EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock
cycles for the MUL instruction, and 6 clock cycles for the DIV
instruction.
For the instruction sequence below:
• What is the number of clock cycles required if it is a non-pipelined
processor?
• What is the number of clock cycles required if it is a pipelined
processor without forwarding?
• What is the number of clock cycles required if it is a pipelined
processor with forwarding?
Instruction sequence

I1: MUL R2, R0, R1
I2: DIV R5, R3, R4
I3: ADD R2, R5, R2
I4: SUB R5, R2, R6