Microprocessor Architecture
Chenyang Lu
CSE 467S

Alternative approaches
Two opposite examples
  SHARC
  ARM7

Computer Architecture in a Nutshell
Separation of CPU and memory distinguishes a programmable computer.
The CPU fetches instructions from memory.
Registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.

von Neumann vs. Harvard

von Neumann
  Same memory holds data and instructions.
  A single set of address/data buses between CPU and memory.
Harvard
  Separate memories for data and instructions.
  Two sets of address/data buses between CPU and memory.

von Neumann
[Figure: the CPU (PC = 200, IR) sends address 200 to memory; memory returns the instruction ADD r5,r1,r3 on the data bus, and it is placed in the IR.]

Harvard Architecture
[Figure: the CPU (PC, IR) connects to program memory and to data memory, each over its own address and data buses.]

von Neumann vs. Harvard
Harvard allows two simultaneous memory fetches.
Most DSPs use Harvard architecture for streaming data:
  greater memory bandwidth;
  more predictable bandwidth.
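
The bandwidth claim above can be illustrated with a toy cycle count. This is only a sketch under simplifying assumptions that are not in the slides: every memory access takes one cycle, there are no caches and no pipelining effects, and the function name `memory_cycles` and the sample trace are hypothetical.

```python
def memory_cycles(data_accesses, harvard):
    """Count memory cycles for a trace of instructions.

    data_accesses: number of data accesses made by each instruction.
    harvard: True for separate instruction and data buses,
             False for a single shared bus (von Neumann).
    """
    cycles = 0
    for n in data_accesses:
        if harvard:
            # One instruction fetch and one data access can share a cycle.
            cycles += max(1, n)
        else:
            # The fetch and every data access take turns on the single bus.
            cycles += 1 + n
    return cycles


# Hypothetical trace: four instructions, two of which load one operand.
trace = [0, 1, 1, 0]
print(memory_cycles(trace, harvard=False))  # 6
print(memory_cycles(trace, harvard=True))   # 4
```

In this model the Harvard count also stays the same whether or not an instruction touches data, which is one way to read "more predictable bandwidth".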

RISC vs. CISC
Reduced Instruction Set Computer (RISC)
  Compact, uniform instructions facilitate pipelining
  More lines of code, larger memory footprint
  Allow effective compiler optimization
Complex Instruction Set Computer (CISC)
  Many addressing modes and long instructions
  High code density
  Often require manual optimization of assembly code for embedded systems

Microprocessors
          von Neumann    Harvard
  RISC    ARM7           ARM9
  CISC    Pentium        SHARC (DSP)

Digital Signal Processor = Harvard + CISC
Streaming data
  Need high data throughput
  Harvard architecture
Memory footprint
  Require high code density
  Need CISC instead of RISC

DSP Optimizations
Signal processing
  Support floating point operation
  Efficient loops (matrix, vector operations)
  Ex. Finite Impulse Response (FIR) filters
Real-time requirements
  Execution time must be predictable
  Opportunistic optimizations in general-purpose processors (e.g., caching, branch prediction) may not work

SHARC Architecture
Modified Harvard architecture.
  Separate data/code memories.
  Program memory can be used to store data.
  Two pieces of data can be loaded in parallel
Support for signal processing
  Powerful floating point operations
  Efficient loop
  Parallel instructions

Registers
Register files
  40 bit R0-R15 (aliased as F0-F15 for floating point)
Data address generator registers.
Loop registers.
Register files connect to:
  multiplier;
  shifter;
  ALU.

Assembly Language
1-to-1 representation of binary instructions
Why do we need to know?
  Performance analysis
  Manual optimization of critical code
Focus on architecture characteristics
  NOT specific syntax

SHARC Assembly
Algebraic notation terminated by semicolon:
    R1=DM(M0,I0), R2=PM(M8,I8); ! comment
    label: R3=R1+R2;
DM(...) is a data memory access; PM(...) is a program memory access.

Computation
Floating point operations
Hardware multiplier
Parallel computation

Data Types
32-bit IEEE single-precision floating-point.
40-bit IEEE extended-precision floating-point.
32-bit integers.
48-bit instructions.

Rounding and Saturation
Floating-point can be:
  Rounded toward zero;
  Rounded toward nearest.
ALU supports saturation arithmetic
  Overflow results in max value, not rollover.
CLIP Rx within range [-Ry,Ry]
    Rn = CLIP Rx by Ry;

Parallel Operations
Can issue some computations in parallel:
  dual add-subtract;
  multiplication and dual add/subtract;
  floating-point multiply and ALU operation.
    R6 = R0*R4, R9 = R8 + R12, R10 = R8 - R12;
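
The difference between rollover and saturation, and the effect of CLIP, can be sketched in a few lines. This is an illustration rather than a model of the SHARC ALU: it assumes 32-bit two's-complement integers and a non-negative clip limit, and the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def wrapping_add(a, b):
    """Two's-complement 32-bit add: overflow rolls over."""
    total = (a + b) & 0xFFFFFFFF
    return total - 2**32 if total > INT32_MAX else total


def saturating_add(a, b):
    """32-bit add that pins the result at the nearest limit on overflow."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def clip(x, y):
    """Limit x to the range [-y, y]; assumes y >= 0."""
    return max(-y, min(y, x))


print(wrapping_add(INT32_MAX, 1))    # -2147483648 (rollover)
print(saturating_add(INT32_MAX, 1))  # 2147483647 (saturated)
print(clip(10, 3), clip(-10, 3), clip(2, 3))  # 3 -3 2
```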

Example: Exploit Parallelism
if (a>b) y = c-d; else y = c+d;
Compute both cases, then choose which one to store.
    ! Load values
    R1=DM(_a); R2=DM(_b);
    R3=DM(_c); R4=DM(_d);
    ! Compute both sum and difference of c and d
    R12 = R3+R4, R0 = R3-R4;
    ! Choose which one to save
    COMP(R1,R2);
    IF LE R0 = R12; ! a <= b, so keep the sum c+d
    DM(_y) = R0; ! Write to y

Memory Access
Parallel load/store
Circular buffer
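
The "compute both cases, then choose" idea from the Exploit Parallelism example can be checked against the plain branching version. This is a sketch with hypothetical function names; Python runs the two arithmetic operations one after the other, whereas the point on SHARC is that the sum and the difference are issued together.

```python
def branching(a, b, c, d):
    """Direct translation of: if (a>b) y = c-d; else y = c+d;"""
    if a > b:
        return c - d
    return c + d


def compute_both_then_select(a, b, c, d):
    """Compute both results unconditionally, then pick one."""
    total = c + d       # sum, always computed
    result = c - d      # difference, always computed
    if a <= b:          # conditional move: overwrite with the sum
        result = total
    return result


print(compute_both_then_select(5, 2, 10, 3))  # 7  (a > b, so c - d)
print(compute_both_then_select(1, 2, 10, 3))  # 13 (a <= b, so c + d)
```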

Load/Store
Load/store architecture
Can use direct addressing
Two sets of data address generators (DAGs):
  program memory;
  data memory.
Can perform two loads/stores per cycle
Must set up DAG registers to control loads/stores.

Basic Addressing
Immediate value:
    R0 = DM(0x20000000);
Direct load:
    R0 = DM(_a); ! Load contents of _a
Direct store:
    DM(_a)= R0; ! Stores R0 at _a

DAG1 Registers
  I0  M0  L0  B0
  I1  M1  L1  B1
  I2  M2  L2  B2
  I3  M3  L3  B3
  I4  M4  L4  B4
  I5  M5  L5  B5
  I6  M6  L6  B6
  I7  M7  L7  B7

Post-Modify with Update
I register holds base address.
M register/immediate holds modifier value.
    R0 = DM(I3,M3); ! Load
    DM(I2,1) = R1; ! Store

Circular Buffer
L: buffer size
B: buffer base address
I, M in post-modify mode
I is automatically wrapped around the circular buffer when it reaches B+L
Example: FIR filter

Zero-Overhead Loop
No cost for jumping back to start of loop
  Decrement counter, compare, and jump back
    LCNTR=30, DO L UNTIL LCE;
       R0=DM(I0,M0), F2=PM(I8,M8);
       R1=R0-R15;
    L: F4=F2+F3;
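
The circular-buffer rule above (I advances by M after each access and wraps when it reaches B+L) can be modeled as follows. This is a sketch, not the hardware's exact behavior: the class name `CircularDag` is hypothetical, and it assumes the modifier is no larger in magnitude than the buffer length.

```python
class CircularDag:
    """Toy model of one I/M/L/B register set in post-modify mode."""

    def __init__(self, base, length, modifier):
        self.b = base        # B: buffer base address
        self.l = length      # L: buffer size
        self.m = modifier    # M: modifier added after each access
        self.i = base        # I: current address, starts at the base

    def post_modify(self):
        """Return the address to access now, then advance I with wraparound."""
        address = self.i
        nxt = self.i + self.m
        if nxt >= self.b + self.l:   # ran past the end: wrap to the start
            nxt -= self.l
        elif nxt < self.b:           # negative modifier ran past the start
            nxt += self.l
        self.i = nxt
        return address


dag = CircularDag(base=100, length=4, modifier=1)
print([dag.post_modify() for _ in range(6)])  # [100, 101, 102, 103, 100, 101]
```

Because the wrap happens as part of the address update, a loop body that walks the buffer needs no explicit compare-and-reset instructions.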

FIR Filter on SHARC
    ! Init: Set up circular buffers for x[] and c[].
    B8=PM(_x); ! I8 is automatically set to _x
    L8=4; ! Buffer size
    M8=1; ! Increment of x[]
    B0=DM(_c); L0=4; M0=1; ! Set up buffer for c

    ! Executed after new sensor data is stored in xnew
    R1=DM(_xnew);
    ! Use post-increment mode
    PM(I8,M8)=R1;
    ! Loop body
    LCNTR=4, DO L UNTIL LCE;
       ! Use post-increment mode
       R1=DM(I0,M0), R2=PM(I8,M8);
    L: R8=R1*R2, R12=R12+R8;

Nested Loop
PC Stack
  Loop start address
  Return addresses for subroutines
  Interrupt service routines
  Max depth = 30
Loop Address Stack
  Loop end address
  Max depth = 6
Loop Counter Stack
  Loop counter values
  Max depth = 6

Example: Nested Loop
    S1:  LCNTR=3, DO LP2 UNTIL LCE;
    S2:     LCNTR=2, DO LP1 UNTIL LCE;
               R1=DM(I0,M0), R2=PM(I8,M8);
    LP1:       R8=R1*R2;
            R12=R12+R8;
    LP2:    R11=R11+R12;

SHARC
CISC + Harvard architecture
Computation
  Floating point operations
  Hardware multiplier
  Parallel operations
Memory Access
  Parallel load/store
  Circular buffer
Zero-overhead and nested loop

ARM7
von Neumann + RISC
Compact, uniform instruction set
  32 bit or 16 bit (Thumb)
  Usually one instruction/cycle
  Poor code density
  No parallel operations
Memory access
  No parallel access
  No direct addressing

Sample Prices
ARM7: $14.54
SHARC: $51.46 - $612.74

FIR Filter on ARM7
    ; loop initiation code
       MOV r0, #0 ; use r0 for loop counter
       MOV r8, #0 ; use separate index for arrays
       MOV r1, #4 ; buffer size
       MOV r2, #0 ; use r2 for f
       ADR r3, c ; load r3 with base of c[ ]
       ADR r5, x ; load r5 with base of x[ ]
    ; loop; instructions for circular buffer are not shown
    L: LDR r4, [r3, r8] ; get c[i]
       LDR r6, [r5, r8] ; get x[i]
       MUL r4, r4, r6 ; compute c[i]x[i]
       ADD r2, r2, r4 ; add into sum
       ADD r8, r8, #4 ; add one word to array index
       ADD r0, r0, #1 ; add 1 to i
       CMP r0, r1 ; exit?
       BLT L ; if i < 4, continue
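
For reference, the sum of products that the SHARC and ARM7 FIR listings compute can be written out in a few lines, including the circular-buffer bookkeeping that the ARM7 listing leaves out. This is a sketch with hypothetical names (`FirFilter`, `step`); pairing c[0] with the newest sample is a convention chosen here, not something taken from the listings.

```python
class FirFilter:
    """FIR filter that keeps the most recent samples in a circular buffer."""

    def __init__(self, coefficients):
        self.c = list(coefficients)
        self.x = [0.0] * len(self.c)   # circular buffer of recent samples
        self.newest = 0                # slot where the next sample is written

    def step(self, sample):
        """Store one new sample and return the filter output."""
        n = len(self.c)
        self.x[self.newest] = sample
        acc = 0.0
        idx = self.newest
        for coeff in self.c:           # c[0] pairs with the newest sample
            acc += coeff * self.x[idx]
            idx = (idx - 1) % n        # step back one sample, wrapping around
        self.newest = (self.newest + 1) % n
        return acc


fir = FirFilter([1, 2, 3, 4])
# An impulse input reads the coefficients back out, one per step.
print([fir.step(s) for s in [1, 0, 0, 0, 0]])  # [1.0, 2.0, 3.0, 4.0, 0.0]
```

The modulo operations and the loop counter are the work that the SHARC's circular addressing and zero-overhead loop hardware take over, and that the ARM7 must spend explicit instructions on.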

MIPS/FLOPS Metrics
Do not indicate how much work is accomplished by each instruction.
  Depend on architecture and instruction set.
Especially unsuitable for DSPs due to the diversity of architecture and instruction sets.
  Circular buffer load
  Zero-overhead loop

Evaluating DSP Speed
Implement, manually optimize, and compare complete application on multiple DSPs
  Time consuming
Benchmarks: a set of small pieces (kernels) of representative code
  Ex. FIR filter
  Inherent to most embedded systems
  Small enough to allow manual optimization on multiple DSPs
Application profile + benchmark testing
  Assign relative importance of each kernel
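
Combining an application profile with kernel benchmark results amounts to a weighted sum. The sketch below shows the arithmetic only: the function name, the kernel names, the weights, and the cycle counts are all made-up illustrative values, not measurements of any real DSP.

```python
def weighted_score(profile, kernel_cycles):
    """Weighted cycle count for one processor.

    profile: kernel name -> relative importance (weights sum to 1).
    kernel_cycles: kernel name -> measured cycles for that kernel.
    """
    return sum(weight * kernel_cycles[kernel] for kernel, weight in profile.items())


# Hypothetical application profile and benchmark results.
profile = {"fir": 0.75, "fft": 0.25}
cycles_dsp_a = {"fir": 40, "fft": 200}
cycles_dsp_b = {"fir": 60, "fft": 120}

print(weighted_score(profile, cycles_dsp_a))  # 80.0
print(weighted_score(profile, cycles_dsp_b))  # 75.0
```

With these made-up numbers, processor B scores better overall even though A is faster on the FIR kernel, which is why the weights have to come from the actual application.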

Other Important Metrics
Power consumption
Cost
Code density

Reading
Chapter 2 (only the sections related to slides)
Optional: J. Eyre and J. Bier, DSP Processors Hit the Mainstream, IEEE Micro, August 1998.
Optional: More about SHARC
  http://www.analog.com/processors/processors/sharc/
  Nested loops: Pages (3-37) to (3-59)
  http://www.analog.com/UploadedFiles/Associated_Docs/476124543020432798236x_pgr_sequen.pdf
