HJ94 Slides 8 DSP

DSP Processors – Lecture 8
Fundamentals
Ingrid Verbauwhede
Departement Elektrotechniek, afdeling ESAT/COSIC
Iverbauw@esat.kuleuven.ac.be
HJ94, Spring 2004, Ingrid Verbauwhede, lecture 8
Motivation
• Architecture exploration
• Specification: MATLAB, SPW, C/C++, Java
• Floating point
• Fixed point
• Algorithm transformations
• Architecture alternatives
ASIC
Special purpose (Art Designer)
Retargetable coprocessor (Target Compiler Technologies)
DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)
DSP extensions to RISC (Gezel, Tensilica)
Bit parallel
(Bit serial)
References
• The origins:
• E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
• Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Good overview:
• P. Lapsley, J. Bier, A. Shoham, E.A. Lee, “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1998.
DSP Processor Fundamentals
Processor Components:
• Data path (processing units)
• Interconnect
• Instruction processing unit
• Memory management unit
Von Neumann machine
One memory space
[Figure: processor core (mpy, ALU) connected to a single memory by one address bus and one data bus.]
FIR implementation
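[Figure: tapped delay line with N = 50 taps. The input x(n) passes through z^-1 delays down to x(n-(N-1)); each tap is multiplied by its coefficient c(0) ... c(N-1) and the products are summed to give y(n).]
$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$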
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));
Execute row by row
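The row-by-row evaluation can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function names `fir_output` and `fir_filter` are hypothetical, and samples before n = 0 are assumed to be zero.

```python
def fir_output(c, x_history):
    """Compute one row: x_history[i] holds x(n-i), c[i] holds c(i)."""
    acc = 0
    for coeff, sample in zip(c, x_history):
        acc += coeff * sample          # one multiply-accumulate per tap
    return acc


def fir_filter(c, x):
    """Filter the sequence x, assuming x(n) = 0 for n < 0."""
    delay_line = [0] * len(c)          # delay_line[i] holds x(n-i)
    y = []
    for sample in x:
        # shift every value by one location and insert the new sample
        delay_line = [sample] + delay_line[:-1]
        y.append(fir_output(c, delay_line))
    return y
```

Feeding a unit impulse through `fir_filter` returns the coefficients themselves, which is a quick sanity check of the indexing.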

FIR on Von Neumann
Assume the Von Neumann machine has a multiply-accumulate instruction
(not necessarily the case).
Assume also that pipelining lets the multiply-accumulate execute in parallel
with the read and write operations.
Then one tap needs 4 cycles:
1. read the multiply-accumulate instruction
2. read the data value from memory
3. read the coefficient from memory
4. write the data value to the next location in the delay line
(because for the next sample, all values are shifted by one location)
Memory bandwidth is crucial!
Basic Harvard Architecture
Separate data memory from program memory!
[Figure: separate program memory and data memory, an instruction processing unit, a 16 x 16 multiplier (mpy) and an accumulate ALU.]
Example 1: TMS320C10 (1982)
[Block diagram: Data RAM 144 x 16; Program ROM 1.5K x 16; address A (11-0); PA (7-0); data D (15-0); I/O ports 8 x 16 (A 2-0, D 15-0). CPU: 16-bit T-register, 16 x 16 multiply, 32-bit P-register, 16-bit barrel shifter (L), 32-bit ALU, 32-bit accumulator, ShiftL (0,1,4), 2 auxiliary registers, four-level H/W stack, status register.]
• 160/200 ns instruction cycle time
• 4K word external address reach
• 60 general purpose and DSP specific instructions
• Single cycle multiply
• 16-bit barrel shifter
• External interrupt and polled input pins
• Eight 16-bit I/O ports
• 40-pin DIP/44-pin PLCC
Courtesy: Texas Instruments
TMS320C1x Example - Sum of Products
Compute Y = AX1 + BX2 + CX3 + DX4
[Figure: data bus feeding T (16), multiplier, P (32), MUX, ALU (32), ACC (32).]
ZAC ;ACC=0
LT X1 ;T=X1
MPY A ;P=AX1
LTA X2 ;ACC=AX1; T=X2
MPY B ;P=BX2
LTA X3 ;ACC=AX1+BX2; T=X3
MPY C ;P=CX3
LTA X4 ;ACC=AX1+BX2+CX3; T=X4
MPY D ;P=DX4
APAC ;ACC=AX1+BX2+CX3+DX4
SACH Y1 ;store 32-bit result
SACL Y2 ;at locations Y1, Y2
• 50 taps = 103 cycles
• = Program ROM of 103 instructions
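The register transfers in this listing can be traced with a small Python model. It is a sketch under stated assumptions: the class `C1xDatapath` and the function `sum_of_products` are hypothetical names, every instruction is assumed to take one cycle, word lengths and overflow are ignored, and the final store instructions are not counted.

```python
class C1xDatapath:
    """Toy model of the T, P and ACC registers (no word-length effects)."""

    def __init__(self):
        self.t = 0
        self.p = 0
        self.acc = 0
        self.cycles = 0              # assumes one cycle per instruction

    def zac(self):                   # ZAC: zero the accumulator
        self.acc = 0
        self.cycles += 1

    def lt(self, x):                 # LT: load the T register
        self.t = x
        self.cycles += 1

    def mpy(self, coeff):            # MPY: P = T * coefficient
        self.p = self.t * coeff
        self.cycles += 1

    def lta(self, x):                # LTA: ACC += P, then load T
        self.acc += self.p
        self.t = x
        self.cycles += 1

    def apac(self):                  # APAC: ACC += P
        self.acc += self.p
        self.cycles += 1


def sum_of_products(coeffs, samples):
    """Run the ZAC / LT / MPY / LTA ... / APAC sequence for any length."""
    dp = C1xDatapath()
    dp.zac()
    dp.lt(samples[0])
    dp.mpy(coeffs[0])
    for coeff, x in zip(coeffs[1:], samples[1:]):
        dp.lta(x)
        dp.mpy(coeff)
    dp.apac()
    return dp.acc, dp.cycles
```

With four terms the model needs 10 instructions before the stores; with 50 terms it needs 102, so one store brings the total to the 103 quoted on the slide.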
TMS320C1x Memory and Buses
[Block diagram: Data RAM 256x16; Program ROM / EPROM / OTP 8Kx16; program address A15-A0, PA2-PA0; program data D15-D0; MUX; control signals DEN, MEN, WE; data address and data buses (16 bits); program control, CPU, instruction register.]
• Single cycle reads and writes
• Modified Harvard Architecture
- Separate Program and Data Buses
- "Bridge" between Program and Data Space
• Up to 8K words of on-chip Program ROM
• 4K words of EPROM and OTP available
• Up to 64K words External Program Memory
Courtesy: Texas Instruments

Modified Harvard Architecture
[Figure: program memory and data memory, an instruction processing unit, a 16 x 16 multiplier (mpy) and an accumulate ALU.]
The program bus is used to get the instruction,
or to get coefficients (often stored in ROM).
Same FIR: 53 cycles, 3 prog words
[Figure: the 50-tap delay-line FIR structure, with taps c(0) ... c(N-1).]
$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$
Single cycle multiply-accumulate!
TMS320C10 (LTD combines LT, DMOV, APAC):
LTD
MPY
LTD
MPY
...
MPY
100 cycles, 100 words of program memory
TMS320C25 (MACD combines LT, DMOV, APAC, MPY):
RPTK 49
MACD
53 cycles, 3 words of program memory
Example: MACD
MACD = Multiply by Program Memory and Accumulate with Delay
(Instruction is still present in C54x and C55x)
MACD Smem, pmad, src
Smem = data memory operand
pmad = program memory address
src = accumulator (A or B)
Executes (simplified):
(Smem) x (Pmem(at location pmad)) + src -> src ; multiply-accumulate
(Smem) -> Treg ; load data into the T register
(Smem) -> Smem + 1 ; copy data to the next memory location
(pmad) + 1 -> pmad ; increment the program address pointer
When executed under a repeat instruction, it takes one cycle per iteration.
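The simplified MACD semantics can be modelled in Python as below. This is a sketch, not TI's definition: the names `macd` and `fir_with_macd` are hypothetical, the data pointer is assumed to post-decrement on each repeat, the coefficient table is assumed to be stored in program memory in reversed order (c(N-1) first), and the delay line is given one spare location for the oldest sample to be shifted into.

```python
def macd(data_mem, smem, prog_mem, pmad, acc):
    """One MACD step following the simplified description."""
    acc += data_mem[smem] * prog_mem[pmad]   # multiply-accumulate
    treg = data_mem[smem]                    # load data into the T register
    data_mem[smem + 1] = data_mem[smem]      # copy data to the next location
    return acc, treg, pmad + 1               # increment program address


def fir_with_macd(data_mem, coeffs_reversed, new_sample):
    """One FIR output; data_mem[i] holds x(n-i), plus one spare location."""
    n_taps = len(coeffs_reversed)
    data_mem[0] = new_sample
    acc = 0
    pmad = 0
    smem = n_taps - 1                        # start at the oldest sample
    for _ in range(n_taps):                  # models the repeat instruction
        acc, _treg, pmad = macd(data_mem, smem, coeffs_reversed, pmad, acc)
        smem -= 1                            # assumed post-decrement
    return acc
```

Because every step copies its sample one location up, the delay line is already shifted when the loop ends, so the next call only has to write the new sample at location 0.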

Single Cycle MAC
TMS320C2x Multiplier/ALU
[Block diagram: program bus and data bus (16 bits); left shifter (0-16); T register (16); MUX; multiplier (16x16); P register (32); left shifter (0-16); MUX; arithmetic logic unit (ALU, 32); carry C; accumulator register (32); left shifter (0-7).]
• Single cycle 16x16 bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter
Courtesy: Texas Instruments
TMS320C2x Enhancements Over C1x
1986:
80/100ns instruction cycle time
Simultaneous single-cycle Multiply/ALU operations
Zero overhead repeat single instruction
64K words of off-chip Data RAM
Optimizing ANSI C-Compiler
544 words of on-chip Data/Program RAM
Multiplier Post Shifter and enhanced Accumulator Post Shifter
74 additional instructions
- Single-cycle MAC and zero overhead repeat
- Long immediate and carry bit support
- More logical and conditional branch operations
- Data block move support
Bit reversed addressing for FFTs
Eight auxiliary registers
Hardware wait states
DMA support
Idle and Powerdown Capability
Courtesy: Texas Instruments

Other memory configurations
Multiple data memories
[Figure: program memory, data memory, data memory.]
e.g. Motorola 56000:
- program memory
- X memory
- Y memory
Instruction cache
[Figure: program cache, program/data memory, data memory.]
• single instruction (RPTK repeat in TMS320C2x)
• a few instructions (up to 15 in AT&T 16A)
• ALWAYS under programmer's control!
• ALWAYS known at compile time!
Memory configurations (more)
• Very cost sensitive applications

• all memory ON chip (even in the 80’s!)
• multiple small memories instead of unpredictable memory cache hierarchy
• program memory mostly ROM (now Flash Memory)
• Programmer decides the distribution of arrays over the memories
to make sure that the two parallel reads are from different memory banks!
• More fancy stuff:

• special instructions to move samples in a delay line
• circular buffers for delay lines
Block Diagram (C54x)
• Memory Access
– 4 internal bus pairs
– C,D for data read
– E for data write
– P for program
• Others
– 2 40-bit Accum.
– 40-bit Barrel shifter
– 40-bit ALU
– 17b x 17b multiplier and 40b dedicated adder perform a non-pipelined single-cycle MAC
Addressing modes
• 2^16 memory locations

• only 16 bit instruction width means only one immediate address
• most processors: immediate address is two instruction words
• MOST used: register – indirect addressing

• very compact
• very useful for accessing consecutive memory locations in a
repetitive mode
• Needs:
• special address registers
• associated Address calculation units
• operate in parallel
• as many ACU’s as memories
Indirect addressing:
r1 = address of last word in the delay line
r2 = address of last coefficient
r3 = address of last word in the delay line
a1 = new input sample
a0 = *r1-- x *r2--;
Repeat 47 times
a0 = a0 + (*r3--=*r1--)x *r2--;
a1 = a0 + (*r3=a1) x *r2;
[Figure: delay line from x[0] to x[n-(N-1)], read with r1 and written with r3; the new sample is read from a1.]
*r1-- = read the memory location whose address is stored in r1,
then decrement the contents of r1 (post-modification)
Modulo addressing= circular buffers
Moving samples around:
• requires memory bandwidth (extra write operation)
• extra power consumption
Therefore: circular buffers
• pointers move in a circle
• requires special ACU with start and end location of the circular buffer in memory and special logic to test boundaries.
[Figure: circular buffer. The location holding x[n-(N-1)] is read and then overwritten with the new x; x[n-(N-2)] will become x[n-(N-1)] and x[0] will become x[1].]
Circular buffer (cont.)
• Example (C54x)
– BK = buffer size (e.g. 6 = 0110, 7 locations)
– Start at location with xxxx 0000 (4 LSB’s have to be zero)
• used for sliding window type operations: convolution,
correlation, FIR filters, etc.
*+AR0(0)% ;AR0 = 0 (1st value)
*+AR0(5)% ;AR0 = 5 (2nd value)
*+AR0(2)% ;AR0 = 1 (3rd value)
*+AR0(-3)% ;AR0 = 4 (4th value)
*+AR0(6)% ;AR0 = 4 (5th value)
[Figure: buffer locations numbered 0 to 6.]
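The AR0 values in this example are reproduced by wrapping the index within BK = 6. A minimal Python sketch of that update follows. The function name `circular_step` is hypothetical, the buffer is assumed to start at offset 0, and the step is assumed to be no larger than the buffer size.

```python
def circular_step(index, step, bk):
    """Add step to index and wrap the result into the range 0 .. bk-1."""
    assert abs(step) <= bk, "step larger than the buffer is not modelled"
    new_index = index + step
    if new_index >= bk:          # ran past the end: wrap to the start
        new_index -= bk
    elif new_index < 0:          # ran before the start: wrap to the end
        new_index += bk
    return new_index


def trace(steps, bk, start=0):
    """Return the successive index values for a list of steps."""
    values = []
    index = start
    for step in steps:
        index = circular_step(index, step, bk)
        values.append(index)
    return values
```

Calling `trace([0, 5, 2, -3, 6], 6)` walks through the same five steps as the listing. Only a compare and a conditional add or subtract are needed, which is the kind of boundary logic the special ACU provides.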
Mobile Wireless Trends

[Chart: global wireline and global wireless subscribers (in thousands, axis 0 to 1,600,000), 1995-2010. Wireline: CAGR 5%, global penetration (2010) 20%. Wireless (Cellular + PCS + WLAS + Other): CAGR 21%, global penetration (2010) 21%. Global population 7 billion, CAGR 1995-2010 1.4%.]
World-wide deployment of mobile communications is exceeding expectations
DSP Evolution and Markets
[Pie chart: DSP market, $2B market, 30% growth rate. Segment labels: Disk, Wireless (cellular infrastructure, mobile handsets, cordless, GPS), Modem (V.34, V.90, xDSL), Consumer & Automotive, Other. Values shown: $270M, $1.01B, $727M. Source: Forward Concepts 1996.]
[Chart: power (mW/MIP, log scale from 1 to 10K) versus year, 1980-2000. Labelled points: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700), DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210.]
The DSP Market Splits
Today's general purpose assembly coded DSP:
• 100 MOPS
• 250 mW
• $40
Mobile terminals: low cost, low power DSPs
• 200-1000 MOPS
• < 100 mW
• $10
Infrastructure: high performance DSPs
• 1-10 GOPS
• 1-5 watts
• < $50
References (cont.)
More references:
• P. Faraboschi, G. Desoli, J. Fisher, “The latest word in Digital and Media Processing,” IEEE Signal Processing Magazine, March 1998, pp. 59-85 (download from the INSPEC webpage).
• I. Verbauwhede, M. Touriguian, “Wireless Digital Signal Processors,” Chapter 11 in Digital Signal Processing for Multimedia Systems, Eds. K. Parhi, T. Nishitani, Marcel Dekker, Inc.
• C. Nicol, I. Verbauwhede, “DSP Architectures for Next Generation Wireless Communications,” ISSCC 2000 tutorial.
Recall: Memory architecture
FIR execution on:

• Von Neumann: 4 cycles/tap (instruction fetch, data read, coefficient read, delay-line write)
• Basic Harvard: 2 cycles/tap
• Modified Harvard & repeat loop: 1 cycle per tap & only 3 instructions
Key issues:
• Memory bandwidth by multiple memory banks or multi port memories
• Every memory has its OWN address generation unit
operating in parallel
• Special instructions that combine operations with memory moves:
MACD
• Indirect addressing: *r1++ or *r2--
• circular buffers: extra hardware in the address generation units
FASTER THAN 1 CYCLE PER TAP??
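The per-tap figures can be turned into totals for the 50-tap example with a few lines of Python. This is only a back-of-the-envelope sketch: the function `fir_cycles` is a hypothetical helper, the Von Neumann case takes the four memory accesses per tap enumerated on the "FIR on Von Neumann" slide, and the overhead of 3 cycles is an assumption chosen to match the 103-cycle and 53-cycle totals quoted earlier in the lecture.

```python
def fir_cycles(n_taps, cycles_per_tap, overhead=0):
    """Cycles for one output sample: per-tap cost plus fixed overhead."""
    return n_taps * cycles_per_tap + overhead


N = 50
von_neumann = fir_cycles(N, 4)                  # 4 memory accesses per tap
basic_harvard = fir_cycles(N, 2, overhead=3)    # e.g. TMS320C10, LTD/MPY pairs
modified_harvard = fir_cycles(N, 1, overhead=3) # e.g. TMS320C25, RPTK + MACD
```

For N = 50 this gives 200, 103 and 53 cycles, so each architectural step roughly halves the cost of one output sample.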
Compute Intensive function 1: FIR (cont.)

[Figure: tapped delay line with N = 50 taps, coefficients c(0) ... c(N-1), products summed to give y(n).]
$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
One output = 2N reads, N MAC’s, 1 write
Classic Harvard: one output = N cycles

FIR speed-up
FIR filtering: two outputs in parallel
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
Two outputs = 4N reads, 2N MAC’s, 2 writes

Dual MAC architecture with ONLY 2 data busses??
• Read two 32-bit numbers instead of four 16-bit numbers
(solution by Lucent 16000 core with dual MAC)
• Run MAC at double frequency, read two 32-bit numbers
(solution by Matsushita)
• Insert delay register
(solution by Atmel's LODE)
Example 3: Lucent DSP16210

Inner loop of 32-tap FIR filter:
do 14 { // one instruction!
a0=a0+p0+p1 p0=xh*yh p1=xl*yl y=*r0++ x=*pt0++
}
• Outer loop: 19 cycles, 38 bytes
• 1 cycle in inner loop
• 5 exec units used in inner loop
• 2 MACs per cycle
• Horizontal parallelism, one sample at a time
• 2G mobile wireless base-stations
[Block diagram: XDB(32) and IDB(32) buses; Y(32) and X(32) registers; two 16 x 16 multipliers producing p0 (32) and p1 (32); two Shift/Sat. units; ALU, ADD, BMU; ACC file 8 x 40.]
Courtesy: Gareth Hughes, Bell Labs Australia
FIR on Lode
FIR filter: two outputs in parallel with delay register

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
Total energy for one output sample:
Energy                      Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations       N            N          N
No. of memory reads         2N           2N         N
No. of instruction cycles   N            N/2        N/2
FIR on Lode
Two MAC units with dedicated bus network
• DB0 fetches coefficient c(i)
• DB1 fetches data
• LREG delays input data
• A0 stores y(n) output
• A1 stores y(n+1) output
[Figure: data buses DB1(16) and DB0(16). DB1 supplies x(n-i) and, through the delay register LREG, x(n-i+1); DB0 supplies c(i); multiply-add units MAC1 and MAC0 accumulate into A0 and A1.]
Same structure can be used for IIR
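The effect of the delay register on the read count can be illustrated with a short Python model. It is a sketch rather than a description of the Lode hardware: the function `dual_mac_fir` is a hypothetical name, samples outside the input array are assumed to be zero, and one extra read is assumed to preload the delay register.

```python
def dual_mac_fir(c, x, n):
    """Compute y(n) and y(n+1) together; return (y_n, y_n1, memory_reads)."""

    def read_x(k):
        return x[k] if 0 <= k < len(x) else 0

    reads = 0
    a0 = 0                       # accumulates y(n)
    a1 = 0                       # accumulates y(n+1)
    lreg = read_x(n + 1)         # preload the delay register with x(n+1)
    reads += 1
    for i in range(len(c)):
        coeff = c[i]             # coefficient fetch (one read)
        data = read_x(n - i)     # data fetch (one read)
        reads += 2
        a0 += coeff * data       # uses x(n-i)
        a1 += coeff * lreg       # uses x(n-i+1) held in the delay register
        lreg = data              # delay the sample for the next tap
    return a0, a1, reads
```

Two outputs cost 2N + 1 reads here, so about N reads per output, which is consistent with the "Dual MAC with REG" column of the energy table.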
Arithmetic
DSP processors come in two flavors:
• floating point
• most popular one: SHARC from Analog Devices
• fixed point
• usually 16 bit, sometimes 24 bit (audio processors)
• newer processors might have wider data paths or registers
(TI C6x: 16x16 mpy, 32 bit registers, 40 bit ALU)
[Figure: basic datapath. 16 x 16 multiplier, 32-bit product, 40-bit ALU, 40-bit shifter, select 16 bits.]
Overflow:
• Saturation logic combined with output shifter
[Figure: datapath with 16 x 16 multiplier, 32-bit product, 40-bit ALU, 40-bit shifter/saturate unit, select 16 bits.]
• How to implement saturation?
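One possible answer, sketched in Python: clamp the shifted accumulator value to the range of the output word. The function names `saturate` and `extract_q15` are hypothetical, and the Q15 format (15 fractional bits) with truncation instead of rounding is an assumption for illustration.

```python
def saturate(value, bits=16):
    """Clamp value to the two's complement range of a bits-wide word."""
    max_val = (1 << (bits - 1)) - 1      # 32767 for 16 bits
    min_val = -(1 << (bits - 1))         # -32768 for 16 bits
    return max(min_val, min(max_val, value))


def extract_q15(acc, shift=15):
    """Shift a wide accumulator down and saturate the result to 16 bits."""
    return saturate(acc >> shift, 16)    # >> is an arithmetic shift in Python
```

In hardware the same test reduces to checking whether the bits above the selected 16-bit field all equal the sign bit. If they do not, the output is forced to the most positive or most negative value.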
Overflow:
• Input shifter: scaling and alignment of the inputs;
precision is lost if the inputs are shifted down too far.
[Figure: datapath with 16 x 16 multiplier, input shifter, 32-bit product, 40-bit ALU, 40-bit shifter/saturate unit, select 16 bits.]
Block normalization
• Often used in speech coders because the dynamic range of the
input signals is unknown.
• Scale the whole array of values such that the maximum entry
sits in the range [0.5, 1)
• minimum loss of precision
TI C54x:
EXP A <- counts the number of sign bits, stores this number in TREG
NORM A <- shifts the accumulator by the number of bits in TREG
Lode:
Repeat N;
A3 = expmn (*r0), r0++; (stores # of sign bits in special register ASR)
Repeat N;
*r0 = *r0 < ASR, r0++;
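The two passes (find the smallest number of redundant sign bits, then shift every entry by that amount) can be sketched in Python. The names `redundant_sign_bits` and `block_normalize` are hypothetical, and 16-bit two's complement values in Q15 format are assumed.

```python
def redundant_sign_bits(value, bits=16):
    """Left shifts that fit before the sign bit of a bits-wide word changes."""
    if value < 0:
        value = ~value                   # map negative values onto 0 .. 2^(bits-1)-1
    return (bits - 1) - value.bit_length()


def block_normalize(block, bits=16):
    """Shift the whole block left by the common exponent; return (block, shift)."""
    shift = min(redundant_sign_bits(v, bits) for v in block)   # first pass
    return [v << shift for v in block], shift                  # second pass
```

For the block [256, -100, 3] the common shift is 6, and the largest entry becomes 16384, which is 0.5 in Q15 and therefore sits at the lower edge of the target range [0.5, 1).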
Pipelining:
[Figure: successive instructions overlapped in time, each passing through the stages Fetch, Decode, Memory Access, Execute.]
Fetch = fetch instruction
Decode = decode instruction
Memory access = address generation and read operands
Execute = perform operation
Pipelining
How does the pipeline appear to the programmer?
Lee's paper (part II) discusses 3 variations
(the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding
Interlocking: the instructions appear as if executed one after another
Interlocking on C10
[Figure: LT, MPY, LTD, MPY each pass through Fetch, Decode, Memory Access, Execute, each starting one cycle after the previous instruction.]
Reservation table:
PMEM LT MPY LTD MPY LTD MPY
DMEM data coef1 data coef2 ...
MPY
ALU
Interlocking on C2x
The programmer does not know the pipeline.
If an access conflict occurs, the hardware will “stall” and finish one (part of an)
instruction before finishing a second.
RPTK 49
MACD
Reservation table:
PMEM RPTK MACD coef1 coef2 coef3
DMEM data1 data2 ...
MPY
ALU
Time stationary
An instruction specifies “one instruction cycle”,
so it specifies everything that occurs in parallel.
[Figure: pipeline diagram with overlapped instructions.]
Example:
Motorola:
MAC X0, Y0, A X:(R0)+, X0 Y:(R4)-, Y0
(multiply-accumulate of the values read from memory in the previous cycle)
Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
Data stationary
Time stationary: one instruction works on different samples
Data stationary: one instruction describes what happens to one input data item from
start to end.
Example (Lode):
*r3++ = a0 += a2 * *r2++;
(read from memory with pointer register r2,
multiply with a2, add to a0 and store back in a0,
store the result in memory with pointer r3,
post-modify r2 and r3)
Pipeline stages: Fetch, Decode, Read, Execute, Write
Control & Pipeline for DSP’s
RISC: load/store machine
memory access with load/store instructions (DLX, MIPS, D10V)
Pipeline: Fetch, Decode, Execute (execution / address generation), Memory Access (memory access / branch), Write Back
Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
Pipeline: Fetch, Decode, Memory Access (memory access), Execute (execution), Write Back
Excellent for number crunching!
Pipeline RISC compared to DSP

RISC example:
r0 = *p0; // load data
a0 = a0 + r0; // execute
[Figure: three overlapped RISC pipelines (Fetch, Decode, Execute, Memory Access). Too expensive for DSP.]
DSP: memory intensive applications:
[Figure: four overlapped DSP pipelines (Fetch, Decode, Memory Access, Execute).]
Penalty: data dependent branch is expensive
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized
assembler - large development effort with minimal tool support
[Flow: HLL algorithmic model -> hand coded assembler prototype code -> optimize & debug -> production code]
Long, frustrating time to market
Fragile legacy code
Widely used in handhelds, but change in basestations
Part II
Lode Core Architecture
Domain specific instruction set
Basic instruction set for general purpose DSP

e.g. MAC, min, max, etc.
Extra instructions for performance with every new generation

e.g. “square distance and accumulate”
$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$
One 32 bit instruction:

a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;
Bus network and instruction set design go together
CISC, thus compiler unfriendly
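What the single "square distance and accumulate" instruction computes per loop iteration can be written out in Python. This is a sketch only: the function name `square_distance` is hypothetical, the variable names mirror the registers a0 and a3 in the instruction, and the scaling by asr is left out.

```python
def square_distance(x, y):
    """Accumulate the squared difference of two equal-length arrays."""
    a0 = 0
    for xi, yi in zip(x, y):     # r0++ and r1++ walk through both arrays
        a3 = abs(xi - yi)        # a3 = abs(*r0 - *r1)
        a0 += a3 * a3            # a0 = a0 + sqr(a3)
    return a0
```

A general purpose instruction set would need a subtract, a multiply-accumulate and two pointer updates for the same work, which is why such an operation is fused into one instruction.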

Other control features
Hardware looping:
• Because software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported
Interrupts: hardware with shadow registers for extremely fast context switching.
Special instruction cache:
• Single instruction “repeat” buffer
• Multiple instruction cache: under programmer's control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache
Predictable worst case execution time!