
DSP Processors – Lecture 8

Fundamentals

Ingrid Verbauwhede

Departement Elektrotechniek, afdeling ESAT/COSIC

Iverbauw@esat.kuleuven.ac.be

HJ94, Spring 2004, Ingrid Verbauwhede, les 8

Motivation
• Architecture exploration

• Specification: MATLAB, SPW, C/C++, Java

• Floating point

• Fixed point

• Algorithm transformations

• Architecture alternatives

• ASIC / special purpose: bit parallel, (bit serial) (Art Designer)
• Retargetable coprocessor (Target Compiler Technologies)
• DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)
• DSP extensions to RISC (Gezel, Tensilica)
References

• The origins:
• E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
• Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.

• Good overview:
• P. Lapsley, J. Bier, A. Shoham, E.A. Lee, “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1998.


DSP Processor Fundamentals

Processor components:

[Figure: block diagram of the processor components: data path, interconnect, processing units, instruction processing unit, memory management unit]

Von Neumann machine

One memory space

[Figure: processor core (multiplier, ALU) connected to a single memory by one address bus and one data bus]


FIR implementation
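[Figure: direct-form FIR filter with 50 taps: a delay line of z^-1 elements holding x(n), x(n-1), ..., x(n-(N-1)), multipliers for c(0) ... c(N-1), and a chain of adders producing y(n)]

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$
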
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);


y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Execute row by row
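
The row-by-row evaluation above can be sketched in a few lines of Python. This is only an illustration: the function name `fir_output` and the choice to treat samples before x(0) as zero are assumptions, not part of the slides.

```python
def fir_output(c, x, n):
    """Return y(n) = sum over i of c(i) * x(n - i).

    Assumption: samples before x(0) (or past the end of x) count as 0.
    """
    acc = 0
    for i in range(len(c)):
        k = n - i
        sample = x[k] if 0 <= k < len(x) else 0
        acc += c[i] * sample  # one multiply-accumulate per tap
    return acc


# Execute row by row: one complete output before starting the next one.
coeffs = [1, 2, 3]
samples = [4, 5, 6, 7]
outputs = [fir_output(coeffs, samples, n) for n in range(len(samples))]
```

Each call performs N multiply-accumulates and 2N operand reads, which is the cost that the following slides try to fit into as few cycles as possible.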


FIR on Von Neumann

Assume the Von Neumann machine has a multiply-accumulate instruction
(not necessarily the case).
Assume also that pipelining allows the multiply-accumulate to execute
in parallel with the read or write operations.
Then one tap needs 4 cycles:
1. read multiply-accumulate instruction
2. read data value from memory
3. read coefficient from memory
4. write data value to the next location in the delay line
(because for the next sample, all values are shifted by one location)

Memory bandwidth is crucial !!!
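
As a rough sanity check on the four-step count above, the sketch below totals the memory accesses for one output. It is a counting model only; the names `ACCESSES_PER_TAP` and `von_neumann_cycles` are hypothetical, and loop and setup overhead are ignored.

```python
# The four memory accesses needed per tap on a single-memory machine,
# as listed on the slide.
ACCESSES_PER_TAP = [
    "fetch multiply-accumulate instruction",
    "read data value",
    "read coefficient",
    "write data value to next delay-line location",
]


def von_neumann_cycles(num_taps, accesses_per_cycle=1):
    """Cycles per output when all accesses share the available memory ports."""
    total_accesses = num_taps * len(ACCESSES_PER_TAP)
    return -(-total_accesses // accesses_per_cycle)  # ceiling division


cycles_50_taps = von_neumann_cycles(50)            # one shared memory
cycles_4_ports = von_neumann_cycles(50, accesses_per_cycle=4)
```

With one access per cycle a 50-tap filter needs 200 cycles per output; only if four accesses can proceed in parallel does the count drop to one cycle per tap, which is why the later architectures add memories and buses.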


Basic Harvard Architecture

Separate data memory from program memory!

[Figure: program memory feeding the instruction processing unit; data memory feeding a datapath with a 16 x 16 multiplier, accumulator and ALU]

Example 1: TMS320C10 (1982)

Features:
• 160/200 ns instruction cycle time
• 4K word external address reach
• 60 general purpose and DSP specific instructions
• Single cycle 16 x 16 multiply
• 16-bit barrel shifter
• External interrupt and polled input pins
• Eight 16-bit I/O ports
• 40-pin DIP/44-pin PLCC

[Figure: block diagram. Data RAM 144 x 16; program ROM 1.5K x 16; A (11-0), D (15-0), PA (7-0); I/O ports 8 x 16 (A 2-0, D 15-0); CPU with 16-bit T-register, 16-bit barrel shifter (L), 16 x 16 multiplier, 32-bit P-register, 32-bit ALU, 32-bit accumulator, ShiftL (0,1,4), 2 auxiliary registers, four level H/W stack, status register]

Courtesy: Texas Instruments



TMS320C1x Example - Sum of Products

Compute Y = AX1 + BX2 + CX3 + DX4

ZAC       ; ACC = 0
LT   X1   ; T = X1
MPY  A    ; P = AX1
LTA  X2   ; ACC = AX1; T = X2
MPY  B    ; P = BX2
LTA  X3   ; ACC = AX1+BX2; T = X3
MPY  C    ; P = CX3
LTA  X4   ; ACC = AX1+BX2+CX3; T = X4
MPY  D    ; P = DX4
APAC      ; ACC = AX1+BX2+CX3+DX4
SACH Y1   ; store 32-bit result (high word)
SACL Y2   ; at locations Y1, Y2 (low word)

[Figure: datapath: data bus, T (16), multiplier, P (32), MUX, ALU (32), ACC (32)]

• 50 taps = 103 cycles


• = Program ROM of 103 instructions
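
To make the register traffic in the listing above concrete, here is a toy Python model of the T, P and ACC registers. It is a sketch under simplifying assumptions: the class name `C1xDatapath` is hypothetical, memory is a dictionary keyed by symbolic address, and bit widths, shifts and overflow are ignored.

```python
class C1xDatapath:
    """Toy model of the T, P and ACC registers (no bit widths, no overflow)."""

    def __init__(self, memory):
        self.mem = memory
        self.t = 0
        self.p = 0
        self.acc = 0

    def zac(self):
        self.acc = 0                        # ACC = 0

    def lt(self, addr):
        self.t = self.mem[addr]             # T = mem[addr]

    def mpy(self, addr):
        self.p = self.t * self.mem[addr]    # P = T * mem[addr]

    def apac(self):
        self.acc += self.p                  # ACC = ACC + P

    def lta(self, addr):
        self.acc += self.p                  # accumulate the previous product
        self.t = self.mem[addr]             # and load T in the same instruction


mem = {"A": 2, "X1": 3, "B": 4, "X2": 5, "C": 6, "X3": 7, "D": 8, "X4": 9}
dp = C1xDatapath(mem)
dp.zac()
dp.lt("X1")
dp.mpy("A")
for x_addr, coef_addr in (("X2", "B"), ("X3", "C"), ("X4", "D")):
    dp.lta(x_addr)
    dp.mpy(coef_addr)
dp.apac()
y = dp.acc
```

The model shows why each tap costs two instructions here: one to load T (and fold in the previous product), one to multiply.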
TMS320C1x Memory and Buses

• Single cycle reads and writes
• Modified Harvard architecture
  - Separate program and data buses
  - "Bridge" between program and data space
• Up to 8K words of on-chip program ROM
• 4K words of EPROM and OTP available
• Up to 64K words of external program memory

[Figure: block diagram. Data RAM 256x16; program ROM/EPROM/OTP 8Kx16; program address (A15-A0, PA2-PA0); program data (D15-D0); data address and data data buses; MUX bridging program and data; control signals DEN, MEN, WE; program control, CPU, instruction register]

Courtesy: Texas Instruments

Modified Harvard Architecture

[Figure: program memory and data memory; instruction processing unit; datapath with 16 x 16 multiplier, accumulator and ALU; the program bus also reaches the datapath]

The program bus is used to get the instruction,
or to get coefficients (often stored in ROM).

Same FIR: 53 cycles, 3 prog words

[Figure: the same direct-form 50-tap FIR filter, y(n) = sum over i = 0 .. N-1 of c(i) x(n-i)]

Single Cycle Multiply - Accumulate!

TMS320C10 (LTD combines LT, DMOV and APAC):
LTD
MPY
LTD
MPY
...
100 Cycles
100 Words Prog Memory

TMS320C25 (MACD combines LT, DMOV, APAC and MPY):
RPTK 49
MACD
53 Cycles
3 Words Prog Memory

Example: MACD

MACD = Multiply by Program Memory and Accumulate with Delay


(Instruction is still present in C54x and C55x)

MACD Smem, pmad, src


Smem = data memory
pmad = program address
src = accumulator (A or B)

Executes (simplified):

(Smem) x (Pmem(at location pmad)) + src -> src ; multiply-accumulate
(Smem) -> Treg                                 ; load data in Treg register
(Smem) -> Smem + 1                             ; copy data to the next memory location
(pmad) + 1 -> pmad                             ; increment program address pointer

When executed under a repeat instruction, MACD takes one cycle per iteration.
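
The simplified MACD semantics above can be mimicked in Python as follows. This is a sketch, not the exact instruction behavior: the function name `macd_repeat` is hypothetical, data and program memory are plain lists, and it is assumed that Smem is addressed indirectly with a post-decrement so that the repeat walks from the oldest sample to the newest.

```python
def macd_repeat(data, coeffs_in_pmem, num_taps):
    """Run the simplified MACD step num_taps times, as under a repeat.

    data: data memory; data[0] is the newest sample, data[num_taps - 1]
          the oldest, followed by one spare word.
    coeffs_in_pmem: coefficients in program memory, in the order consumed.
    """
    acc = 0
    treg = 0
    addr = num_taps - 1      # assumed: start at the oldest sample
    pmad = 0
    for _ in range(num_taps):
        acc += data[addr] * coeffs_in_pmem[pmad]  # multiply-accumulate
        treg = data[addr]                         # (Smem) -> Treg
        data[addr + 1] = data[addr]               # (Smem) -> Smem + 1
        pmad += 1                                 # (pmad) + 1 -> pmad
        addr -= 1                                 # assumed post-decrement
    return acc, treg


delay_line = [1, 2, 3, 0]          # x(n), x(n-1), x(n-2), spare word
pmem = [30, 20, 10]                # c(2), c(1), c(0)
result, last_t = macd_repeat(delay_line, pmem, 3)
```

After the loop the delay line has been shifted by one position as a side effect of the accumulation, so location 0 is free for the next input sample.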


Single Cycle MAC

TMS320C2x Multiplier/ALU
• Single cycle 16x16 bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter

[Figure: program bus (16) and data bus (16); left shifter (0-16); T register (16); MUX; multiplier (16x16); P register (32); left shifter (0-16); MUX; arithmetic logic unit (ALU, 32); carry C; accumulator register (32); left shifter (0-7)]

Courtesy: Texas Instruments

TMS320C2x Enhancements Over C1x

1986:
80/100ns instruction cycle time
Simultaneous single-cycle Multiply/ALU operations
Zero overhead repeat single instruction
64K words of off-chip Data RAM
Optimizing ANSI C-Compiler
544 words of on-chip Data/Program RAM
Multiplier Post Shifter and enhanced Accumulator Post Shifter
74 additional instructions
- Single-cycle MAC and zero overhead repeat
- Long immediate and carry bit support
- More logical and conditional branch operations
- Data block move support
Bit reversed addressing for FFTs
Eight auxiliary registers
Hardware wait states
DMA support
Idle and Powerdown Capability

Courtesy: Texas Instruments


Other memory configurations

Multiple data memories, e.g. Motorola 56000:
- program memory
- X memory
- Y memory

[Figure: two configurations: a program memory with two data memories; a program cache with a program/data memory and a data memory]

Instruction cache
• single instruction (RPTK, repeat in TMS320C2x)
• a few instructions (up to 15 in AT&T 16A)
• ALWAYS under programmer's control!
• ALWAYS known at compile time!


Memory configurations (more)

• Very cost sensitive applications


• all memory ON chip (even in the 80’s!)
• multiple small memories instead of unpredictable memory cache hierarchy
• program memory mostly ROM (now Flash Memory)
• Programmer decides the distribution of arrays over the memories
to make sure that the two parallel reads are from different memory banks!

• More fancy stuff:


• special instructions to move samples in a delay line
• circular buffers for delay lines

Block Diagram (C54x)

• Memory Access
– 4 internal bus pairs
– C,D for data read
– E for data write
– P for program
• Others
– 2 40-bit accumulators
– 40-bit barrel shifter
– 40-bit ALU
– 17b x 17b multiplier and 40b dedicated adder perform a non-pipelined single-cycle MAC


Addressing modes

• 2^16 memory locations


• only 16 bit instruction width means only one immediate address
• most processors: immediate address is two instruction words

• MOST used: register – indirect addressing


• very compact
• very useful for accessing consecutive memory locations in a
repetitive mode

• Needs:
• special address registers
• associated Address calculation units
• operate in parallel
• as many ACU’s as memories

Indirect addressing:
r1 = address of last word in the delay line
r2 = address of last coefficient
r3 = address of last word in the delay line
a1 = new input sample
a0 = *r1-- x *r2--;
Repeat 47 times
a0 = a0 + (*r3--=*r1--)x *r2--;
a1 = a0 + (*r3=a1) x *r2;
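
[Figure: delay line in memory from x[0] to x[n-(N-1)]; values are read with r1 and written one location further with r3; the new sample a1 is written at the front]

*r1-- = read the memory location whose address is stored in r1,
then decrement the contents of r1 (post-modification)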
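
The pointer sequence above can be traced with a short Python sketch. The function name `fir_step_indirect` is hypothetical, and it assumes that the delay line holds the previous samples only (newest first), so that a delay line of L words gives L + 1 taps; the repeat count is then L - 1 (47 for a 48-word delay line).

```python
def fir_step_indirect(delay, coef, new_sample):
    """One output using post-decrement pointers r1, r2, r3.

    delay: previous samples, delay[0] newest ... delay[-1] oldest.
    coef:  len(delay) + 1 coefficients, coef[0] multiplies the new sample.
    """
    r1 = r3 = len(delay) - 1           # last word in the delay line
    r2 = len(coef) - 1                 # last coefficient
    a0 = delay[r1] * coef[r2]          # a0 = *r1-- x *r2--
    r1 -= 1
    r2 -= 1
    for _ in range(len(delay) - 1):    # the repeated instruction
        value = delay[r1]
        delay[r3] = value              # *r3-- = *r1--  (shift by one)
        a0 += value * coef[r2]
        r1 -= 1
        r3 -= 1
        r2 -= 1
    delay[r3] = new_sample             # *r3 = a1
    return a0 + new_sample * coef[r2]  # a1 = a0 + a1 x *r2


line = [3, 2, 1]                       # x(n-1), x(n-2), x(n-3)
y = fir_step_indirect(line, [1, 10, 100, 1000], 4)
```

Every iteration does two reads and one write, which is the extra memory traffic that circular buffers avoid.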

Modulo addressing= circular buffers

Moving samples around:
• requires memory bandwidth (extra write operation)
• extra power consumption

Therefore: circular buffers
• pointers move in a circle
• requires a special ACU with start and end location of the circular buffer in memory and special logic to test boundaries.

[Figure: circular buffer. The location read as x[n-(N-1)] is overwritten with the new x; the location read as x[n-(N-2)] will become x[n-(N-1)]; the location read as x[0] will become x[1]]
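
A minimal Python sketch of a circular delay line follows. The class name `CircularDelayLine` is hypothetical, and the modulo operator stands in for the boundary-test logic of the special ACU.

```python
class CircularDelayLine:
    """Delay line in which only a pointer moves; samples stay in place."""

    def __init__(self, size):
        self.buf = [0] * size
        self.newest = 0                    # index of x[n]

    def push(self, sample):
        # Step the pointer back with wrap-around; the oldest sample
        # is overwritten by the new one (a single write per sample).
        self.newest = (self.newest - 1) % len(self.buf)
        self.buf[self.newest] = sample

    def fir(self, coeffs):
        acc = 0
        idx = self.newest
        for c in coeffs:
            acc += c * self.buf[idx]       # c(i) * x(n - i)
            idx = (idx + 1) % len(self.buf)
        return acc


line = CircularDelayLine(3)
outputs = []
for sample in [4, 5, 6, 7]:
    line.push(sample)
    outputs.append(line.fir([1, 2, 3]))
```

Compared with shifting the samples, each new input costs one write instead of N, at the price of wrap-around logic in the address generation.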

Circular buffer (cont.)

• Example (C54x)
– BK = buffer size (e.g. 6 = 0110, 7 locations)
– Start at location with xxxx 0000 (4 LSB’s have to be zero)
• used for sliding window type operations: convolution,
correlation, FIR filters, etc.

*+AR0(0)%   ; AR0 = 0 (1st value)
*+AR0(5)%   ; AR0 = 5 (2nd value)
*+AR0(2)%   ; AR0 = 1 (3rd value)
*+AR0(-3)%  ; AR0 = 4 (4th value)
*+AR0(6)%   ; AR0 = 4 (5th value)

[Figure: buffer locations 0 to 6]
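
The address sequence in this example can be reproduced with a small Python sketch. The function name `circular_premodify` is hypothetical, the buffer is assumed to start at address 0, and the wrap is taken modulo BK = 6, which is what the listed AR0 values are consistent with.

```python
def circular_premodify(ar, step, bk, base=0):
    """Add step to the pointer before the access, wrapping inside the buffer.

    Assumes the buffer occupies addresses base .. base + bk - 1.
    """
    offset = (ar - base + step) % bk   # Python's % is non-negative here
    return base + offset


ar0 = 0
visited = []
for step in (0, 5, 2, -3, 6):
    ar0 = circular_premodify(ar0, step, bk=6)
    visited.append(ar0)
```

In hardware the modulo is not computed by division; requiring the start address to have its low bits at zero lets the ACU handle the wrap with a compare and a subtract on those low bits only.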


Mobile Wireless Trends


[Figure: subscribers (in thousands, axis from 0 to 1,600,000) per year from 1995 to 2010, for global wireline and global wireless.
Wireline: CAGR 5%, global penetration (2010) 20%.
Wireless (cellular + PCS + WLAS + other): CAGR 21%, global penetration (2010) 21%.
Global population 7 billion, CAGR 1995-2010 1.4%.]


World-wide deployment of mobile communications is exceeding expectations


DSP Evolution and Markets
[Figure: DSP market pie chart. $2B market, 30% growth rate. Segments: disk, cellular infrastructure, other, wireless (mobile handsets, cordless, GPS), modem (V.34, V.90, xDSL), consumer & automotive. Dollar labels on the chart: $270 M, $1.01B, $727 M. Source: Forward Concepts 1996]

[Figure: power (mW/MIP, logarithmic axis from 1 to 10K) versus year (1980 to 2000). Data points: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700), DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210]

The DSP Market Splits

Today's general purpose assembly coded DSP:
• 100 MOPS
• 250 mW
• $40

Mobile terminals: low cost, low power DSPs
• 200-1000 MOPS
• < 100 mW
• $10

Infrastructure: high performance DSPs
• 1-10 GOPS
• 1-5 watts
• < $50

More references:
• P. Faraboschi, G. Desoli, J. Fisher, “The latest word in Digital and Media Processing,” IEEE Signal Processing Magazine, March 1998, pp. 59-85 (download from the INSPEC webpage).
• I. Verbauwhede, M. Touriguian, “Wireless Digital Signal Processors,” Chapter 11 in Digital Signal Processing for Multimedia Systems, eds. K. Parhi, T. Nishitani, Marcel Dekker, Inc.
• C. Nicol, I. Verbauwhede, “DSP Architectures for Next Generation wireless communications,” ISSCC 2000 tutorial.
Recall: Memory architecture

FIR execution on:
• Von Neumann: 4 cycles/tap (instruction fetch, data read, coefficient read, delay-line write)
• Basic Harvard: 2 cycles/tap
• Modified Harvard & repeat loop: 1 cycle per tap & only 3 instructions

Key issues:
• Memory bandwidth by multiple memory banks or multi port memories
• Every memory has its OWN address generation unit
operating in parallel
• Special instructions that combine operations with memory moves:
MACD
• Indirect addressing: *r1++ or *r2--
• circular buffers: extra hardware in the address generation units

FASTER THAN 1 CYCLE PER TAP??


Compute Intensive function 1: FIR (cont.)


[Figure: direct-form FIR filter with 50 taps: delay line of z^-1 elements, multipliers for c(0) ... c(N-1), adders producing y(n)]

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);


y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

One output = 2N reads, N MAC’s, 1 write

Classic Harvard: one output = N cycles


FIR speed-up
FIR filtering: two outputs in parallel

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);


y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Two outputs = 4N reads, 2N MAC’s, 2 writes


Dual MAC architecture with ONLY 2 data busses??
• Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
• Run MAC at double frequency, read two 32-bit numbers: solution by Matsushita
• Insert delay register: solution by Atmel's LODE
Example 3: Lucent DSP16210


Inner loop of 32-tap FIR filter:

do 14 {   // one instruction!
  a0=a0+p0+p1   p0=xh*yh   p1=xl*yl   y=*r0++   x=*pt0++
}

• Outer loop: 19 cycles, 38 bytes
• 1 cycle in inner loop
• 5 exec units used in inner loop
• 2 MACs per cycle
• Horizontal parallelism, one sample at a time
• 2G mobile wireless base-stations

[Figure: datapath with buses XDB(32) and IDB(32), registers Y(32) and X(32), two 16 x 16 multipliers, product registers p0 (32) and p1 (32), two shift/saturate units, ALU, ADD, BMU, accumulator file 8 x 40]

Courtesy: Gareth Hughes, Bell Labs Australia
FIR on Lode

FIR filter: two outputs in parallel with delay register


y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Total energy for one output sample:

Energy                       Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations        N            N          N
No. of memory reads          2N           2N         N
No. of instruction cycles    N            N/2        N/2


FIR on Lode

Two MAC units with dedicated bus network

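• DB0 fetches coefficient c(i)
• DB1 fetches data
• LREG delays input data
• A0 stores y(n) output
• A1 stores y(n+1) output

[Figure: buses DB1(16) and DB0(16). MAC0 multiplies c(i) with x(n-i); MAC1 multiplies c(i) with x(n-i+1), supplied through the delay register LREG; the two sums go to the accumulators A0 and A1]
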

Same structure can be used for IIR
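
The delay-register idea can be sketched in Python to check the read count. This is an illustration only: the function name `dual_mac_fir` is hypothetical, the input is a plain list indexed from 0, and one extra read is assumed to preload LREG with x(n+1).

```python
def dual_mac_fir(x, c, n):
    """Compute y(n) and y(n+1) together, sharing every memory read.

    Requires n >= len(c) - 1 and n + 1 < len(x).
    Returns (y_n, y_n_plus_1, memory_reads).
    """
    lreg = x[n + 1]            # preload the delay register (one read)
    reads = 1
    a0 = a1 = 0
    for i in range(len(c)):
        coef = c[i]            # DB0 fetches coefficient c(i)
        data = x[n - i]        # DB1 fetches data x(n-i)
        reads += 2
        a0 += coef * data      # MAC0: contribution to y(n)
        a1 += coef * lreg      # MAC1: uses x(n-i+1) held in LREG
        lreg = data            # LREG delays the input data
    return a0, a1, reads


y_n, y_n1, reads = dual_mac_fir([4, 5, 6, 7], [1, 2, 3], 2)
```

Two outputs cost 2N + 1 reads here, so roughly N reads per output instead of 2N, in line with the "Dual MAC with REG" column of the table.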

Arithmetic
DSP processors come in two flavors:
• floating point
• most popular one: Sharc’s from Analog Devices
• fixed point
• usually 16 bit, sometimes 24 bit (audio processors)
• newer processors might have wider data paths or registers
(TI C6x: 16x16 mpy, 32 bit registers, 40 bit ALU)

[Figure: basic datapath: 16 x 16 multiplier, 32-bit product, 40-bit ALU, 40-bit shifter, select 16 bit]

Overflow:

• Saturation logic combined with output shifter

[Figure: 16 x 16 multiplier, 32-bit product, 40-bit ALU, 40-bit shifter/saturate, select 16 bit]

• How to implement saturation?
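
One possible answer, sketched in Python: clamp the wide accumulator value to the largest or smallest value that fits the output width instead of letting it wrap around. The function names `saturate` and `shift_and_saturate` are hypothetical, and a two's complement 16-bit output is assumed.

```python
def saturate(value, bits=16):
    """Clamp value to the two's complement range of the given width."""
    highest = (1 << (bits - 1)) - 1      # 0x7FFF for 16 bits
    lowest = -(1 << (bits - 1))          # -0x8000 for 16 bits
    return max(lowest, min(highest, value))


def shift_and_saturate(acc, shift, bits=16):
    """Arithmetic right shift of the accumulator, then clamp to `bits` bits."""
    return saturate(acc >> shift, bits)


too_big = saturate(40000)
too_small = saturate(-40000)
in_range = saturate(123)
from_acc = shift_and_saturate(1 << 31, 15)
```

In hardware the same decision is typically made by checking whether the bits above the selected output field all equal the sign bit; if they do not, the output is forced to the maximum or minimum value.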

Overflow:

• Input shifter: scaling, lining up of the inputs
  = loss of precision if shifted down too much.

[Figure: 16 x 16 multiplier, input shifter, 32-bit product, 40-bit ALU, 40-bit shifter/saturate, select 16 bit]

Block normalization

• Often used in speech coders because the dynamic range of the input signals is unknown.
• Scale the whole array of values such that the maximum entry sits in the range [0.5, 1)
• minimum loss of precision

TI C54x:
EXP A  <- counts the number of sign bits, stores this number in TREG
NORM A <- shifts the accumulator by the number of bits in TREG

Lode:
Repeat N;
A3 = expmn (*r0), r0++; (stores # of sign bits in special register ASR)
Repeat N;
*r0 = *r0 < ASR, r0++;
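
The two passes above (find the smallest number of redundant sign bits, then shift every entry by that amount) can be sketched in Python. The function names are hypothetical and 16-bit two's complement (Q15) values are assumed.

```python
def redundant_sign_bits(value, width=16):
    """Number of left shifts possible before the sign bit would change."""
    magnitude = value if value >= 0 else ~value
    return width - 1 - magnitude.bit_length()


def block_normalize(values, width=16):
    """Shift the whole block left by the minimum redundant sign bit count."""
    shift = min(redundant_sign_bits(v, width) for v in values)
    return [v << shift for v in values], shift


scaled, shift = block_normalize([100, -50, 3])
```

After scaling, the largest entry has magnitude of at least 2^14, i.e. it sits in [0.5, 1) when read as a Q15 fraction, and the common shift count is kept so the block can be scaled back later.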

Pipelining:
Cycle:          1       2       3       4       5       6
Instruction 1:  Fetch   Decode  MemAcc  Execute
Instruction 2:          Fetch   Decode  MemAcc  Execute
Instruction 3:                  Fetch   Decode  MemAcc  Execute

Fetch = fetch instruction
Decode = decode instruction
Memory access (MemAcc) = address generation and read operands
Execute = perform operation

Pipelining
How does the pipeline appear to the programmer?
Lee's paper (Part II) discusses 3 variations
(the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding

Interlocking: the instructions appear as if executed one after another
Interlocking on C10
Cycle:  1       2       3       4       5       6       7
LT:     Fetch   Decode  MemAcc  Execute
MPY:            Fetch   Decode  MemAcc  Execute
LTD:                    Fetch   Decode  MemAcc  Execute
MPY:                            Fetch   Decode  MemAcc  Execute

Reservation table:
PMEM: LT, MPY, LTD, MPY, LTD, MPY
DMEM: data, coef1, data, coef2, ...
MPY:
ALU:

Interlocking on C2x

The programmer does not know the pipeline.

If an access conflict occurs, the hardware will “stall” and finish one (part of an)
instruction before finishing a second part.

RPTK 49
MACD

Reservation table:
PMEM: RPTK, MACD, coef1, coef2, coef3
DMEM: data1, data2, ...
MPY:
ALU:
Time stationary

An instruction specifies “one instruction cycle”,
i.e. everything that occurs in parallel in that cycle.

Cycle:          1       2       3       4       5       6       7
Instruction 1:  Fetch   Decode  MemAcc  Execute
Instruction 2:          Fetch   Decode  MemAcc  Execute
Instruction 3:                  Fetch   Decode  MemAcc  Execute
Instruction 4:                          Fetch   Decode  MemAcc  Execute

Example:
Motorola:
MAC X0,Y0,A   X:(R0)+,X0   Y:(R4)-,Y0
(multiply-accumulate of the values read from memory in the previous cycle)
Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
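
The Lucent-style instruction can be traced in Python to see that each part of one instruction works on a different sample. The function name `time_stationary_fir` is hypothetical; the sketch assumes equally long data and coefficient arrays, zero-filled reads past the end, and two extra iterations to drain the pipeline.

```python
def time_stationary_fir(data, coeffs):
    """Repeat `a0 = a0 + p, p = x * y, y = *r0++, x = *pt++`.

    All four parts read the register values from the start of the cycle,
    so a sample read in cycle k is multiplied in k+1 and added in k+2.
    """
    a0 = p = x = y = 0
    r0 = pt = 0
    for _ in range(len(data) + 2):          # +2 cycles to drain p and a0
        new_a0 = a0 + p
        new_p = x * y
        new_y = data[r0] if r0 < len(data) else 0
        new_x = coeffs[pt] if pt < len(coeffs) else 0
        r0 += 1
        pt += 1
        a0, p, x, y = new_a0, new_p, new_x, new_y
    return a0


total = time_stationary_fir([1, 2, 3], [4, 5, 6])
```

The programmer has to account for the fill and drain cycles explicitly, which is the price of a coding style that describes one cycle rather than one data item.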

Data stationary
Time stationary: working on different samples in one instruction
Data stationary: describes what happens with one input data from
start to end.

Example (Lode):

*r3++ = a0 += a2 * *r2++;

(Read from memory with pointer register r2,
multiply with a2, add to a0 and store back in a0,
store the result in memory with pointer r3,
post-modify r2 and r3.)

Fetch Decode Read Execute Write

Control & Pipeline for DSP’s
RISC: load/store machine
memory access with load/store instructions (DLX, MIPS, D10V)

Fetch | Decode | Execute | Memory Access | Write Back
(Execute stage: execution / address generation; Memory Access stage: memory access / branch)

Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)

Fetch | Decode | Memory Access | Execute | Write Back
(Memory Access stage: memory access; Execute stage: execution)

Excellent for number crunching!

Pipeline RISC compared to DSP


RISC example:
r0 = *p0;       // load data
a0 = a0 + r0;   // execute

Fetch   Decode  Execute Memory Access
        Fetch   Decode  Execute Memory Access
                Fetch   Decode  Execute Memory Access

Too expensive for DSP

DSP: memory intensive applications:

Fetch   Decode  Memory Access   Execute
        Fetch   Decode          Memory Access   Execute
                Fetch           Decode          Memory Access   Execute
                                Fetch           Decode          Memory Access   Execute

Penalty: data dependent branch is expensive
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized
assembler - large development effort with minimal tool support

HLL algorithmic model -> (hand coded assembler) -> prototype code -> (optimize & debug) -> production code

Long, frustrating time to market

Fragile legacy code

Widely used in handhelds, but change in basestations (Part II)

Lode Core Architecture

Domain specific instruction set

Basic instruction set for general purpose DSP,
e.g. MAC, min, max, etc.

Extra instructions for performance with every new generation,
e.g. “square distance and accumulate”:

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

One 32 bit instruction:

a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;

Bus network and instruction set design go together

CISC, thus compiler unfriendly
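
What the single "square distance and accumulate" instruction does per iteration can be sketched in Python. The function name `square_distance_accumulate` is hypothetical, and it is an assumption that `< asr` denotes a left shift of the difference by the count held in the ASR register; with `asr=0` the result is the plain sum of squared differences.

```python
def square_distance_accumulate(x, y, asr=0):
    """Accumulate sqr(abs((x[i] - y[i]) << asr)) over both arrays."""
    a0 = 0
    for xi, yi in zip(x, y):       # r0++ and r1++ walk the two arrays
        a3 = abs((xi - yi) << asr) # a3 = abs(*r0 - *r1 < asr), shift assumed
        a0 += a3 * a3              # a0 = a0 + sqr(a3)
    return a0


distance = square_distance_accumulate([1, 2, 3], [3, 2, 0])
scaled_distance = square_distance_accumulate([1, 2, 3], [3, 2, 0], asr=1)
```

Packing two reads, a subtract, a shift, an absolute value, a square and an accumulate into one instruction only works if the bus network can feed all of those units in the same cycle, which is the point of the slide.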



Other control features

Hardware looping:
• Because a software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported

Interrupts: hardware with shadow registers for extremely fast context switching.

Special instruction cache:
• Single instruction “repeat” buffer
• Multiple instruction cache: under programmer's control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache

Predictable worst case execution time!
