HJ94 Slides 8 DSP
Fundamentals
Ingrid Verbauwhede
Iverbauw@esat.kuleuven.ac.be
HJ94, Spring 2004, Ingrid Verbauwhede, les 8
Motivation
• Architecture exploration
• Floating point
• Fixed point
• Algorithm transformations
• Architecture alternatives
References
• The origins:
• E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP
magazine, October 1988, pg. 4-19.
• Part II, IEEE ASSP magazine, January 1989, pg. 4-14
• Good overview:
• P. Lapsley, J. Bier, A. Shoham, E.A.Lee, “DSP Processor Fundamentals:
Architectures and Features,” IEEE Press, 1998.
Processor Components:
• Instruction Processing Unit
• Memory Management Unit
Von Neumann machine
[Figure: processor core (mpy, ALU) connected to a single memory over one address bus and one data bus]
FIR implementation
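\[ y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i) \]
(50 taps)
[Figure: tapped delay line. The input x(n) passes through a chain of z^-1 delays giving x(n-1) ... x(n-(N-1)); each tap is multiplied by its coefficient c(0) ... c(N-1) and the products are summed into y(n).]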
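The FIR sum above can be sketched in C as follows. This is a minimal illustration, not code from the slides: the names fir_step, delay and coef are hypothetical, and a 64-bit accumulator is assumed so that the sketch cannot overflow (a real 16-bit DSP would use a 32-bit or 40-bit accumulator).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct-form FIR: y(n) = sum_{i=0}^{N-1} c(i) * x(n-i).
 * delay[i] holds x(n-i); the oldest sample falls off the end.
 * All names are illustrative assumptions. */
int64_t fir_step(int16_t *delay, const int16_t *coef, size_t ntaps, int16_t x_new)
{
    assert(delay != NULL && coef != NULL && ntaps > 0);

    /* Shift the delay line by one position: x(n-i) becomes x(n-i-1). */
    for (size_t i = ntaps - 1; i > 0; --i)
        delay[i] = delay[i - 1];
    delay[0] = x_new;

    /* One multiply-accumulate per tap. */
    int64_t acc = 0;
    for (size_t i = 0; i < ntaps; ++i)
        acc += (int64_t)coef[i] * delay[i];
    return acc;
}
```

Feeding a unit impulse through this function returns the coefficients one by one, which is a quick sanity check of the tap ordering. Each tap costs a coefficient read, a data read, a data write for the shift, a multiply and an add, which is the memory traffic the following slides try to reduce.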
FIR on Von Neumann
[Figure: separate Program Memory and Data Memory; Instruction Processing Unit; Multiply-Accumulate datapath with 16 x 16 mpy and ALU]
Example 1: TMS320C10 (1982)
Datapath elements shown: T (16), Multiplier, P (32), MUX, ALU (32)

ZAC        ACC = 0
LT   X1    T = X1
MPY  A     P = A*X1
LTA  X2    ACC = A*X1; T = X2
MPY  B     P = B*X2
LTA  X3    ACC = A*X1 + B*X2; T = X3
MPY  C     P = C*X3
LTA  X4    ACC = A*X1 + B*X2 + C*X3; T = X4
MPY  D     P = D*X4
APAC       ACC = A*X1 + B*X2 + C*X3 + D*X4
SACH Y1    store the high word of the 32-bit result
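To make the register traffic in the listing above concrete, here is a small C model of the T, P and ACC registers. It is a sketch under assumptions: the struct c10_regs and the helper functions are hypothetical names that mirror the mnemonics, and the sum is assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register model: T (16 bit), P (32 bit), ACC (32 bit). */
typedef struct { int16_t t; int32_t p; int32_t acc; } c10_regs;

static void zac(c10_regs *r)            { r->acc = 0; }                  /* ACC = 0        */
static void lt(c10_regs *r, int16_t x)  { r->t = x; }                    /* T = x          */
static void mpy(c10_regs *r, int16_t c) { r->p = (int32_t)r->t * c; }    /* P = T * c      */
static void lta(c10_regs *r, int16_t x) { r->acc += r->p; r->t = x; }    /* ACC += P; T = x */
static void apac(c10_regs *r)           { r->acc += r->p; }              /* ACC += P       */

/* Four-term sum of products, one helper call per instruction in the listing.
 * Assumes the result fits in the 32-bit accumulator. */
int32_t sum_of_products4(const int16_t x[4], const int16_t c[4])
{
    assert(x != 0 && c != 0);
    c10_regs r = {0, 0, 0};
    zac(&r);
    lt(&r, x[0]);  mpy(&r, c[0]);
    lta(&r, x[1]); mpy(&r, c[1]);
    lta(&r, x[2]); mpy(&r, c[2]);
    lta(&r, x[3]); mpy(&r, c[3]);
    apac(&r);
    return r.acc;
}
```

The model shows why the code needs two instructions per term: the multiplier and the ALU each do useful work only in every other instruction.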
TMS320C1x Memory and Buses
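[Figure: TMS320C1x memory and buses. Recoverable labels: control signals DEN, MEN, WE; Data and Data Address buses (16 bits); Program Control, CPU and Instruction Register; up to 8K words of on-chip Program ROM; 4K words of EPROM and OTP available; up to 64K words of External Program Memory. The processor view is unchanged: Program Memory, Data Memory, Instruction Processing Unit, and a Multiply-Accumulate datapath with 16 x 16 mpy and ALU.]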
Same FIR: 53 cycles, 3 prog words
\[ y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i) \]
(50 taps)

                   TMS320C10      TMS320C25
Code               LTD            RPTK 49
                   MPY            MACD
                   LTD
                   ...
                   MPY
Cycles             100            53
Program memory     100 words      3 words

LTD combines LT, DMOV and APAC. MACD combines LT, DMOV, APAC and MPY.
Example: MACD
Executes (simplified):
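A minimal C sketch of one MACD step, assuming the usual description of the instruction (accumulate the previous product, load T, multiply, move the data word to the next-higher address). The struct mac_regs and the functions macd_step and fir_macd are hypothetical names, and the accumulator is assumed not to overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register model for the multiply-accumulate datapath. */
typedef struct { int16_t t; int32_t p; int32_t acc; } mac_regs;

/* One MACD step on data[idx] with coefficient coef.
 * data must have room for index idx + 1. */
void macd_step(mac_regs *r, int16_t *data, size_t idx, int16_t coef)
{
    r->acc += r->p;               /* APAC: add the previous product      */
    r->t = data[idx];             /* LT:   load T with the data sample   */
    r->p = (int32_t)r->t * coef;  /* MPY:  new product                   */
    data[idx + 1] = data[idx];    /* DMOV: shift sample along delay line */
}

/* FIR over ntaps samples, oldest first, mimicking RPTK + MACD + APAC.
 * x[0] is the newest sample; x must hold ntaps + 1 words. */
int32_t fir_macd(int16_t *x, const int16_t *c, size_t ntaps)
{
    assert(x != NULL && c != NULL && ntaps > 0);
    mac_regs r = {0, 0, 0};
    for (size_t i = ntaps; i-- > 0; )
        macd_step(&r, x, i, c[i]);
    r.acc += r.p;                 /* final APAC picks up the last product */
    return r.acc;
}
```

After fir_macd returns, every sample has moved up by one position, so x[0] is free for the next input. This is why the delay-line update costs no extra cycles once it is folded into the MAC instruction.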
Single Cycle MAC
TMS320C2x Multiplier/ALU
• Single-cycle 16x16-bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter
[Figure: C2x multiplier/ALU datapath. Labels: Program Bus; Data Bus (16); Left Shifter (0-16); T Register (16); MUX; Multiplier (16x16); P Register (32); Left Shifter (0-16); MUX; Arithmetic Logic Unit (ALU, 32); carry C; Accumulator Register (32); Left Shifter (0-7). Courtesy: Texas Instruments]
1986:
80/100ns instruction cycle time
Simultaneous single-cycle Multiply/ALU operations
Zero overhead repeat single instruction
64K words of off-chip Data RAM
Optimizing ANSI C-Compiler
544 words of on-chip Data/Program RAM
Multiplier Post Shifter and enhanced Accumulator Post Shifter
74 additional instructions
- Single-cycle MAC and zero overhead repeat
- Long immediate and carry bit support
- More logical and conditional branch operations
- Data block move support
Bit reversed addressing for FFTs
Eight auxiliary registers
Hardware wait states
DMA support
Idle and Powerdown Capability
Other memory configurations
Instruction cache
• single instruction (RPTK repeat in TMS320C2x)
• a few instructions (up to 15 in AT&T 16A)
• ALWAYS under the programmer's control!
• ALWAYS known at compile time!
Block Diagram (C54x)
• Memory Access
– 4 internal bus pairs
– C,D for data read
– E for data write
– P for program
• Others
– 2 40-bit Accum.
– 40-bit Barrel shifter
– 40-bit ALU
– 17b x 17b multiplier and 40b dedicated adder perform a non-pipelined single-cycle MAC
Addressing modes
• Needs:
• special address registers
• associated Address calculation units
• operate in parallel
• as many ACU’s as memories
Indirect addressing:
r1 = address of last word in the delay line
r2 = address of last coefficient
r3 = address of last word in the delay line
a1 = new input sample
a0 = *r1-- x *r2--;
Repeat 47 times
a0 = a0 + (*r3--=*r1--)x *r2--;
a1 = a0 + (*r3=a1) x *r2;
Read a1
[Figure: delay line from x[0] to x[n-(N-1)], read with r1 and written with r3; the new sample enters from a1]
Circular buffer (cont.)
• Example (C54x)
– BK = buffer size (e.g. 6 = 0110, 7 locations)
– Start at location with xxxx 0000 (4 LSB’s have to be zero)
• used for sliding window type operations: convolution,
correlation, FIR filters, etc.
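The address update behind such a circular buffer can be sketched in C. This is a generic model of the aligned-base scheme described above, not the exact C54x rule set: the function name circ_advance and its parameters are assumptions, with nbits standing for the number of address LSBs that are zero at the buffer start.

```c
#include <assert.h>
#include <stdint.h>

/* Advance addr by step inside a circular buffer of `size` words whose base
 * address has its low `nbits` bits equal to zero (hypothetical model). */
uint16_t circ_advance(uint16_t addr, int step, uint16_t size, unsigned nbits)
{
    assert(nbits < 16 && size > 0 && size < (1u << nbits));
    assert(step > -(int)size && step < (int)size);

    uint16_t mask = (uint16_t)((1u << nbits) - 1u);
    uint16_t base = (uint16_t)(addr & (uint16_t)~mask);  /* upper bits: buffer base */
    int index = (int)(addr & mask);                      /* lower bits: position    */
    assert(index < (int)size);

    index += step;
    if (index < 0)
        index += size;          /* wrapped below the start */
    else if (index >= (int)size)
        index -= size;          /* wrapped past the end    */
    return (uint16_t)(base | (uint16_t)index);
}
```

Because the base is aligned, splitting an address into base and index is a pair of mask operations, which is the kind of extra hardware that fits cheaply into an address generation unit. With a circular buffer the FIR delay line no longer needs the per-tap data move: only the pointer to the newest sample moves.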
[Figure: wireless subscribers (in thousands) per year, 1995 to 2010, vertical axis from 0 to 1,400,000. Annotations: Wireless CAGR 21%; Global Penetration (2010) 21% (Cellular + PCS + WLAS + Other); Global Pop 7 billion; CAGR 1995-2010 1.4%.]
DSP Evolution and Markets
$2B market, 30% growth rate
[Figure: DSP market by segment. Labels: Disk, Cellular Infrastructure, Other, Wireless, Mobile Handsets, Cordless, Modem, GPS, V.34, V.90, xDSL, Consumer & Automotive. Values shown: $270 M, $1.01B, $727 M. Source: Forward Concepts 1996]
[Figure: power (mW/MIP, logarithmic axis from 1 to 10K) versus year, 1980 to 2000. Points: M68000 ($200), 80286 ($200), 80386 ($300), DSP-1 ($150), Pentium ($300), DSP-32C ($250), Pentium (MMX) ($700), DSP16210.]
Today’s general purpose, assembly coded DSP:
• 100 MOPS
• 250 mW
• $40
Mobile Terminals: low cost, low power DSPs
Infrastructure: High Performance DSPs
Motivation
• Architecture exploration
• Floating point
• Fixed point
• Algorithm transformations
• Architecture alternatives
More references:
• P. Faraboschi, G. Desoli, J. Fisher, “The latest word in Digital and
Media Processing,” IEEE Signal Processing Magazine, March 1998,
pg. 59-85, (download from the INSPEC webpage).
• I. Verbauwhede, M. Touriguian, “Wireless Digital Signal Processors,”
Chapter 11 in Digital Signal Processing for Multimedia Systems,
Eds. By K. Parhi, T. Nishitani, Marcel Dekker, Inc.
• C. Nicol, I. Verbauwhede, “DSP Architectures for Next Generation wireless
communications,” ISSCC 2000 tutorial.
Recall: Memory architecture
Key issues:
• Memory bandwidth by multiple memory banks or multi-port memories
• Every memory has its OWN address generation unit, operating in parallel
• Special instructions that combine operations with memory moves: MACD
• Indirect addressing: *r1++ or *r2--
• Circular buffers: extra hardware in the address generation units
FIR speed-up
FIR filtering: two outputs in parallel
FIR on Lode
No of Memory reads 2N 2N N
FIR on Lode
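• DB0 fetches coefficient c(i)
• DB1 fetches data
• LREG delays input data
• A0 stores y(n) output
• A1 stores y(n+1) output
[Figure: data buses DB1(16) and DB0(16) feed two multiply-accumulate units, MAC1 and MAC0. LREG supplies x(n-i+1) alongside the fetched x(n-i) and c(i); accumulators A0 and A1 collect the two outputs.]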
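The idea of computing two outputs per pass can be sketched in C. This is an illustration under assumptions, not Lode code: the function name fir_two_outputs is hypothetical, the variable lreg plays the role of LREG, and 64-bit accumulators are used so the sketch cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute y(n) and y(n+1) in one pass over the coefficients.
 * x must contain samples x[0] .. x[n+1], and n + 1 >= ntaps. */
void fir_two_outputs(const int16_t *x, size_t n,
                     const int16_t *c, size_t ntaps,
                     int64_t *y_n, int64_t *y_n1)
{
    assert(x != NULL && c != NULL && y_n != NULL && y_n1 != NULL);
    assert(ntaps > 0 && n + 1 >= ntaps);

    int64_t a0 = 0, a1 = 0;        /* a0 -> y(n), a1 -> y(n+1)            */
    int16_t lreg = x[n + 1];       /* delayed data: starts as x(n+1)      */

    for (size_t i = 0; i < ntaps; ++i) {
        int16_t ci = c[i];         /* one coefficient read (DB0)          */
        int16_t xi = x[n - i];     /* one data read (DB1)                 */
        a0 += (int64_t)ci * xi;    /* MAC0: c(i) * x(n-i)                 */
        a1 += (int64_t)ci * lreg;  /* MAC1: c(i) * x(n-i+1), from LREG    */
        lreg = xi;                 /* this sample is reused next round    */
    }
    *y_n = a0;
    *y_n1 = a1;
}
```

Each loop iteration performs two memory reads but two multiply-accumulates, so the reads per output sample drop to about N instead of 2N.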
Arithmetic
DSP processors come in two flavors:
• floating point
• most popular one: Sharc’s from Analog Devices
• fixed point
• usually 16 bit, sometimes 24 bit (audio processors)
• newer processors might have wider data paths or registers
(TI C6x: 16x16 mpy, 32 bit registers, 40 bit ALU)
Basic datapath: 16 x 16 mpy -> 32 bit -> ALU (40 bit) -> 40 bit -> shifter -> select 16 bit
Overflow:
Datapath: 16 x 16 mpy -> 32 bit -> ALU (40 bit) -> 40 bit -> shifter / saturate -> select 16 bit
Overflow:
Datapath: 16 x 16 mpy -> input shifter -> 32 bit -> ALU (40 bit) -> 40 bit -> shifter / saturate -> select 16 bit
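A minimal C sketch of the accumulate, saturate and select steps in this datapath. The function names mac40 and saturate_select16 are hypothetical; the 40-bit accumulator is modeled with a 64-bit integer, and the right shift of a negative value is assumed to be arithmetic, as it is on mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Accumulate a 16 x 16 product into a wide accumulator.
 * 8 guard bits above 32 allow at least 256 products without overflow. */
int64_t mac40(int64_t acc, int16_t a, int16_t b)
{
    return acc + (int64_t)a * b;
}

/* Saturate the accumulator to 32 bits, then select the upper 16 bits. */
int16_t saturate_select16(int64_t acc)
{
    if (acc > INT32_MAX) acc = INT32_MAX;   /* clip positive overflow */
    if (acc < INT32_MIN) acc = INT32_MIN;   /* clip negative overflow */
    /* Assumes arithmetic right shift for negative values. */
    return (int16_t)(acc >> 16);
}
```

Saturation replaces the wrap-around of two's complement arithmetic with clipping, which turns a large sign error into a bounded distortion. The guard bits postpone the decision to the end of the sum, so intermediate overflows that later cancel do no harm.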
Block normalization
TI C54x:
EXP A    <- counts the number of sign bits, stores this number in TREG
NORM A   <- shifts the accumulator by the number of bits in TREG
Lode:
Repeat N;
A3 = expmn (*r0), r0++;    (stores # of sign bits in special register ASR)
Repeat N;
*r0 = *r0 << ASR, r0++;
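The two passes of block normalization can be sketched in C: first find the smallest number of redundant sign bits in the block, then shift every sample left by that amount. This is an illustration under assumptions, not the C54x or Lode implementation; redundant_sign_bits and block_normalize are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of redundant sign bits in a 16-bit value (0 .. 15). */
unsigned redundant_sign_bits(int16_t v)
{
    uint16_t u = (uint16_t)v;
    if (v < 0)
        u = (uint16_t)~u;       /* fold negative values onto positive ones */
    unsigned n = 0;
    /* Count zero bits below the sign bit, from bit 14 downwards. */
    for (uint16_t bit = 0x4000; bit != 0 && (u & bit) == 0; bit >>= 1)
        ++n;
    return n;
}

/* Shift the whole block left by the smallest redundant-sign-bit count.
 * Returns the shift so the caller can keep it as the block exponent. */
unsigned block_normalize(int16_t *x, size_t n)
{
    assert(x != NULL && n > 0);
    unsigned shift = 15;
    for (size_t i = 0; i < n; ++i) {        /* pass 1: find the minimum */
        unsigned s = redundant_sign_bits(x[i]);
        if (s < shift)
            shift = s;
    }
    for (size_t i = 0; i < n; ++i)          /* pass 2: scale the block  */
        x[i] = (int16_t)(x[i] * (1 << shift));  /* multiply avoids shifting a negative */
    return shift;
}
```

Using the minimum over the block guarantees that no sample overflows while the largest sample uses the full 16-bit range.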
Pipelining:
[Figure: pipeline diagram along a time axis]
Pipelining
How does the pipeline appear to the programmer?
Lee’s paper (part II) discusses 3 variations
(the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding
Interlocking on C10
Pipeline stages for LT: Fetch | Decode | Memory Access | Execute
Reservation table (rows PMEM, MPY, ALU; only the PMEM row is recoverable):
PMEM: LT | MPY | LTD | MPY | LTD | MPY
Interlocking on C2x
RPTK 49
MACD
Reservation table (rows PMEM, MPY, ALU; only the PMEM row is recoverable):
PMEM: RPTK | MACD | coef1 | coef2 | coef3
Time stationary
Data stationary
Time stationary: one instruction specifies everything that happens in one cycle, so it works on different samples at once.
Data stationary: one instruction describes what happens to one input sample from start to end.
Example (Lode):
Control & Pipeline for DSP’s
RISC: load/store machine
memory access with load/store instructions (DLX, MIPS, D10V)
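RISC pipeline: Fetch -> Decode -> Execute -> Memory Access -> Write Back
• Execute stage: execution / address generation
• Memory Access stage: memory access / branch
Excellent for complex decision making!
[Figure: second pipeline diagram with stages labelled Execution and Memory access]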
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized
assembler - large development effort with minimal tool support
Domain specific instruction set
\[ D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2 \]
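In plain C the distance sum above takes a subtract, a multiply and an add per term, which is the sequence a domain-specific instruction would fold into one operation. The sketch below assumes scalar 16-bit samples; the name squared_distance is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* D = sum_{i=0}^{N-1} (x(i) - y(i))^2 for scalar samples. */
int64_t squared_distance(const int16_t *x, const int16_t *y, size_t n)
{
    assert(x != NULL && y != NULL);
    int64_t d = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t diff = (int32_t)x[i] - y[i];  /* subtract               */
        d += (int64_t)diff * diff;            /* square and accumulate  */
    }
    return d;
}
```

On a plain MAC datapath the subtraction needs its own instruction in every iteration, so a combined subtract-square-accumulate instruction roughly halves the inner loop.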
Hardware looping:
• Because software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported
Motivation
• Architecture exploration
• Floating point
• Fixed point
• Algorithm transformations
• Architecture alternatives