EECC722 - Shaaban
#1 lec # 8 Fall 2004 10-
Computing Engine Choices
[Figure: spectrum of computing engines, trading flexibility against performance: Application-Specific Processors (ASPs), e.g. Digital Signal Processors (DSPs); Configurable Hardware; Co-Processors; Application-Specific Integrated Circuits (ASICs)]
Selection Factors:
- Type and complexity of computational algorithms (general purpose vs. specialized)
- Desired level of flexibility
- Performance
- Development cost
- System cost
- Power requirements
- Real-time constraints
Digital Signal Processor (DSP) Architecture
• Classification of Processor Applications
• Requirements of Embedded Processors
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs:
– Conventional DSPs: TI TMS320C54xx
– Enhanced Conventional DSPs: TI TMS320C55xx
– VLIW DSPs: TI TMS320C62xx, TMS320C64xx
– Superscalar DSPs: LSI Logic ZSP400 DSP core
Processor Applications
(Cost/complexity increases toward the top of this list; volume increases toward the bottom.)
• General Purpose Processors (GPPs) - high performance.
– RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
– Used for general purpose software
– Heavy weight OS - Windows, UNIX
– Workstations, Desktops (PC’s), Clusters
• Embedded processors and processor cores
– e.g: Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800...
– Often require Digital signal processing (DSP) support or other
application-specific support (e.g network, media processing)
– Single program
– Lightweight, often realtime OS or no OS
– Examples: Cellular phones, consumer electronics .. (e.g. CD players)
• Microcontrollers
– Extremely cost/power sensitive
– Single program
– Small word size - 8 bit common
– Highest volume processors by far
– Examples: Control systems, Automobiles, toasters, thermostats, ...
(The embedded processors and microcontrollers above are examples of application-specific processors.)
$30B Processor Markets
[Pie chart: market share by processor type]
– 32-bit micro: $5.2B / 17%
– 32-bit DSP: $1.2B / 4%
– DSP: $10B / 33%
– 16-bit micro: $5.7B / 19%
– 8-bit micro: $9.3B / 31%
The Processor Design Space
[Figure: the processor design space, plotted as performance versus processor cost]
– Microcontrollers: cost is everything
– Embedded processors: application-specific architectures for performance; real-time constraints; specialized applications; low power/cost constraints
– Microprocessors (GPPs): performance is everything, and software rules
Requirements of Embedded Processors
• Usually must meet real-time constraints
• Optimized for a single program - code often in on-chip ROM or off-chip EPROM
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
– Lowest possible area
• High computational efficiency: Computation per unit area
– Implementation technology usually behind the leading edge
– High level of integration of peripherals (System-on-Chip -SoC- approach
reduces system cost)
• Fast time to market
– Compatible architectures (e.g. ARM family) allow reusable code
– Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability
Embedded Processors
Area of processor cores = Cost
Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly,
Von Neumann (ENIAC)
• DSP processors are microprocessors designed for efficient mathematical
manipulation of digital signals.
– DSP evolved from Analog Signal Processors (ASPs), using analog
hardware to transform physical signals (classical electrical
engineering)
– The move from ASP to DSP happened because:
• DSP is insensitive to the environment (e.g., same response in snow or desert, if it works at all)
• DSP performance is identical even with variations in components; the behavior of two analog systems varies even if they are built with the same components at 1% tolerance
• Different history and different applications requirements led to different
terms, different metrics, architectures, some new inventions.
DSP vs. General Purpose CPUs
• DSPs tend to run one program, not many programs.
– Hence OSes (if any) are much simpler, there is no virtual memory or protection, ...
• DSPs usually run applications with hard real-time constraints:
– DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power..)
– You must account for anything that could happen in a time slot (DSP algorithm
inner-loop, data sampling rate)
– All possible interrupts or exceptions must be accounted for and their collective
time be subtracted from the time interval.
• Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
– Requires high memory bandwidth for streaming real-time data samples
• The design of DSP architectures and ISAs driven by the requirements of
DSP algorithms.
– Thus DSPs are application-specific processors
DSP vs. GPP
• The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).
– MAC is common in DSP algorithms that involve computing a vector dot product, such as
digital filters, correlation, and Fourier transforms.
– DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how
many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
– Infinite Impulse Response (IIR) filters
– Finite Impulse Response (FIR) filters
– FFT, and
– convolvers
• In DSPs, target algorithms are important:
– Binary compatibility not a major issue
• High-level Software is not as important in DSPs as in GPPs.
– People still write in assembly language for a product to minimize the die area for
ROM in the DSP chip.
Types of DSP Processors
• 32-BIT FLOATING POINT (5% of market):
– TI TMS320C3X, TMS320C67xx
– AT&T DSP32C
– ANALOG DEVICES ADSP21xxx
– Hitachi SH-4
DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf packaged chips
• Synthesizable Cores:
– Map into chosen fabrication process
• Speed, power, and size vary
– Choice of peripherals, etc. (SoC)
– Requires extensive hardware development effort.
DSP ARCHITECTURE
Enabling Technologies
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Device | First sample | Bit size | Clock speed (MHz) | Instruction throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
Uniprocessor based (Harvard architecture):
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3)
Multiprocessor based:
TMS320C80 | 1996 | 32 integer/flt. | | | | 2 GOPS, 120 MFLOP | MIMD
TMS320C62XX | 1997 | 16 integer | | 1600 MIPS | 5 | 20 GOPS | VLIW
TMS320C67XX | 1997 | 32 flt. pt. | | | 5 | 1 GFLOP | VLIW
DSP Applications
• Digital audio applications
– MPEG Audio
– Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
– disk drive servo control
• Military applications:
– radar
– sonar
• Industrial control
• Seismic exploration
• Networking (Telecom infrastructure):
– Wireless
– Base station
– Cable modems
– ADSL
– VDSL
– …...
DSP Applications
DSP Algorithm: System Application
– Speech Coding: Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications
– Speech Encryption: Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications
– Speech Recognition: Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems
– Speech Synthesis: Advanced user interfaces, robotics
– Speaker Identification: Security, multimedia workstations, advanced user interfaces
– High-fidelity Audio: Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
– Modems: Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
– Noise cancellation: Professional audio, advanced vehicular audio, industrial applications
– Audio Equalization: Consumer audio, professional audio, advanced vehicular audio, music
– Ambient Acoustics Emulation: Consumer audio, professional audio, advanced vehicular audio, music
– Audio Mixing/Editing: Professional audio, music, multimedia computers
– Sound Synthesis: Professional audio, music, multimedia computers, advanced user interfaces
– Vision: Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
– Image Compression: Digital photography, digital video, multimedia computers, videoconferencing
– Image Compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
– Beamforming: Navigation, medical imaging, radar/sonar, signals intelligence
– Echo cancellation: Speakerphones, hands-free cellular telephones
– Spectral Estimation: Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
• High-end (cost increases toward the high end)
– Military applications (e.g. radar/sonar)
– Wireless Base Station - TMS320C6000
– Cable modem
– gateways
• Mid-end
– Industrial control
– Cellular phone - TMS320C540
– Fax/ voice server
• Low end (volume increases toward the low end)
– Storage products - TMS320C27 (hard drive controllers)
– Digital camera - TMS320C5000
– Portable phones
– Wireless headsets
– Consumer audio
– Automobiles, toasters, thermostats, ...
DSP range of applications
CELLULAR TELEPHONE SYSTEM
[Block diagram: keypad and display, CONTROLLER, PHYSICAL LAYER PROCESSING, BASEBAND CONVERTER, RF MODEM]
HW/SW/IC PARTITIONING
[Block diagram: the cellular telephone blocks (keypad and display, CONTROLLER, PHYSICAL LAYER PROCESSING, BASEBAND CONVERTER, RF MODEM) partitioned among a MICROCONTROLLER, an ASIC, a DSP, and an ANALOG IC]
Mapping Onto System-on-Chip (SoC)
[Block diagram labels: phone book; keypad intfc; S/P; RAM; µC; DMA; speech quality enhancement; voice recognition; ASIC LOGIC; DSP CORE; RPE-LTP speech decoder; de-intl & decoder; demodulator and synchronizer; Viterbi equalizer]
Example Wireless Phone Organization
[Figure labels: C540 (DSP); ARM7 (µC)]
Multimedia I/O Architecture
[Figure labels: Radio Modem; Embedded Processor; Sched; ECC; Pact; Interface]
Multimedia System-on-Chip (SoC)
e.g. Multimedia terminal electronics
[Figure labels: Graphics Out; Uplink Radio; Video I/O; Pen In; Coms; specific algorithms and I/O; Memory; DSP]
DSP Algorithm Format
DSP Algorithm Notation
• Uses “flowchart” notation instead of equations
• Multiply is drawn as a node marked X
• Add is drawn as a node marked +
• Delay/Storage is drawn as a box marked Delay, z^-1, or D
Typical DSP Algorithm:
Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal quality by
removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n)$$
where
– x is the input sequence
– y is the output sequence
– h is the impulse response (filter coefficients)
– N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and impulse
response.
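The FIR sum above can be sketched directly in code. This is a minimal illustration, not a DSP-grade implementation: the function name `fir_filter` is made up for this example, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(h, x):
    """Direct-form FIR: y(i) = sum over k of h(k) * x(i - k).

    Assumption for this sketch: x(j) = 0 for j < 0.
    """
    n_taps = len(h)
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(n_taps):
            if i - k >= 0:
                acc += h[k] * x[i - k]  # one multiply-accumulate (MAC) per tap
        y.append(acc)
    return y


# 4-tap moving average applied to a unit step input
print(fir_filter([0.25, 0.25, 0.25, 0.25], [1.0] * 6))
```

The inner loop body is exactly one multiply-accumulate, which is why MAC throughput dominates DSP performance for this class of algorithm.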
Typical DSP Algorithms:
Finite-impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter “Tap” is a multiply-add
• Each tap (N taps total) nominally requires:
FINITE-IMPULSE RESPONSE (FIR) FILTER
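[Figure: tapped delay line. The input X passes through a chain of z^-1 delays; each tap output is multiplied by a coefficient h0, h1, ..., hN-2, hN-1, and the products are summed. One multiply-add stage is labeled "A Tap".]
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
Goal: at least 1 FIR Tap / DSP instruction cycle
DSP must meet application signal sampling rate computational requirements: A faster DSP is overkill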
Sample Computational Rates
for FIR Filtering
Signal type Frequency # taps Performance
A 1-D FIR has n_op = 2N and a 2-D FIR has n_op = 2N^2 (OP = Operation).
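As a rough illustration of how the operation count translates into a required computational rate, the sketch below multiplies the per-sample operation count by the sampling rate. The function name and the example numbers (48 kHz, 256 taps) are hypothetical and are not taken from the slide's table.

```python
def fir_ops_per_second(sample_rate_hz, n_taps, dims=1):
    """Operations per second needed for real-time FIR filtering.

    Uses n_op = 2N for a 1-D filter and n_op = 2N^2 for a 2-D filter
    (one multiply plus one add per tap), evaluated once per input sample.
    """
    n_op = 2 * n_taps if dims == 1 else 2 * n_taps ** 2
    return sample_rate_hz * n_op


# Hypothetical example values, chosen only for illustration
print(fir_ops_per_second(48_000, 256))        # 1-D filter
print(fir_ops_per_second(48_000, 8, dims=2))  # 2-D filter
```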
Typical DSP Algorithms:
Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$
• Output sequence depends on input sequence, previous
outputs, and impulse response.
• Both FIR and IIR filters
– Require vector dot product (multiply-accumulate) operations
– Use fixed coefficients
• Adaptive filters update their coefficients to minimize the
distance between the filter output and the desired signal.
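The IIR difference equation can be sketched the same way as the FIR case, with an extra feedback sum over previous outputs. This is only an illustrative sketch: the name `iir_filter` is invented here, `a[0]` is an unused placeholder so that indices match the equation, and all samples before time zero are assumed to be zero.

```python
def iir_filter(a, b, x):
    """y(i) = sum_{k=1}^{M-1} a(k) y(i-k) + sum_{k=0}^{N-1} b(k) x(i-k).

    a[0] is ignored so that a[k] lines up with the equation's a(k).
    Assumption for this sketch: x(j) = y(j) = 0 for j < 0.
    """
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(len(b)):          # feed-forward part (inputs)
            if i - k >= 0:
                acc += b[k] * x[i - k]
        for k in range(1, len(a)):       # feedback part (previous outputs)
            if i - k >= 0:
                acc += a[k] * y[i - k]
        y.append(acc)
    return y


# One-pole filter: the impulse response decays by half each sample and never ends
print(iir_filter([0.0, 0.5], [1.0], [1.0, 0.0, 0.0, 0.0]))
```

Both sums are dot products, so the same MAC hardware serves FIR and IIR filters.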
Typical DSP Algorithms:
Discrete Fourier Transform
• The Discrete Fourier Transform (DFT) allows for spectral
analysis in the frequency domain.
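• It is computed as
$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad \text{for } k = 0, 1, \ldots, N-1,$$
where
$$W_N = e^{-2\pi j / N}, \quad j = \sqrt{-1}$$
– x is the input sequence in the time domain
– y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \quad \text{for } n = 0, 1, \ldots, N-1$$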
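A direct O(N^2) evaluation of the DFT and its inverse is sketched below, assuming the usual convention W_N = e^{-2πj/N} with a 1/N factor on the inverse. The function names are illustrative; a real DSP would use an FFT rather than this direct form.

```python
import cmath


def dft(x):
    """y(k) = sum_n W_N^(n*k) x(n), with W_N = exp(-2*pi*j / N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(w ** (n * k) * x[n] for n in range(n_points))
            for k in range(n_points)]


def idft(y):
    """x(n) = (1/N) sum_k W_N^(-n*k) y(k)."""
    n_points = len(y)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(w ** (-n * k) * y[k] for k in range(n_points)) / n_points
            for n in range(n_points)]


samples = [1.0, 2.0, 3.0, 4.0]
spectrum = dft(samples)
recovered = idft(spectrum)
print([round(abs(v), 6) for v in spectrum])
print([round(v.real, 6) for v in recovered])
```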
Typical DSP Algorithms:
Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used
in image & video compression (e.g. JPEG, MPEG-2).
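• The DCT and Inverse DCT (IDCT) are computed as:
$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \quad \text{for } k = 0, 1, \ldots, N-1$$
$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \quad \text{for } n = 0, 1, \ldots, N-1$$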
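The DCT/IDCT pair can be checked numerically with a short sketch. The scale factor e(k) is not defined in the text above; the code assumes the common choice e(0) = 1/sqrt(2) and e(k) = 1 otherwise, which is the choice that makes the two formulas exact inverses of each other. Function names are illustrative.

```python
import math


def _e(k):
    # Assumed scale factor: 1/sqrt(2) for k = 0, otherwise 1
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct(x):
    """y(k) = e(k) * sum_n cos((2n+1) pi k / 2N) x(n)."""
    n_points = len(x)
    return [_e(k) * sum(math.cos((2 * n + 1) * math.pi * k / (2 * n_points)) * x[n]
                        for n in range(n_points))
            for k in range(n_points)]


def idct(y):
    """x(n) = (2/N) * sum_k e(k) cos((2n+1) pi k / 2N) y(k)."""
    n_points = len(y)
    return [(2.0 / n_points) * sum(_e(k) * math.cos((2 * n + 1) * math.pi * k / (2 * n_points)) * y[k]
                                   for k in range(n_points))
            for n in range(n_points)]


block = [10.0, 12.0, 14.0, 13.0, 11.0, 9.0, 8.0, 8.0]
print([round(v, 6) for v in idct(dct(block))])
```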
DSP BENCHMARKS
• DSPstone: University of Aachen, application benchmarks
– ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
– DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
– FIR, FIR2DIM, HR_ONE_BIQUAD
– LMS, FFT_INPUT_SCALED
Basic Architectural Features of DSPs
• Data path configured for DSP algorithms
– Fixed-point arithmetic (most DSPs)
• Saturation arithmetic (to handle overflow, instead of modulo wrap-around)
– MAC- Multiply-accumulate unit(s)
– Hardware rounding support
• Multiple memory banks and buses -
– Harvard Architecture
– Multiple data memories
(Usually with no data cache, for predictable fast data sample streaming)
• Specialized addressing modes
– Bit-reversed addressing
– Circular buffers
• Specialized instruction set and execution control (to meet real-time signal sampling/processing constraints)
– Zero-overhead loops
– Support for fast MAC
– Fast Interrupt Handling
• Specialized peripherals for DSP
DSP Data Path: Arithmetic
• DSPs dealing with numbers representing real world signals
=> Want “reals”/ fractions
• DSPs dealing with numbers for addresses
=> Want integers
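• Support “fixed point” as well as integers (usually 16-bit):
– Fixed point (fraction): radix point immediately after the sign bit S, so -1 ≤ x < 1
– Integer: radix point after the least significant bit, so -2^(N-1) ≤ x < 2^(N-1)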
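The 16-bit fractional format described above is commonly called Q15. The sketch below shows how a real value in [-1, 1) maps to a 16-bit integer and how a fractional multiply works; the helper names are made up for this example and the multiply truncates rather than rounds.

```python
Q15_ONE = 1 << 15  # scale factor: 1.0 would be 32768, one past the largest Q15 value


def to_q15(value):
    """Convert a real number to a 16-bit Q15 integer, clamping to [-32768, 32767]."""
    raw = int(round(value * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, raw))


def from_q15(raw):
    """Convert a Q15 integer back to a real number in [-1, 1)."""
    return raw / Q15_ONE


def q15_mul(a, b):
    """16 x 16 -> 32-bit product, shifted back to Q15 (truncating).

    Note: (-1.0) * (-1.0) does not fit in Q15; a real DSP would saturate it.
    """
    return (a * b) >> 15


half = to_q15(0.5)
print(half, from_q15(q15_mul(half, half)))
```

The same 16-bit word read with the radix point at the far right is an ordinary integer, which is what address arithmetic uses.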
DSP Data Path: Precision
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating-point DSPs cost 2X - 4X as much as fixed-point DSPs and are slower
• DSP programmers will scale values inside code
– SW Libraries
– Separate explicit exponent
• “Blocked Floating Point” single exponent for a group of fractions
• Floating-point support simplifies development for high-end DSP applications.
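Block floating point, mentioned above, can be sketched as follows: one exponent is chosen from the largest magnitude in a block, and every value is stored as a fixed-point fraction relative to it. This is an illustrative software sketch under assumed conventions (15 fraction bits, hypothetical function names), not a description of any particular DSP library.

```python
import math


def to_block_float(values, frac_bits=15):
    """Return (mantissas, exponent) with value ~= mantissa / 2**frac_bits * 2**exponent."""
    peak = max(abs(v) for v in values)
    # frexp gives peak = m * 2**exponent with 0.5 <= m < 1
    exponent = 0 if peak == 0 else math.frexp(peak)[1]
    scale = 1 << frac_bits
    mantissas = []
    for v in values:
        raw = int(round(v / 2.0 ** exponent * scale))
        mantissas.append(max(-scale, min(scale - 1, raw)))  # clamp to the word range
    return mantissas, exponent


def from_block_float(mantissas, exponent, frac_bits=15):
    """Rebuild real values from the shared exponent and per-sample fractions."""
    return [m / (1 << frac_bits) * 2.0 ** exponent for m in mantissas]


mantissas, exponent = to_block_float([3.0, -1.5, 0.75])
print(mantissas, exponent, from_block_float(mantissas, exponent))
```

Small values in a block lose precision relative to the peak, which is the price paid for storing only one exponent.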
DSP Data Path: Overflow
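• DSPs are descended from analog signal processing, where an overdriven signal clips rather than wraps around:
– Modulo (wrap-around) arithmetic is therefore undesirable.
• On overflow, set the result to the most positive value (2^(N-1) - 1) or the most negative value (-2^(N-1)): “saturation”
• Many DSP algorithms were developed in this model, due to the physical nature of signals.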
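The difference between wrap-around and saturation on overflow is easy to see with 16-bit values. The two helper functions below are a small illustrative sketch; their names are not from any DSP toolchain.

```python
def wrap16(value):
    """Modulo (two's-complement wrap-around) result of a 16-bit operation."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def saturate16(value):
    """Clamp to the most positive (2^15 - 1) or most negative (-2^15) 16-bit value."""
    return max(-0x8000, min(0x7FFF, value))


total = 30000 + 10000  # exceeds the 16-bit maximum of 32767
print(wrap16(total), saturate16(total))
```

With wrap-around, a slightly too loud signal flips to a large negative value; with saturation it merely clips, which is far less damaging to the output.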
DSP Data Path: Multiplier
• Specialized hardware performs all key arithmetic
operations in 1 cycle, including:
– Shifters
– Saturation
– Guard bits
– Rounding modes
– Multiplication/addition
• 50% of instructions can involve multiplier
=> single cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product
DSP Data Path: Accumulator
• Don’t want overflow or have to scale accumulator
• Option 1: accumulator wider than product: “guard bits”
– Motorola DSP: 24b x 24b => 48b product, 56b Accumulator
• Option 2: shift right and round product before adder
[Figure: Option 1 datapath: Multiplier -> ALU -> Accumulator with guard bits (G). Option 2 datapath: Multiplier -> Shift -> ALU -> Accumulator.]
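A quick calculation shows what the guard bits in Option 1 buy. With g guard bits, 2^g worst-case products can be summed before the accumulator can overflow. The sketch below checks this for the 48-bit product and 56-bit accumulator quoted above; the helper names are illustrative.

```python
def guard_bits(accumulator_bits, product_bits):
    """Extra accumulator bits beyond the product width."""
    return accumulator_bits - product_bits


def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


g = guard_bits(56, 48)
largest_product = (1 << 47) - 1          # largest signed 48-bit value
worst_case_sum = (1 << g) * largest_product
print(g, 1 << g, fits_signed(worst_case_sum, 56))
```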
DSP Data Path: Rounding
• Even with guard bits, will need to round when storing accumulator
into memory
• 3 DSP standard options (supported in hardware)
• Truncation: chop results
=> biases results down (toward more negative values in two's complement)
• Round to nearest:
< 1/2 round down, >= 1/2 round up (more positive)
=> smaller bias
• Convergent:
< 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0)
=> no bias
IEEE 754 calls this round to nearest even
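The three rounding options can be compared on integers by dropping the low `shift` bits of an accumulator value. This is a software sketch of the behavior, with invented function names; actual DSPs implement these modes in hardware.

```python
def truncate(value, shift):
    """Chop the low bits (arithmetic shift right rounds toward more negative)."""
    return value >> shift


def round_nearest(value, shift):
    """Add half an lsb, then chop; exact ties round up (more positive)."""
    return (value + (1 << (shift - 1))) >> shift


def round_convergent(value, shift):
    """Round to nearest; exact ties go to the result whose lsb is zero."""
    half = 1 << (shift - 1)
    mask = (1 << shift) - 1
    result = (value + half) >> shift
    if (value & mask) == half:  # exactly halfway between two results
        result &= ~1            # force the lsb to zero
    return result


# With shift = 1 these inputs represent 0.5, 1.5, 2.5 and 3.5
ties = [1, 3, 5, 7]
print([truncate(v, 1) for v in ties])
print([round_nearest(v, 1) for v in ties])
print([round_convergent(v, 1) for v in ties])
```

On the tie cases, round-to-nearest always moves up while convergent rounding alternates up and down, which is why its average error cancels.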
Data Path Comparison
DSP Processor General-Purpose Processor
TI 320C54x DSP (1995) Functional Block Diagram
[Block diagram callouts: multiple memory banks and buses; MAC unit]
First Generation DSP µP
Texas Instruments TMS32010 - 1982
Features
TMS32010 BLOCK DIAGRAM
[Block diagram callout: MAC unit]
TMS32010 FIR Filter Code
Micro-architectural impact - MAC
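$$y(n) = \sum_{m=0}^{N-1} h(m)\, x(n-m)$$
(element of finite-impulse response filter computation)
[Figure: MAC datapath. Operands X and Y feed the multiplier (MPY); the product feeds an adder/subtractor (ADD/SUB) whose result is held in the accumulator register (ACC REG).]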
Mapping of the filter onto a DSP execution unit
Infinite-Impulse Response (IIR) Filter
[Figure: IIR filter signal-flow graph with input Xn, output Yn and delay elements D; operations numbered 1-6 are mapped in sequence onto the DSP execution unit.]
MAC Eg. - 320C54x DSP Functional Block Diagram
[Block diagram callouts: multiple memory banks and buses; MAC unit]
DSP Memory
• FIR Tap implies multiple memory accesses
• DSPs require multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
– Instruction repeat buffer: do 1 instruction 256 times
– Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
– Even then may allow programmer to “lock in” instructions into cache
– Option to turn cache into fast program memory
• Usually DSPs have no data caches.
• May have multiple data memories
Conventional "Von Neumann" memory
HARVARD MEMORY ARCHITECTURE in DSP
[Figure: separate PROGRAM MEMORY, X MEMORY and Y MEMORY blocks connected by buses labeled GLOBAL, P DATA, X DATA and Y DATA]
Multiple memory banks and buses
Memory Architecture Comparison
DSP Processor:
• Harvard architecture
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM
[Figure: processor connected to separate program memory and data memory]
General-Purpose Processor:
• Von Neumann architecture
• Typically 1 access/cycle
• Use caches
[Figure: processor connected to a single memory]
E.g. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
[Block diagram callout: multiple memory banks and buses]
E.g. TI 320C62x/67x DSP (1997)
DSP Addressing
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in inner
loop
=> complex addressing is good
• Autoincrement/Autodecrement register indirect
– lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1
– Option to do it before addressing, positive or negative
• “Bit reverse” addressing mode.
• “modulo” or “circular” addressing
=> don’t use normal datapath to calculate fancy addressing modes:
– Use dedicated address generation units
DSP Addressing: FFT
• FFTs start or end with data in butterfly (bit-reversed) order
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
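The index mapping in the table above is obtained by reversing the bits of each index. A minimal sketch follows; the function name is illustrative, and a DSP address generation unit performs this in hardware rather than in a loop.

```python
def bit_reverse(index, n_bits):
    """Reverse the low n_bits bits of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result


# 8-point FFT: 3 address bits
print([bit_reverse(i, 3) for i in range(8)])
```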
BIT REVERSED ADDRESSING
CIRCULAR BUFFERS
Address calculation for DSPs
• Dedicated address
generation units
• Supports modulo and bit
reversal arithmetic
• Often duplicated to calculate
multiple addresses per cycle
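Modulo address arithmetic is what makes a circular buffer cheap: instead of shifting every sample down an FIR delay line, only a pointer moves and wraps around. The class below is a software sketch of that idea with hypothetical names; on a DSP the wrap-around is done by the address generation unit at no instruction cost.

```python
class CircularDelayLine:
    """Fixed-size delay line addressed with modulo (circular) arithmetic."""

    def __init__(self, size):
        self.samples = [0.0] * size
        self.head = 0  # index of the most recent sample

    def push(self, sample):
        """Store a new sample, overwriting the oldest one."""
        self.head = (self.head + 1) % len(self.samples)  # modulo address update
        self.samples[self.head] = sample

    def tap(self, k):
        """Return x(i - k), the sample pushed k steps ago."""
        return self.samples[(self.head - k) % len(self.samples)]


line = CircularDelayLine(4)
for value in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    line.push(value)
print([line.tap(k) for k in range(4)])
```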
Addressing Comparison
DSP Processor:
• Dedicated address generation units
• Specialized addressing modes; e.g.:
– Autoincrement
– Modulo (circular)
– Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor:
• Often, no separate address generation units
• General-purpose addressing modes
DSP Instructions and Execution
• May specify multiple operations in a single instruction
– e.g. A compound instruction may perform:
multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
– Loop an instruction or sequence
– A 0 value in the loop-count register usually means loop the maximum number of times
– If the loop count is computed, a result of 0 must be handled explicitly, since the hardware will not treat it as zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches
DSP Low/Zero Overhead Loops
Example FIR inner loop on TI TMS320C54xx:
• Eliminates a few instructions in loops
• Important in loops with small bodies
Instruction Set Comparison
DSP Processor:
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
• Zero or reduced overhead loops.
General-Purpose Processor:
• General-purpose instructions
• Typically only one operation per instruction
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
[Figure: SoC containing a DSP core, instruction memory, data memory, serial ports, A/D converter and D/A converter]
Specialized DSP peripherals
TI TMS320C203/LC203 BLOCK DIAGRAM
DSP Core Approach - 1995
[Block diagram callout: integrated DSP peripherals]
Summary of Architectural Features of DSPs
• Data path configured for DSP
– Fixed-point arithmetic
– MAC- Multiply-accumulate
• Multiple memory banks and buses -
– Harvard Architecture
– Multiple data memories
– Dedicated address generation units
• Specialized addressing modes
– Bit-reversed addressing
– Circular buffers
• Specialized instruction set and execution control
– Zero-overhead loops
– Support for MAC
• Specialized peripherals for DSP
(or algorithm driven, DSP algorithms in this case)
DSP Software Development Considerations
• Different from general-purpose software development:
– Resource-hungry, complex algorithms.
– Specialized and/or complex processor architectures.
– Severe cost/storage limitations.
– Hard real-time constraints.
– Optimization is essential.
– Increased testing challenges.
• Essential tools:
– Assembler, linker.
– Instruction set simulator.
– HLL Code generation: C compiler.
– Debugging and profiling tools.
• Increasingly important:
– DSP Software libraries.
– Real-time operating systems.
Classification of Current DSP Architectures
• Modern Conventional DSPs:
– Similar to the original DSPs of the early 1980s
– Single instruction/cycle. Example: TI TMS320C54x
– Complex instructions/Not compiler friendly
• Multiple-Issue DSPs:
– VLIW Example: TI TMS320C62xx, TMS320C64xx
• Simpler (RISC-like, fixed-width) instructions than conventional DSPs, so more instructions and more instruction bandwidth are needed
• More compiler friendly - Higher cost/power
• SIMD instructions support added to recent DSPs of this class
– Superscalar, Example: LSI Logic ZSP400, ZSP500
A Conventional DSP:
TI TMS320C54xx
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
– 2-3 synch. Serial ports, parallel port
– Bit I/O, Timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8V, 100 MHz).
A Current Conventional DSP:
TI TMS320C54xx
An Enhanced Conventional DSP:
TI TMS320C55xx
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx
family, but adds significant enhancements to the architecture and
instruction set, including:
– Two instructions/cycle
• Instructions are scheduled for parallel execution by the assembly programmer or
compiler.
– Two MAC units.
• Complex, compound instructions:
– Assembly source code compatible with C54xx
– Mixed-width instructions: 8 to 48 bits.
– 200 MHz @ 1.5 V, ~130 mW , $17 qty 10k
• Poor compiler target.
An Enhanced Conventional DSP:
TI TMS320C55xx
[Block diagram callout: 2 MAC units]
16-bit Fixed-Point VLIW DSP:
TI TMS320C6201 Revision 2 (1997)
The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture.
[Block diagram labels: Program Cache / Program Memory (32-bit address, 256-bit data, 512K bits RAM); C6201 CPU Megamodule; Pwr]
TI TMS320C62xx Internal Memory
Architecture
• Separate Internal Program and Data Spaces
• Program
– 16K 32-bit instructions (2K Fetch Packets)
– 256-bit Fetch Width
– Configurable as either
• Direct Mapped Cache or Memory Mapped Program Memory
• Data
– 32K x 16
– Single Ported Accessible by Both CPU Data Buses
– 4 x 8K 16-bit Banks
• 2 Possible Simultaneous Memory Accesses (4 Banks)
• 4-Way Interleave, Banks and Interleave Minimize Access Conflicts
TI TMS320C62xx Datapaths
[Figure: two datapaths. Register file A (A0 - A15) feeds functional units L1, S1, M1, D1; register file B (B0 - B15) feeds L2, S2, M2, D2. Cross paths 1X and 2X link the two sides; DDATA_I1 and DDATA_I2 carry load data. Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
TI TMS320C62xx Functional Units
• L-Unit (L1, L2)
– 40-bit Integer ALU, Comparisons
– Bit Counting, Normalization
(Statically Scheduled)
TI TMS320C62xx Instruction Packing
Instruction Packing Advanced VLIW (Statically Scheduled VLIW)
• Fetch Packet
– CPU fetches 8 instructions/cycle
• Execute Packet
– CPU executes 1 to 8 instructions/cycle
– Fetch packets can contain multiple execute packets
[Figure, Example 1: a fetch packet holding instructions A B C D E F G H is issued as several execute packets (packets recovered from the figure: A; B; F G H)]
TI TMS320C62xx Pipeline Operation
Pipeline Phases: Fetch (PG PS PW PR), Decode (DP DC), Execute (E1 E2 E3 E4 E5)
• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
– PG Program Address Generate
– PS Program Address Send
– PW Program Access Ready Wait
– PR Program Fetch Packet Receive
• Decode
– DP Instruction Dispatch
– DC Instruction Decode
• Execute
– E1 - E5 Execute 1 through Execute 5
[Figure: pipeline diagram in which successive execute packets (up to Execute Packet 7) each pass through PG PS PW PR DP DC E1 E2 E3 E4 E5, each packet starting one cycle after the previous one]
C62x Pipeline Operation
Delay Slots
• Delay Slots: number of extra cycles until result is:
– written to register file
– available for use by subsequent instructions
– Multi-cycle NOP instruction can fill delay slots while minimizing code size impact
(Statically Scheduled VLIW)
C6000 Instruction Set Features
Conditional Instructions
C6000 Instruction Set Addressing
Features
• Load-Store Architecture
• Two Addressing Units (D1, D2)
• Orthogonal
– Any Register can be used for Addressing or Indexing
• Signed/Unsigned Byte, Half-Word, Word, Double-Word Addressable
– Indexes are Scaled by Type
• Register or 5-Bit Unsigned Constant Index
C6000 Instruction Set Addressing
Features
• Indirect Addressing Modes
– Pre-Increment *++R[index]
– Post-Increment *R++[index]
– Pre-Decrement *--R[index]
– Post-Decrement *R--[index]
– Positive Offset *+R[index]
– Negative Offset *-R[index]
• 15-bit Positive/Negative Constant Offset from Either B14 or
B15
• Circular Addressing
– Fast and Low Cost: Power of 2 Sizes and Alignment
– Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes
• Dual Endian Support
FIR Filter On TMS320C54xx vs. TMS320C62xx
TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of
Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as
many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the
TMS320C62xx, and, among other enhancements, adds significant
SIMD/media processing capabilities:
– 8-bit operations for image/video processing.
• Introduced at 600 MHz clock speed (1 GHz now), but:
– 11-stage pipeline with long latencies
– Dynamic caches.
• $100 qty 10k.
• The only DSP family with compatible fixed and floating-point versions.
C64xx (also C62xx and C67xx) VLIW DSPs have higher memory use because their instructions are simpler (RISC-like, fixed-width) than those of conventional DSPs, so more instructions and more instruction bandwidth are needed.
Also VLIW but with variable-length instruction encoding (less memory use than C64xx)
Superscalar DSP:
LSI Logic ZSP400
• A 4-way superscalar dynamically scheduled 16-bit fixed-
point DSP core.
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
– Limited SIMD support.
– MACs can be combined for 32-bit operations.
• Possible Disadvantage:
– Dynamic behavior complicates DSP software development:
• Ensuring real-time behavior
• Optimizing code.
TI not actively improving their flagship
FP DSP (fixed-point more important)