Lecture 4
Introduction to Digital Signal Processors (DSPs)
Dr. Konstantinos Tatas

Outline/objectives
• Identify the most important DSP processor
architecture features and how they relate
to DSP applications
• Understand the types of code appropriate
for DSP implementation
ACOE343 - Embedded Real-Time Processor Systems - 2

Frederick University
What is a DSP?
• A specialized microprocessor for real-time DSP applications
– Digital filtering (FIR and IIR)
– FFT
– Convolution, matrix multiplication, etc.
[Figure: analog input → ADC → digital input → DSP → digital output → DAC → analog output]

Hardware used in DSP
                    ASIC        FPGA      GPP       DSP
Performance         Very high   High      Medium    Medium-high
Flexibility         Very low    High      High      High
Power consumption   Very low    Low       Medium    Low-medium
Development time    Long        Medium    Short     Short

Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate
(MAC) instruction (hardware MAC units)
• Single-Instruction Multiple-Data (SIMD) and Very Long Instruction Word (VLIW) architectures
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA
Harvard Architecture
• Physically separate memories and paths for instructions and data
[Figure: CPU connected to a separate DATA MEMORY and PROGRAM MEMORY]

Single-Cycle MAC unit
• Can compute a sum of n products in n cycles
[Figure: MAC datapath. A multiplier forms a_i x_i; an adder adds it to the running sum a_{i-1} x_{i-1} + … held in a register, accumulating $\sum_{i=0}^{n} a_i x_i$]
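The sum of products above can be sketched in C as a loop with one multiply-accumulate per iteration. This is a minimal illustration, not vendor code: the function name `mac_sum` and the 16-bit sample / 32-bit accumulator widths are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of products: each loop iteration is one multiply-accumulate (MAC).
 * On a DSP with a hardware MAC unit, the multiply and the add into the
 * accumulator would typically complete in a single cycle. */
int32_t mac_sum(const int16_t *a, const int16_t *x, size_t n)
{
    assert(a != NULL && x != NULL);
    int32_t acc = 0;                    /* plays the role of the accumulator register */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * x[i];    /* acc = acc + a[i] * x[i] */
    }
    return acc;
}
```

For example, with a = {1, 2, 3} and x = {4, 5, 6}, `mac_sum(a, x, 3)` accumulates 4 + 10 + 18. The sketch does not guard against accumulator overflow for long inputs.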

Single Instruction - Multiple Data
(SIMD)
• A technique for data-level parallelism by
employing a number of processing
elements working in parallel
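A rough way to picture data-level parallelism in plain C is to group a loop into blocks of independent elements that all receive the same operation. The sketch below assumes 4 processing elements (`LANES`) and a hypothetical function name `vec_add`; real SIMD hardware would execute each block as one instruction rather than as an inner loop.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed number of processing elements */

/* Element-wise addition written so the data-level parallelism is explicit:
 * the inner loop applies the same operation (+) to LANES independent
 * elements, which SIMD hardware could execute as one instruction. */
void vec_add(int16_t *out, const int16_t *b, const int16_t *c, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            out[i + lane] = (int16_t)(b[i + lane] + c[i + lane]);
        }
    }
    for (; i < n; i++) {              /* leftover elements, one at a time */
        out[i] = (int16_t)(b[i] + c[i]);
    }
}
```

The key property is that no element of the result depends on any other element, and every lane performs the same operation.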

Very Long Instruction Word (VLIW)
• A technique for instruction-level parallelism by executing instructions without dependencies (known at compile-time) in parallel
• Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
[Figure: one VLIW instruction feeding four processing units (PU) in parallel: a, b → F; e, g → c; x, y → d; z, h → w]
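The "no dependencies" condition can be made concrete with a small checker. This is a simplified sketch under stated assumptions: operands are single-character names, the `Op` type and the functions `depends` and `can_pack` are hypothetical, and the test is conservative (it also rejects write-after-read pairs, which some real VLIW machines allow within one instruction). It ignores how many units of each type are available.

```c
#include <stdbool.h>
#include <stddef.h>

/* One three-address operation: dst = src1 op src2 (names are single chars). */
typedef struct {
    char dst, src1, src2;
} Op;

/* True if op `later` depends on op `earlier` through a shared name:
 * read-after-write, write-after-write, or write-after-read. */
static bool depends(Op earlier, Op later)
{
    bool raw = later.src1 == earlier.dst || later.src2 == earlier.dst;
    bool waw = later.dst == earlier.dst;
    bool war = later.dst == earlier.src1 || later.dst == earlier.src2;
    return raw || waw || war;
}

/* True if all n operations are mutually independent and could therefore
 * share one VLIW instruction (ignoring how many units of each type exist). */
bool can_pack(const Op *ops, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (depends(ops[i], ops[j]))
                return false;
    return true;
}
```

Applied to the four operations of the example instruction (F=a+b; c=e/g; d=x&y; w=z*h), `can_pack` returns true because no operation reads or writes a name used by another.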

CISC vs. RISC vs. VLIW

Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each split into a number of phases (cycles):
– Fetch
    • Program Address Generate (PG)
    • Program Address Send (PS)
    • Program Ready Wait (PW)
    • Program Receive (PR)
– Decode
    • Dispatch (DP)
    • Decode (DC)
– Execute
    • 6 to 10 phases

Saturation Arithmetic
• Results are confined to a fixed range for operations like addition and multiplication
• Instead of wrapping around, overflow and underflow produce the maximum and minimum allowed values, respectively
• Associativity and distributivity no longer apply
• Examples with 1-byte signed saturation arithmetic (range -128 to 127):
• 64 + 69 = 127
• -127 – 5 = -128
• (64 + 70) – 25 = 102 ≠ 64 + (70 – 25) = 109
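One way to emulate signed one-byte saturation in C is to compute in a wider type and clamp the result. The helper names below (`saturate8`, `sat_add8`, `sat_sub8`, `sat_mul8`) are hypothetical; a DSP would provide this behaviour in hardware.

```c
#include <stdint.h>

/* Clamp a wider intermediate result into the signed 8-bit range. */
static int8_t saturate8(int v)
{
    if (v > INT8_MAX) return INT8_MAX;   /*  127 */
    if (v < INT8_MIN) return INT8_MIN;   /* -128 */
    return (int8_t)v;
}

/* Saturating add, subtract and multiply on signed bytes. */
int8_t sat_add8(int8_t a, int8_t b) { return saturate8((int)a + b); }
int8_t sat_sub8(int8_t a, int8_t b) { return saturate8((int)a - b); }
int8_t sat_mul8(int8_t a, int8_t b) { return saturate8((int)a * b); }
```

With these helpers, `sat_add8(64, 69)` gives 127 and `sat_sub8(-127, 5)` gives -128. Grouping matters: `sat_sub8(sat_add8(64, 70), 25)` gives 102, because the inner sum is clamped to 127 first, while `sat_add8(64, sat_sub8(70, 25))` gives 109.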

Examples
• Perform the following operations using
one-byte saturation arithmetic
• 0x77 + 0x99 =
• 0x4*0x42=
• 0x3*0x51=

Zero Overhead Looping
• Hardware support for loops with a
constant number of iterations using
hardware loop counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling

Hardware Circular Addressing
• A data structure implementing a fixed-length queue of fixed-size objects, where objects are added to the head of the queue while items are removed from the tail of the queue
• Requires at least 2 pointers (head and tail)
• Extensively used in digital filtering:
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: circular buffer with Head and Tail pointers holding X[n], X[n-1], X[n-2], X[n-3], shown at Cycle 1 and Cycle 2]
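The filtering use case can be sketched in software as follows. This is an illustrative emulation, not a description of any specific DSP: `CircBuf`, `circ_push`, `fir_output` and the buffer length of 8 are assumptions. Because the buffer is always full in this use, the sketch keeps only a head index; the tail is implicitly the slot that the next push overwrites.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8   /* assumed buffer size; holds the BUF_LEN most recent samples */

typedef struct {
    int16_t data[BUF_LEN];
    size_t  head;            /* index of the newest sample x[n] */
} CircBuf;

/* Store a new sample at the head, overwriting the oldest one. The modulo
 * wrap-around is what hardware circular addressing performs for free. */
void circ_push(CircBuf *cb, int16_t sample)
{
    cb->head = (cb->head + 1) % BUF_LEN;
    cb->data[cb->head] = sample;
}

/* y[n] = a[0]*x[n] + a[1]*x[n-1] + ... + a[k]*x[n-k], with taps = k + 1. */
int32_t fir_output(const CircBuf *cb, const int16_t *a, size_t taps)
{
    assert(taps <= BUF_LEN);
    int32_t acc = 0;
    size_t idx = cb->head;
    for (size_t i = 0; i < taps; i++) {
        acc += (int32_t)a[i] * cb->data[idx];
        idx = (idx + BUF_LEN - 1) % BUF_LEN;   /* step back to the next older sample */
    }
    return acc;
}
```

Each new input sample costs one `circ_push` and no data movement; without circular addressing, every sample in the delay line would have to be shifted by one position per input.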

Direct Memory Access (DMA)
• The feature that allows peripherals to access
main memory without the intervention of the
CPU
• Typically, the CPU initiates DMA transfer, does
other operations while the transfer is in
progress, and receives an interrupt from the
DMA controller once the operation is complete.
• Can create cache coherency problems (the data
in the cache may be different from the data in
the external memory after DMA)
• Requires a DMA controller

Cache memory
• Separate instruction and data L1 caches
(Harvard architecture)
• Cache coherence protocols required,
since most systems use DMA

DSP vs. Microcontroller
• DSP
– Harvard architecture
– VLIW/SIMD (parallel execution units)
– No bit-level operations
– Hardware MACs
– DSP applications
• Microcontroller
– Mostly von Neumann architecture
– Single execution unit
– Flexible bit-level operations
– No hardware MACs
– Control applications

Examples
• Estimate how long the following code fragment will take to execute on:
– A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication, and 1 cycle for addition
– A DSP running at 500 MHz, with zero overhead looping, 6 independent ALUs, and 2 independent single-cycle MAC units
for (i = 0; i < 8; i++)
{
    a[i] = 2*i + 3;
    b[i] = 3*i + 5;
}
Review Questions
• Which of the following code fragments is appropriate for SIMD implementation?
Fragment 1:
a[0]=b[0]+c[0];
a[2]=b[2]+c[2];
a[4]=b[4]+c[4];
a[6]=b[6]+c[6];
Fragment 2:
a[0]=b[0]&c[0];
a[0]=b[0]%c[0];
a[0]=b[0]+c[0];
a[0]=b[0]/c[0];
• Can the following instructions be merged into one VLIW instruction? If not, how many VLIW instructions are needed?
– a=b+c;
– d=c/e;
– f=d&a;
– g=b%c;

Review Questions
• Which of the following is not a typical DSP
feature?
– Dedicated multiplier/MAC
– Von Neumann memory architecture
– Pipelining
– Saturation arithmetic
• Which implementation would you choose for
lowest power consumption?
– ASIC
– FPGA
– General-Purpose Processor
– DSP
Examples
• How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions/word? How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the TMS320C6x pipeline from the Pipelining slide with 6 phases of execution?
ADD a1,a2,a3 ;a3 = a1+a2
SUB b1,b3,b4 ;b4 = b1-b3
MUL a2,a3,a5 ;a5 = a2*a3
MUL b3,b4,b2 ;b2 = b3*b4
AND a7,a0,a1 ;a1 = a7 AND a0
MUL a3,a4,a5 ;a5 = a3*a4
OR a6,a3,a2 ;a2 = a6 OR a3
References
• R. Chassaing, “DSP Applications Using C and the TMS320C6x DSK”, Wiley, 2002
• Texas Instruments, TMS320C64x datasheets
• Analog Devices, ADSP-21xx Processors

