
2016/2017

Program: Engineering degree (ING) in Electrical Engineering

Semester: S4

Higher School of Technical Education


Electrical Engineering Department
Rabat, Morocco

Module component:
Signal Processing and Implementation on DSP
Digital Signal Processing and Applications with the
TMS320C6713 DSK

Prof A.JBARI
atmjbari@gmail.com

Pr. A.JBARI DSP implementation 1


Objectives

• Become familiar with:
– DSP basics
– TMS320C6713 floating-point DSP architecture
– TMS320C6713 DSP Starter Kit (DSK)
– Code Composer Studio integrated development environment (IDE)
– MATLAB design and analysis tools
• Learn how to program the C6713:
– Writing and compiling code
– Fixing errors
– Downloading code to the target and executing it
– Debugging
• Write and run useful programs on the C6713 DSK
• Learn about DSP applications
• Learn where to find help



Bibliography

• Books:
– Digital Signal Processing Using MATLAB®, Third Edition, Vinay K. Ingle and John G. Proakis, Cengage Learning, 200 First Stamford Place, Suite 400, Stamford, CT 06902, USA.
– Digital Signal Processing: Fundamentals and Applications, Li Tan, DeVry University, Decatur, Georgia, Copyright 2008, Elsevier Inc.
– Digital Signal Processing and Applications with the C6713 and C6416 DSK, Rulph Chassaing, Worcester Polytechnic Institute, Copyright © 2005 by John Wiley & Sons, Inc.
– DSP Applications Using C and the TMS320C6x DSK, Rulph Chassaing, Copyright © 2002 by John Wiley & Sons, Inc.
• Web documents and links:
– http://www.analog.com/media/en/technical-documentation/data-sheets/ADSP-2101_2103_2105_2115.pdf?doc=AD7475_7495.pdf
– ftp://ftp.analog.com/pub/cftl/ADI%20Classics/Mixed%20Signal%20and%20DSP%20Design%20Techniques,%202000/Section_7_DSP_Hardware.pdf
– dspworkshop_part1_2007.pdf, dspworkshop_part2_2007.pdf
– …



Outline
1. Data representation
2. Digital Signal Processing
a) Digital systems and fundamentals
b) Fast Fourier Transform (FFT)
c) Digital filter architectures
3. DSP Architectures
4. Programming the TMS320C6713 DSP
5. Applications



1. Data representation

Integer representation

• Unsigned integers: can represent zero and positive integers.
• Signed integers: can represent zero, positive, and negative integers. The most significant bit (MSB) is called the sign bit: 0 for positive integers and 1 for negative integers.

Representation                     Positive N         Negative -N
Sign-magnitude representation      MSB = 0, then N    MSB = 1, then magnitude of N
1's complement representation      MSB = 0, then N    MSB = 1, 1's complement of N
2's complement representation      MSB = 0, then N    MSB = 1, 2's complement of N
Excess (or bias) representation    N + B              -N + B        (B is the bias)
Integer representation

Example: 27 = 00011011 (8 bits)

Representation                     Positive          Negative
Sign-magnitude representation      27 = 00011011     -27 = 10011011
1's complement representation      27 = 00011011     -27 = 11100100
2's complement representation      27 = 00011011     -27 = 11100101
Excess (or bias) representation,   27 + 63 = 90      -27 + 63 = 36
bias B = 63                        = 01011010        = 00100100
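The four encodings of ±27 can be checked with a short sketch. The function names and the 8-bit default width are illustrative assumptions, not part of the course material; the bias of 63 is taken from the example.

```python
def sign_magnitude(n, bits=8):
    """Sign bit (1 for negative) followed by the magnitude of n."""
    sign = 1 if n < 0 else 0
    return format((sign << (bits - 1)) | abs(n), f"0{bits}b")

def ones_complement(n, bits=8):
    """Non-negative values are unchanged; negatives invert every bit of |n|."""
    mask = (1 << bits) - 1
    return format(n if n >= 0 else (~(-n)) & mask, f"0{bits}b")

def twos_complement(n, bits=8):
    """Masking to the word width yields the two's-complement bit pattern."""
    return format(n & ((1 << bits) - 1), f"0{bits}b")

def excess(n, bias=63, bits=8):
    """Excess (bias) code: the stored pattern is n + bias."""
    return format(n + bias, f"0{bits}b")

for name, encode in [("sign-magnitude", sign_magnitude),
                     ("1's complement", ones_complement),
                     ("2's complement", twos_complement),
                     ("excess-63", excess)]:
    print(f"{name:15s} +27 = {encode(27)}   -27 = {encode(-27)}")
```

Only the negative column differs between the first three encodings, which is why the positive column of the table repeats the same pattern.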



Integer representation

• Types: computers can manipulate integers of various lengths (formats). The ranges below are for signed 2's complement values. In C and C++ the exact widths are implementation-defined, so the C/C++ entries give typical sizes.

Format      Type                                     Range
8 bits      byte (Java); signed char (C, C++)        [-128, 127]
16 bits     short (Java, C, and C++)                 [-32768, 32767]
32 bits     int (Java, C, and C++)                   [-2^31, 2^31 - 1]
64 bits     long (Java); long long (C and C++)       [-2^63, 2^63 - 1]
128 bits    no standard type in Java, C, or C++      [-2^127, 2^127 - 1]


Sign extension

 Sign extension (widening conversion):

– 8 bit 2s compl. repr. for 7 is: 00000111

– 16 bit 2s compl. repr. for 7 is: 0000000000000111

– 8 bit 2s compl. repr. for -7 is: 11111001

– 16 bit 2s compl. repr. for -7 is: 1111111111111001
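As a minimal sketch of the widening conversion above, the sign bit can be replicated into the new upper bits. The helper name `sign_extend` is an assumption made for illustration.

```python
def sign_extend(pattern, from_bits, to_bits):
    """Widen a two's-complement bit pattern by replicating its sign bit."""
    sign_bit = 1 << (from_bits - 1)
    # XOR-then-subtract reinterprets the unsigned pattern as a signed value.
    signed = (pattern ^ sign_bit) - sign_bit
    # Masking to the wider width produces the extended bit pattern.
    return signed & ((1 << to_bits) - 1)

print(format(sign_extend(0b00000111, 8, 16), "016b"))  # 7 widened to 16 bits
print(format(sign_extend(0b11111001, 8, 16), "016b"))  # -7 widened to 16 bits
```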



Fixed-point representation

• To define a fixed-point type conceptually, all we need are two parameters: the width of the number representation and the position of the binary point within the number.
Real number = integer part . fractional part
• Notation: fixed<w,b>, where w denotes the total number of bits (the width of the number) and b denotes the position of the binary point, counted from the least significant bit (starting at 0). In other words, b is the number of fractional bits.
• Examples: N = 15.75
– Format fixed<8,3>: 01111110 (that is, 01111.110)
– Format fixed<32,16>: 00000000000011111100000000000000

Pr. A.JBARI DSP implementation 9
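The fixed<w,b> examples can be reproduced by scaling the real value by 2^b and rounding. This sketch assumes a signed two's-complement word, which the slide does not state; `to_fixed` and `from_fixed` are hypothetical helper names.

```python
def to_fixed(x, w, b):
    """Encode x as fixed<w,b>: w total bits, b fractional bits (assumed signed)."""
    raw = round(x * (1 << b))                      # scale by 2^b
    if not -(1 << (w - 1)) <= raw < (1 << (w - 1)):
        raise OverflowError(f"{x} does not fit in fixed<{w},{b}>")
    return format(raw & ((1 << w) - 1), f"0{w}b")

def from_fixed(bit_string, b):
    """Decode a fixed-point bit string with b fractional bits back to a float."""
    w = len(bit_string)
    raw = int(bit_string, 2)
    if raw >= 1 << (w - 1):                        # negative in two's complement
        raw -= 1 << w
    return raw / (1 << b)

print(to_fixed(15.75, 8, 3))
print(to_fixed(15.75, 32, 16))
print(from_fixed("01111110", 3))
```

With 8 bits and 3 fractional bits the resolution is 1/8, so any value that is not a multiple of 0.125 is rounded to the nearest representable step.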


Floating-point representation
• A floating-point number is typically expressed in scientific notation with a fraction F (the mantissa) and an exponent E of a certain radix r, in the form N = F × r^E:
– decimal numbers use a radix of 10 (F × 10^E);
– binary numbers use a radix of 2 (F × 2^E).
• Examples: N = 48
– r = 10: N = 4.8 × 10^1, so F = 4.8 and E = 1
– r = 2: N = 1.5 × 2^5, so F = 1.5 and E = 5



Floating-point representation
IEEE-754 32-bit single-precision floating-point numbers
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 8 bits represent the exponent (E), stored in excess-127 form (bias of 127).
The remaining 23 bits represent the fraction (F).

Example: N = -48 = -1.5 × 2^5, so S = 1, E = 5 + 127 = 132, and F = .5
= 1 10000100 10000000000000000000000
= C2400000h

The value (N) is calculated as follows:

Normalized form: for 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127).
Denormalized form: for E = 0, N = (-1)^S × 0.F × 2^(-126).
Special values: for E = 255, N represents special values such as ±INF (infinity) and NaN (not a number).
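The single-precision example can be checked with Python's standard `struct` module, which packs a float into its IEEE-754 bit pattern. The helper names below are illustrative assumptions, and the decoder handles only the normalized case.

```python
import struct

def float32_fields(x):
    """Return (32-bit word, sign S, biased exponent E, 23-bit fraction F)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    return word, sign, exponent, fraction

def decode_normalized(sign, exponent, fraction):
    """N = (-1)^S x 1.F x 2^(E-127), valid for 1 <= E <= 254."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

word, s, e, f = float32_fields(-48.0)
print(f"{word:08X}  S={s}  E={e:08b}  F={f:023b}")
print(decode_normalized(s, e, f))
```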
Floating-point representation
IEEE-754 64-bit double-precision floating-point numbers
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 11 bits represent the exponent (E), stored in excess-1023 form (bias of 1023).
The remaining 52 bits represent the fraction (F).

The value (N) is calculated as follows:

Normalized form: for 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
Denormalized form: for E = 0, N = (-1)^S × 0.F × 2^(-1022).
Special values: for E = 2047, N represents special values such as ±INF (infinity) and NaN (not a number).
Big Endian vs. Little Endian

• Modern computers store one byte of data in each memory address or location, i.e., memory is byte-addressable. A 32-bit integer is therefore stored in 4 memory addresses.
• The term "endian" refers to the order in which bytes are stored in memory. In the "big endian" scheme, the most significant byte is stored first, in the lowest memory address ("big end first"), while "little endian" stores the least significant byte in the lowest memory address.
• Examples:
– The 32-bit integer 12345678H (305419896 in decimal) is stored as:
  in big endian: 12H 34H 56H 78H
  in little endian: 78H 56H 34H 12H
– The 16-bit byte sequence 00H 01H is interpreted as 0001H in big endian and as 0100H in little endian.
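Both examples can be reproduced with the standard `struct` module and `int.from_bytes`; the variable names are illustrative.

```python
import struct

value = 0x12345678

# ">" requests big-endian byte order, "<" little-endian; "I" is a 32-bit unsigned int.
big = struct.pack(">I", value)
little = struct.pack("<I", value)
print(big.hex(" "), "|", little.hex(" "))

# The same two bytes read back under each convention.
raw = bytes([0x00, 0x01])
as_big = int.from_bytes(raw, "big")
as_little = int.from_bytes(raw, "little")
print(f"{as_big:04X}h {as_little:04X}h")
```

The byte order matters whenever data crosses a boundary between devices, for example when a host PC and a DSP exchange binary buffers.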



2. Digital Signal Processing

Analog & digital signals

Analog signal: continuous function V of a continuous variable t (time, space, etc.): V(t).

Digital (sampled) signal: discrete function Vk of a discrete sampling variable tk, with k an integer: Vk = V(tk).

Sampling operation: uniform (periodic) sampling.

Sampling frequency: f_s = 1/t_s



Digital vs analog processing
Digital signal processing (DSP)
Advantages:
– More flexible.
– Often easier system upgrade.
– Data easily stored in memory.
– Better control over accuracy requirements.
– Reproducibility.
– Linear phase.
– No drift with time and temperature.
Limitations:
– A/D converter and signal processor speed: wide-band signals are still difficult to treat (real-time systems).
– Finite word-length effects.
A digital signal processing scheme

• Analog filter: limits the frequency range of analog signals prior to the sampling process and attenuates aliasing distortion (anti-aliasing filter).
• Analog-to-digital conversion (ADC) unit: samples and converts the band-limited signal into a digital signal, which is discrete both in time and in amplitude.
• Digital signal (DS) processor: processes the digital data according to DSP rules such as lowpass, highpass, and bandpass digital filtering, or other algorithms for different applications.
• Digital-to-analog conversion (DAC) unit: converts the processed digital signal to an analog output signal that is continuous in time and discrete in amplitude.
• Reconstruction (anti-image) filter: smooths the DAC output voltage levels back to an analog signal for real-world applications.



Digital signal processing

Digital filtering: the DSP block operates as a simple digital lowpass filter.


Digital signal processing
Signal Spectral (Frequency ) Analysis (Spectrum)



Digital signal processing

Interference Cancellation in Electrocardiography:



Typical DSP applications

• communication systems: modulation/demodulation, channel equalization, echo cancellation
• consumer electronics: perceptual coding of audio and video on DVDs, speech synthesis, speech recognition
• music: synthetic instruments, audio effects, noise reduction
• medical diagnostics: magnetic-resonance and ultrasonic imaging, computer tomography, ECG, EEG, MEG, AED, audiology
• geophysics: seismology, oil exploration
• astronomy: VLBI, speckle interferometry
• experimental physics: sensor-data evaluation
• aviation: radar, radio navigation
• security: steganography, digital watermarking, biometric identification, surveillance systems, signals intelligence, electronic warfare
• engineering: control systems, feature extraction for pattern recognition
Signal Sampling
Sampling process: sample and hold.

Sample-and-hold circuit architectures:
– Open-loop architecture
– Closed-loop architecture with follower output


Signal Sampling
The simplified sampling process multiplies the continuous signal x(t) by a pulse train p(t).

The pulse train:
p(t) = \sum_{n=-\infty}^{+\infty} \delta(t - n t_s)

Sampled signal:
y_s(t) = x(t) \, p(t)

Spectral analysis:
Y_s(f) = X(f) * P(f) = \frac{1}{t_s} \sum_{n=-\infty}^{+\infty} X(f - n f_s)

The sampled signal spectrum is the sum of the scaled original spectrum and copies of its shifted versions, called replicas.
Signal Sampling

Original baseband spectrum and sampled signal spectrum, for a signal whose highest frequency component is f_max:
– f_s > 2 f_max: the baseband spectrum and its replicas are separated
– f_s = 2 f_max: the baseband spectrum and its replicas just touch
– f_s < 2 f_max: the baseband spectrum and its replicas overlap (aliasing)


Signal Sampling

• To obtain exact reconstruction of the original signal spectrum by applying a lowpass reconstruction filter, the following condition must be satisfied:
f_s ≥ 2 f_max
• Shannon sampling theorem:
For a uniformly sampled DSP system, an analog signal can be perfectly recovered as long as the sampling rate is at least twice as large as the highest-frequency component of the analog signal to be sampled.

• Half of the sampling frequency, f_s/2, is usually called the Nyquist frequency (Nyquist limit), or folding frequency.
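The name "folding frequency" can be illustrated numerically: a sinusoid above half the sampling rate produces exactly the same samples as one folded back below it. The function `apparent_frequency` and the 7 kHz / 10 kHz values are assumptions chosen for the illustration.

```python
import math

def apparent_frequency(f, fs):
    """Frequency in [0, fs/2] at which a sinusoid of frequency f appears after sampling at fs."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)   # fold around the Nyquist frequency fs/2

fs = 10_000          # sampling rate in Hz
f_high = 7_000       # violates the sampling theorem: fs < 2 * f_high
f_alias = apparent_frequency(f_high, fs)

# The sampled 7 kHz cosine is indistinguishable from a 3 kHz cosine.
high_samples = [math.cos(2 * math.pi * f_high * n / fs) for n in range(20)]
alias_samples = [math.cos(2 * math.pi * f_alias * n / fs) for n in range(20)]
max_diff = max(abs(a - b) for a, b in zip(high_samples, alias_samples))
print(f_alias, max_diff)
```

This is why the anti-aliasing filter must remove content above f_s/2 before the ADC: once sampled, the two signals cannot be told apart.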



Signal Reconstruction
Recover analog signal from its sampled signal version.

Condition: the spectrum of the sampled signal ys(t) contains the same spectral
content as the original spectrum X( f ).

Figure: recovered signal spectrum.


Digital Systems

• A discrete-time system is a device or algorithm that operates on an input sequence according to some computational procedure.
• It may be:
– a general-purpose computer
– a microprocessor
– dedicated hardware
– a combination of all these


Digital filters

Digital filter: a numerical procedure or algorithm that transforms a given sequence of numbers into a second sequence that has some more desirable properties.

x(n) → [digital filter] → y(n)

A linear time-invariant digital filter can be described by a linear difference equation:

• Finite Impulse Response (FIR) filters:
y(n) = \sum_{k=0}^{N} b_k x(n-k)
H(z) = \sum_{k=0}^{N} b_k z^{-k}

• Infinite Impulse Response (IIR) filters:
y(n) = \sum_{k=0}^{N} b_k x(n-k) - \sum_{k=1}^{M} a_k y(n-k)
H(z) = \frac{\sum_{k=0}^{N} b_k z^{-k}}{1 + \sum_{k=1}^{M} a_k z^{-k}}
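The two difference equations translate almost line for line into code. The sketch below is a direct, unoptimized evaluation that assumes zero initial conditions (x(n) = 0 and y(n) = 0 for n < 0); the function names and the convention that `a[0]` is taken as 1 are illustrative choices.

```python
def fir_filter(b, x):
    """FIR: y(n) = sum over k of b[k] * x(n-k)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]      # one multiply-accumulate per tap
        y.append(acc)
    return y

def iir_filter(b, a, x):
    """IIR: y(n) = sum b[k] x(n-k) - sum over k >= 1 of a[k] y(n-k); a[0] is ignored."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]      # feed-forward part
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]      # feedback part uses past outputs
        y.append(acc)
    return y

impulse = [1.0, 0.0, 0.0, 0.0]
print(fir_filter([0.5, 0.5], impulse))        # impulse response ends after the last tap
print(iir_filter([1.0], [1.0, -0.5], impulse))  # impulse response keeps decaying
```

Feeding an impulse shows the difference in the names: the FIR output returns to zero after the last coefficient, while the feedback term keeps the IIR output alive.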


Digital Hardware implementation
TRANSVERSAL IMPLEMENTATION OF AN FIR FILTER (Tapped Delay Line)

Requirements: N memory locations for storing previous inputs.
Complexity: N+1 multiplications and N additions per output sample.


Digital Hardware implementation
TRANSPOSED FIR FILTER IMPLEMENTATION



Digital Hardware implementation

IIR FILTER DIRECT FORM 1



Digital Hardware implementation
IIR FILTER DIRECT FORM 2



Digital Hardware implementation
Parallel multiplier/accumulator cell fir filter implementation



Discrete Fourier Transform (DFT)

• The DFT provides uniformly spaced samples of the Discrete-Time Fourier Transform (DTFT).
• DFT definition:

X[k] = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi nk/N}
x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{j 2\pi nk/N}

• Requires:
– Complex multiplications: N^2
– Complex additions: N(N-1)
– Real multiplications: 4N^2
– Real additions: 2N(2N-1)
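A direct evaluation of the DFT pair makes the N^2 cost visible: each of the N outputs is a sum over N inputs. The function names `dft` and `idft` are illustrative; the sketch uses only the standard `cmath` module.

```python
import cmath

def dft(x):
    """X[k] = sum over n of x[n] * exp(-j*2*pi*n*k/N), evaluated directly."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum over k of X[k] * exp(+j*2*pi*n*k/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
X = dft(x)
print([round(abs(v), 6) for v in X])        # magnitude spectrum
print([round(v.real, 6) for v in idft(X)])  # round trip recovers x
```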



Discrete Fourier Transform (DFT)

• Total computational complexity (complex operations):

T = N^2 + N(N-1) = 2N^2 - N ≡ O(N^2)

Example: if each operation requires 1 μs:
– N = 1000: T ≈ 2 000 000 operations ≈ 2 s
– N = 5000: T ≈ 50 000 000 operations ≈ 50 s

Although the DFT is a direct way of obtaining the frequency content of a sequence, it requires a large number of complex additions and multiplications.


Faster DFT computation?

• Take advantage of the symmetry and periodicity of the complex exponential (let W_N = e^{-j2\pi/N}):
– Symmetry: W_N^{k(N-n)} = W_N^{-kn} = (W_N^{kn})^*
– Periodicity: W_N^{kn} = W_N^{k(n+N)} = W_N^{(k+N)n}
– Recursion property: W_N^2 = W_{N/2}
• Note that two length-N/2 DFTs take less computation than one length-N DFT: 2(N/2)^2 < N^2.
• Algorithms that exploit these computational savings are collectively called Fast Fourier Transforms.


Decimation-in-Time Algorithm

• FFT: Fast Fourier Transform (Cooley and Tukey, 1965)
• Consider expressing the DFT with even and odd input samples:

X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}
     = \sum_{n \text{ even}} x[n] W_N^{nk} + \sum_{n \text{ odd}} x[n] W_N^{nk}
     = \sum_{r=0}^{N/2-1} x[2r] (W_N^2)^{rk} + W_N^k \sum_{r=0}^{N/2-1} x[2r+1] (W_N^2)^{rk}
     = \sum_{r=0}^{N/2-1} x[2r] W_{N/2}^{rk} + W_N^k \sum_{r=0}^{N/2-1} x[2r+1] W_{N/2}^{rk}


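FFT Algorithm
• The result is the sum of two length-N/2 DFTs:
X[k] = G[k] + W_N^k H[k]
where G[k] is the N/2-point DFT of the even samples and H[k] is the N/2-point DFT of the odd samples.
• Then repeat the decomposition of the N/2-point DFTs into N/4-point DFTs, and so on.
• The cross feed of G[k] and H[k] in the flow diagram is called a “butterfly”, due to its shape. The two branches use the factors W_N^r and W_N^{r+N/2}. Since W_N^{r+N/2} = -W_N^r, the butterfly simplifies to one multiplication by W_N^r, followed by a multiplication by -1 on the lower branch.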
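The even/odd split leads directly to a recursive radix-2 decimation-in-time FFT. The following is a minimal sketch for power-of-two lengths, written for clarity rather than speed; real DSP implementations are iterative and work in place. The function name `fft` is an illustrative choice.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return list(x)
    G = fft(x[0::2])                 # N/2-point DFT of the even samples
    H = fft(x[1::2])                 # N/2-point DFT of the odd samples
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]   # twiddle factor W_N^k times H[k]
        X[k] = G[k] + t              # upper butterfly output
        X[k + N // 2] = G[k] - t     # lower output uses W_N^(k+N/2) = -W_N^k
    return X

spectrum = fft([1, 1, 1, 1, 0, 0, 0, 0])
print([round(abs(v), 4) for v in spectrum])
```

Each loop iteration is one butterfly: a single complex multiplication by the twiddle factor, then one addition and one subtraction.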


FFT Algorithm

For N=8 :





FFT Algorithm

Repeat the same process: divide each N/2-point DFT into
– two N/4-point DFTs
– then combine the outputs


FFT Algorithm

• After two steps of decimation in time:


FFT Algorithm
Flow graph for 8-point decimation in time:

The flow-graph consists of 3 stages


First stage computes the four 2-point DFTs
Second stage computes the two 4-point DFTs
Last stage computes the desired 8-point DFT



FFT Algorithm
How much computation?
For N = 2^M points:
– Total number of stages: M = log2 N
– Number of butterflies per stage: N/2
– Each butterfly: 1 complex multiplication and 2 complex additions
Total computational complexity (complex operations):
T = (3N/2) log2 N ≡ O(N log2 N)

Algorithm    Complex multiplications    Complex additions
DFT          O(N^2)                     O(N(N-1))
FFT          O((N/2) log2 N)            O(N log2 N)


FFT Algorithm



Computation on DSP

• Input and output data
– real data in X memory
– imaginary data in Y memory
• Coefficients (“twiddle” factors)
– cos (real) values in X memory
– sin (imaginary) values in Y memory
• The inverse transform is computed with an exponent sign change and 1/N scaling


Kernels for Digital Signal Processing

• Filtering, convolution: MAC (multiply-accumulate)
• Adaptation: MAD (multiply-add)
• Complex multiplication, FFT:
• Viterbi decoding: ACS (add-compare-select)
• Motion estimation: SAD (sum of absolute differences)


3. DSP architectures and features


General Architectures
– Accumulator architecture
– Memory-register architecture
– Load-store architecture
(Figure labels: register file, on-chip memory.)


Harvard architecture
Von Neumann architecture:
• unified external memory for program and data
• all operands in registers

Harvard architecture (IBM in 1944 at Harvard University):
• separate program and data memories
• operands also in memory
• concurrent access to
  – the instruction word
  – one or several data words

Example: MPYF3 *(AR0)++, *(AR1)++, R0
– MPYF3: instruction from program memory
– *(AR0)++: data from data memory (address in address register AR0)
– *(AR1)++: data from data memory (address in address register AR1)
– R0: store result in data register R0
Classic DSP characteristics
• Explicit parallelism
– Harvard architecture for concurrent data access
– concurrent operations on data and addresses
• Optimized control flow and background processing
– zero-overhead loops
– DMA controllers
• Special addressing modes
– distinction of address, data, and modifier registers
– versatile address computation for indirect addressing
• Specialized instructions
– single-cycle hardware multiplier
– multiply-accumulate instruction (MAC)


Specialized addressing modes

• Many DSPs distinguish address registers from data registers
• Additional ALUs for address computations
– useful for indirect addressing (register points to operand in memory):
  ADDF3 *(AR0)++, R1, R1
– operations on address registers run in parallel with operations on data registers, with no extra cycles
– behavior depends on the instruction and on the contents of special-purpose registers (modifier registers)
• Typical address update functions
– increment/decrement by 1 (AR0++, AR0--)
– increment/decrement by a constant specified in a modifier register (AR0 += MR0, AR0 -= MR5)
– circular addressing (AR0 += 1 if AR0 < upper limit, else AR0 = base address)
– bit-reverse addressing, …


Circular addressing
Goal: implementation of ring buffers in linear address space
– implementation variants
copy data with data access, or
use circular addressing (don’t copy data, wrap pointers)
– supported by addressing modes
data access and move operations
increment operators that wrap around at buffer boundaries
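The "wrap pointers, don't copy data" variant can be sketched in software as follows. On a DSP the wrap is done by the address generator at no cost; here a modulo operation stands in for it. The class name `CircularBuffer` and its methods are assumptions made for the illustration.

```python
class CircularBuffer:
    """Ring buffer in a linear array; the write index wraps at the buffer boundary."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.index = 0                      # plays the role of the address register

    def push(self, sample):
        """Overwrite the oldest sample, then advance the pointer with wrap-around."""
        self.data[self.index] = sample
        self.index = (self.index + 1) % len(self.data)

    def newest_first(self):
        """Return the samples ordered x(n), x(n-1), ..., as an FIR filter reads them."""
        size = len(self.data)
        return [self.data[(self.index - 1 - k) % size] for k in range(size)]

buf = CircularBuffer(3)
for sample in [1.0, 2.0, 3.0, 4.0]:
    buf.push(sample)
print(buf.data, buf.newest_first())
```

No sample is ever moved after it is written: only the index changes, which is what makes the delay line of an FIR filter cheap to maintain.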



Bit-reverse addressing
Goal: accelerate the FFT operation
– very important DSP operation
– transforms signals between time and frequency representations
– basic operation in many DSP algorithms
Address computation:
– mirror the bits of the address (bit reverse)
– other method: add N/2 with reverse-carry arithmetic
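Both ways of generating bit-reversed addresses can be sketched in a few lines: mirroring the index bits, and repeatedly adding N/2 with the carry propagating toward the least significant bit. The function names are illustrative, and N is assumed to be a power of two.

```python
def bit_reverse(index, bits):
    """Mirror the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # shift out the LSB, shift it into the result
        index >>= 1
    return result

def reverse_carry_increment(addr, n):
    """Add n/2 to addr with carries propagating from the MSB toward the LSB."""
    bit = n >> 1
    while bit and addr & bit:
        addr ^= bit        # this position overflows: clear it and carry downward
        bit >>= 1
    return addr | bit      # set the first clear position (or wrap to 0)

N = 8
mirrored = [bit_reverse(i, 3) for i in range(N)]

sequence, addr = [], 0
for _ in range(N):
    sequence.append(addr)
    addr = reverse_carry_increment(addr, N)

print(mirrored, sequence)
```

Both methods give the same input ordering that the 8-point decimation-in-time flow graph requires.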



Zero-overhead loops

Goal
– reduce the overhead of executing loops
– general-purpose processors:
  • initialize loop counter
  • execute loop body
  • check loop exit condition
  • branch to loop start or exit loop
– digital signal processors:
  • initialize loop counter
  • execute loop body
  • the loop exit check and the branch back to the loop start are handled by hardware, at no extra instruction cost

Example: add the first 100 values in array a and store the result in R1 (TMS320C3x-like assembler)

LDI @a, AR0
LDI 0.0, R1
RPTS 99
ADDF3 *(AR0)++, R1, R1

RPTS N executes the next instruction N+1 times, so RPTS 99 performs the 100 additions.


DSP Architecture
• The DSP can fetch the program instruction and data in parallel at the same time.
• The multiplier and accumulator (MAC) is used for the digital filtering operation.
• The shift unit is used for the scaling operation in fixed-point implementations when the processor performs digital filtering.
ARITHMETIC LOGIC UNIT (ALU) FEATURES

■ Add, Subtract, Negate, Increment, Decrement, Absolute Value, AND, OR, Exclusive
OR, NOT
■ Bitwise Operators, Constant Operators
■ Multi-Precision Math Capabilities
■ Divide Primitives
■ Saturation Mode for Overflow Support
■ Background Registers for Single-Cycle Context Switch
■ Example Instructions:
◆ IF EQ AR = AX0 + AY0;
◆ AF = MR1 XOR AY1;
◆ AR = TGLBIT 7 OF AX1;



MULTIPLY-ACCUMULATOR (MAC) FEATURES

■ Single-Cycle Multiply, Multiply-Add, Multiply-Subtract

■ 40-Bit Accumulator for Overflow Protection (219x Adds Second 40-Bit Accumulator)

■ Saturation Instruction Performs Single Cycle Overflow Cleanup

■ Background Registers for Single-Cycle Context Switch

■ Example MAC Instructions:


◆ MR = MX0 * MY0(US);

◆ IF MV SAT MR;

◆ MR = MR - AR * MY1(SS);

◆ MR = MR + MX1 * MY0(RND);

◆ IF LT MR = MX0 * MX0(UU);



Hardware Multiply/Accumulate (MAC) Unit

Specialized data-path for DSP

MAC Instructions
[IF cond] MR|MF
= xop * yop ; Multiply
= MR + xop * yop ; Multiply/Accumulate
= MR – xop * yop ; Multiply/Subtract
= MR ; Transfer MR
= 0 ; Clear
IF MV SAT MR ; Conditional MR Saturation
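The role of the wide accumulator and of the saturation step can be illustrated with a simplified integer model. This is only a sketch under stated assumptions: 16-bit signed operands, a 40-bit accumulator, and saturation to 32 bits. It does not model fractional mode, rounding, or the exact flag behavior of any particular device, and the function names are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(xs, ys, acc_bits=40):
    """Multiply-accumulate: acc = acc + x * y for each pair, in a 40-bit accumulator."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc

def saturate(acc, bits=32):
    """Clamp the accumulator to the largest or smallest value that fits in `bits` bits."""
    highest = (1 << (bits - 1)) - 1
    lowest = -(1 << (bits - 1))
    return max(lowest, min(highest, acc))

total = mac([32767] * 4, [32767] * 4)   # four maximum-size products
print(total, saturate(total))
```

The sum of four full-scale products no longer fits in 32 bits, but the extra accumulator bits hold it without wrapping, so a single saturation at the end gives the clipped result.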
Hardware Multiply/Accumulate (MAC) Unit
Two types of parallel MAC:
– component-wise accumulation
– across-component accumulation


ADSP-2100 Family DSP Microcomputers
• 16-Bit Fixed-Point DSP Microprocessors with On-Chip Memory
• Enhanced Harvard Architecture for Three-Bus Performance: Instruction Bus & Dual Data Buses
• Independent Computation Units: ALU, Multiplier/Accumulator, and Shifter
• Single-Cycle Instruction Execution & Multifunction Instructions
• On-Chip Program Memory RAM or ROM & Data Memory RAM
• Integrated I/O Peripherals: Serial Ports, Timer, Host Interface Port (ADSP-2111 Only)
• 25 MIPS, 40 ns Maximum Instruction Rate
• Separate On-Chip Buses for Program and Data Memory
• Program Memory Stores Both Instructions and Data
• Dual Data Address Generators with Modulo and Bit-Reverse Addressing
• Efficient Program Sequencing with Zero-Overhead Looping: Single-Cycle Loop Setup

ADSP-2100 Family DSP Microcomputers



Basic architecture of TMS320C54x family.

The fixed-point TMS320C54x family supports 16-bit data and has on-chip program memory and data memory in various sizes and configurations.

Figure: the typical TMS320C54x fixed-point DSP architecture.


The typical TMS320C3x floating-point DSP.


Block diagram of TMS320C67x floating-point DSP.


Registers of the TMS320C67x floating-point DSP.


DSP range of applications



Data representation for different DSPs

DSP device                          Word length (no. of bits)    Representation format
Texas Instruments TMS320C30         32                           Floating point
Texas Instruments TMS320C54x        16                           Fixed point
Texas Instruments TMS320C62xxx      16                           Fixed point
Texas Instruments TMS320C67xxx      32                           Floating point
Analog Devices DSP-2110             16                           Fixed point
Analog Devices SHARC-21061          32                           Floating point
Motorola DSP56001/2/9               24                           Fixed point
Motorola DSP96000                   32                           Floating point
Lucent Technologies DSP1600         16                           Fixed point
Lucent Technologies DSP16000        32                           Fixed point


DSP 320C6713



TMS320C6713

• Highest-Performance Floating-Point Digital Signal Processor (DSP): TMS320C6713
– Eight 32-Bit Instructions/Cycle
– 32/64-Bit Data Word
– 225-MHz (GDP), 150-MHz (PYP) Clock Rates
– 4.4-, 6.7-ns Instruction Cycle Time
– 1800 MIPS/1350 MFLOPS, 1200 MIPS/900 MFLOPS
– Rich Peripheral Set, Optimized for Audio
– Highly Optimized C/C++ Compiler
• VelociTI Advanced Very Long Instruction Word (VLIW) TMS320C67x DSP Core
– Eight Independent Functional Units:
  two ALUs (fixed-point), four ALUs (floating- and fixed-point), two multipliers (floating- and fixed-point)
– Load-Store Architecture With 32 32-Bit General-Purpose Registers
– Instruction Packing Reduces Code Size
– All Instructions Conditional


TMS320C6713
• L1/L2 Memory Architecture
– 4K-Byte L1P Program Cache (Direct-Mapped)
– 4K-Byte L1D Data Cache (2-Way)
– 256K-Byte L2 Memory Total: 64K-Byte L2 Unified Cache/Mapped RAM, and 192K-Byte Additional L2 Mapped RAM
• Device Configuration
– Boot Mode: HPI, 8-, 16-, 32-Bit ROM Boot
– Endianness: Little Endian, Big Endian
• 16-Bit Host-Port Interface (HPI)
• Two Multichannel Audio Serial Ports (McASPs)
• Two Inter-Integrated Circuit Bus (I2C Bus) Multi-Master and Slave Interfaces
• Two Multichannel Buffered Serial Ports
• Two 32-Bit General-Purpose Timers
• Dedicated GPIO Module With 16 Pins (External Interrupt Capable)
• Flexible Phase-Locked-Loop (PLL) Based Clock Generator Module
• IEEE-1149.1 (JTAG) Boundary-Scan-Compatible
• Package Options:
– 208-Pin PowerPAD Plastic (Low-Profile) Quad Flatpack (PYP)
– 272-Ball, Ball Grid Array Package (GDP)
• 0.13-μm/6-Level Copper Metal Process
– CMOS Technology
• 3.3-V I/Os, 1.2-V Internal (PYP)
• 3.3-V I/Os, 1.26-V Internal (GDP)


TMS320C67x Block Diagram
Functional block and CPU (DSP core) diagram


TMS320C67x Block Diagram
• One instruction is 32 bits. The program bus is 256 bits wide.
– Can execute up to 8 instructions per clock cycle (225 MHz → 4.4 ns clock cycle).
• 8 independent functional units:
– 2 multipliers
– 6 ALUs
• Code is efficient if all 8 functional units are always busy.
• Register files each have 16 general-purpose registers, each 32 bits wide (A0-A15, B0-B15).
• Data paths are each 64 bits wide.


C6713 Functional Units

• Two data paths (A & B)
• Data path A
– Multiply operations (.M1)
– Logical and arithmetic operations (.L1)
– Branch, bit manipulation, and arithmetic operations (.S1)
– Loading/storing and arithmetic operations (.D1)
• Data path B
– Multiply operations (.M2)
– Logical and arithmetic operations (.L2)
– Branch, bit manipulation, and arithmetic operations (.S2)
– Loading/storing and arithmetic operations (.D2)
• All data (not program) transfers go through .D1 and .D2


Fetch & Execute Packets

• The C6713 fetches 8 instructions at a time (256 bits)
• Definition: a “fetch packet” is a group of 8 instructions fetched at once.
• Coincidentally, the C6713 has 8 functional units.
– Ideally, all 8 instructions would be executed in parallel.
• Often this isn’t possible, e.g.:
– 3 multiplies (there are only two .M functional units)
– results of instruction 3 are needed by instruction 4 (must wait for 3 to complete)


Execute Packets
• Definition: an “execute packet” is a group of (8 or fewer) consecutive instructions in one fetch packet that can be executed in parallel.
• The C compiler provides a flag to indicate which instructions should be run in parallel.
• You have to do this manually in assembly using “||”.


C6713 Instruction Pipeline Overview
All instructions flow through the following steps (each step = 1 clock cycle):
1. Fetch
a) PG: Program address Generate
b) PS: Program address Send
c) PW: Program address ready Wait
d) PR: Program fetch packet Receive
2. Decode
a) DP: Instruction DisPatch
b) DC: Instruction DeCode
3. Execute
a) 10 phases labeled E1-E10
b) Fixed-point processors have only 5 phases (E1-E5)


Pipelining: Ideal Operation

Remarks:
• At clock cycle 11, the pipeline is “full”
• There are no holes (“bubbles”) in the pipeline in this example



Pipelining: “Actual” Operation

Remarks:
• Fetch packet n has 3 execution packets
• All subsequent fetch packets have 1 execution packet
• Notice the holes/bubbles in the pipeline caused by lack of parallelization
Execute Stage of C6713 Pipeline

• The C67x has 10 execute phases (floating point)
• The C62x/C64x have 5 execute phases (fixed point)
• Different types of instructions require different numbers of these phases to complete their execution
– anywhere between 1 and all 10 phases
– most instructions tie up their functional unit for only one phase (E1)


Execution Stage Examples (1)

– Results available after E1 (zero delay slots)
– Functional unit free after E1 (functional unit latency of 1)


Execution Stage Examples (2)

– Results available after E4 (3 delay slots)
– Functional unit free after E1 (functional unit latency of 1)


Execution Stage Examples (3)

– Results available after E10 (9 delay slots)
– Functional unit free after E4 (functional unit latency of 4)
Functional Latency & Delay Slots

• Functional unit latency: how long must we wait for the functional unit to be free?
• Delay slots: how long must we wait for the result?
• General remarks:
– Functional unit latency <= delay slots + 1
– Strange results will occur in ASM code if you don’t pay attention to delay slots and functional unit latency
– All problems can be resolved by “waiting” with NOPs
– Efficient ASM code tries to keep functional units busy all of the time
– Efficient code is hard to write (and follow)


DSP TMS 320C6713
Floating-Point Digital Signal Processor Device nomenclature



C6713 DSP Starter Kit (DSK)

The TMS320C6713 DSP Starter Kit (DSK), developed jointly with Spectrum Digital, is a low-cost development platform designed to speed the development of high-precision applications based on TI's TMS320C6000 floating-point DSP generation.
C6713 DSP Starter Kit (DSK)
Hardware features:
– Texas Instruments' TMS320C6713 DSP operating at 225 MHz
– Embedded USB JTAG controller with plug-and-play drivers, USB cable included
– TLV320AIC23 codec
– 2M x 32 on-board SDRAM
– 512K bytes of on-board Flash ROM
– 3 expansion connectors (Memory Interface, Peripheral Interface, and Host Port Interface)
– On-board IEEE 1149.1 JTAG connection for optional emulator debug
– Four 3.5 mm audio jacks (microphone, line in, speaker, and line out)
– 4 user-definable LEDs
– 4-position DIP switch, user definable
– +5 V operation only, power supply included
– Size: 8.25" x 4.5" (210 x 115 mm), 0.062" thick, 6 layers
– Compatible with Spectrum Digital's DSK Wire Wrap Prototype Card
Software features:
– TMS320C6713 DSK-specific Code Composer Studio from Texas Instruments
– Test/sample code provided to reduce coding time
– Compatible with National Instruments LabVIEW Embedded 2.0
– Compatible with JTAG emulators from Spectrum Digital
– Compatible with Windows 2000/XP


C6713 DSP Starter Kit (DSK)



C6713 DSP Starter Kit (DSK)



Is my DSK working?

DSK Power On Self Test

• Power up the DSK and watch the LEDs
• The Power On Self Test (POST) program stored in flash memory executes automatically
• POST takes 10-15 seconds to complete
• All DSK subsystems are automatically tested
• During POST, a 1 kHz sinusoid is output from the AIC23 codec for 1 second
– Listen with headphones or watch on an oscilloscope
• If POST is successful, all four LEDs blink 3 times and then remain on.


Interfacing with the Real World

TMS320C6713 DSK:
digital inputs = 4 DIP switches
digital outputs = 4 LEDs
ADC and DAC = AIC23 codec
