DSP Cours V2 PDF

2016/2017
Program: ING in Electrical Engineering

Semester: S4
Higher School of Technical Education

Electrical Engineering Department
Rabat, Morocco
Module element:
Signal Processing and Implementation on DSP
Digital Signal Processing and Applications with the
TMS320C6713 DSK
Prof A.JBARI
atmjbari@gmail.com
Pr. A.JBARI DSP implementation 1

Objectives
Become familiar with

DSP basics
TMS320C6713 floating point DSP architecture
TMS320C6713 DSP starter kit (DSK)
Code Composer Studio integrated development environment (IDE)
MATLAB design and analysis tools
Learn how to program the C6713
Writing and compiling code
Fixing errors
Downloading code to the target and executing
Debugging
Write and run useful programs on the C6713 DSK
Learn about DSP applications
Learn where to find help

Bibliography
Books:
– Digital Signal Processing Using MATLAB®, Third Edition, Vinay K. Ingle and John G. Proakis, Cengage Learning, 200 First Stamford Place, Suite 400, Stamford, CT 06902, USA.
– Digital Signal Processing: Fundamentals and Applications, Li Tan, DeVry University, Decatur, Georgia, Elsevier Inc., 2008.
– Digital Signal Processing and Applications with the C6713 and C6416 DSK, Rulph Chassaing, Worcester Polytechnic Institute, John Wiley & Sons, Inc., 2005.
– DSP Applications Using C and the TMS320C6x DSK, Rulph Chassaing, John Wiley & Sons, Inc., 2002.
Web documents and links:
• http://www.analog.com/media/en/technical-documentation/data-sheets/ADSP-2101_2103_2105_2115.pdf?doc=AD7475_7495.pdf
• ftp://ftp.analog.com/pub/cftl/ADI%20Classics/Mixed%20Signal%20and%20DSP%20Design%20Techniques,%202000/Section_7_DSP_Hardware.pdf
• dspworkshop_part1_2007.pdf, dspworkshop_part2_2007.pdf
• …

Outline
1. Data representation
2. Digital Signal Processing
a) Digital Systems and Fundamentals
b) Fast Fourier Transform (FFT)
c) Digital Filter Architectures
3. DSP Architectures
4. Programming the TMS320C6713 DSP
5. Applications

1. Data representation
Integer representation
Unsigned Integers: can represent zero and positive integers.

Signed Integers: can represent zero, positive and negative integers.
The most-significant bit (msb) is called the sign bit. The sign bit is used to represent
the sign of the integer - with 0 for positive integers and 1 for negative integers.
Representation | Positive (N) | Negative (−N)
Sign-magnitude | [(MSB=0) N] | [(MSB=1) magnitude(N)]
1's complement | [(MSB=0) N] | [(MSB=1) 1's complement(N)], i.e. all bits of N inverted
2's complement | [(MSB=0) N] | [(MSB=1) 2's complement(N)], i.e. 1's complement(N) + 1
Excess (bias B) | N + B | −N + B
Example (8-bit words; the excess representation uses the bias B = 63):
27 = 00011011
Representation | 27 | −27
Sign-magnitude | 00011011 | 10011011
1's complement | 00011011 | 11100100
2's complement | 00011011 | 11100101
Excess-63 | 27 + 63 = 90 = 01011010 | −27 + 63 = 36 = 00100100
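The four 8-bit encodings in the example can be reproduced with a short C sketch. The function names are hypothetical (not part of the course material), and the code assumes |n| ≤ 127 so that every result fits in one byte.

```c
#include <stdint.h>

/* Hypothetical helpers: 8-bit encodings of a signed integer n.
   Assumption: |n| <= 127, so every encoding fits in one byte. */

/* Sign-magnitude: MSB = sign bit, low 7 bits = |n|. */
uint8_t sign_magnitude8(int n)
{
    return (uint8_t)(n < 0 ? 0x80 | -n : n);
}

/* 1's complement: a negative value is |n| with all bits inverted. */
uint8_t ones_complement8(int n)
{
    return (uint8_t)(n < 0 ? ~(-n) : n);
}

/* 2's complement: invert the bits of |n| and add 1. */
uint8_t twos_complement8(int n)
{
    return (uint8_t)(n < 0 ? ~(-n) + 1 : n);
}

/* Excess-B (biased): store n + B as an unsigned number.
   Assumption: 0 <= n + bias <= 255. */
uint8_t excess8(int n, int bias)
{
    return (uint8_t)(n + bias);
}
```

For example, twos_complement8(-27) returns 0xE5 (11100101) and excess8(-27, 63) returns 0x24 (00100100), matching the table.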

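Types: computers can manipulate integers of various lengths (formats). With two's complement representation:
Format | Type | Range
8 bits | byte (Java); signed char (C, C++) | [−128, 127]
16 bits | short (Java, C, C++) | [−32768, 32767]
32 bits | int (Java, C, C++) | [−2^31, 2^31 − 1]
64 bits | long (Java); long long (C, C++) | [−2^63, 2^63 − 1]
In C and C++, the exact widths of short, int and long depend on the compiler and target; the standards only guarantee minimum widths. long long is at least 64 bits (64 bits on common platforms), not 128 bits.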

Sign extension
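Sign extension (widening conversion): to widen a two's complement number, copy its sign bit into all the new high-order bits; the value is unchanged.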
– 8 bit 2s compl. repr. for 7 is: 00000111
– 16 bit 2s compl. repr. for 7 is: 0000000000000111
– 8 bit 2s compl. repr. for -7 is: 11111001
– 16 bit 2s compl. repr. for -7 is: 1111111111111001
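Sign extension can be sketched in C as follows. The helper name is hypothetical, and it assumes the field width is between 1 and 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low `bits` bits of x
   (a two's complement field) to a 32-bit signed integer.
   Assumption: 1 <= bits <= 32. */
int32_t sign_extend(uint32_t x, unsigned bits)
{
    uint64_t field = x & ((1ull << bits) - 1u);   /* keep the field only */
    int64_t value = (int64_t)field;
    if (field & (1ull << (bits - 1)))            /* sign bit set? */
        value -= (int64_t)1 << bits;              /* negative: subtract 2^bits */
    /* as a 32-bit pattern, the upper bits are now copies of the sign bit */
    return (int32_t)value;
}
```

In everyday C code, converting an int8_t variable holding −7 to int16_t performs the same sign extension implicitly.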

Fixed-point representation
To define a fixed-point type conceptually, all we need are two parameters: the width of the number representation, and the position of the binary point within the number:
Real number = integer part . fractional part
Notation: fixed<w,b>: w denotes the number of bits used as a whole (the width of the number), and b denotes the position of the binary point counted from the least significant bit (counting from 0), i.e. the number of fractional bits.
• Examples:
N = 15.75 = 1111.11 (binary)
Format fixed<8,3>: 01111110 (15.75 × 2^3 = 126)
Format fixed<32,16>: 00000000000011111100000000000000 (15.75 × 2^16 = 1032192 = 000FC000h)
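Encoding in fixed<w,b> amounts to scaling by 2^b and rounding. The C sketch below uses hypothetical helper names and assumes two's complement for negative values (the notation above does not specify signedness), 1 ≤ w ≤ 32, b < 32, and values that fit in the format (no saturation).

```c
#include <stdint.h>

/* Hypothetical helpers for fixed<w,b>: w total bits, b fractional bits.
   Assumptions: two's complement for negative values, 1 <= w <= 32,
   b < 32, and the value fits in the format (no saturation). */

/* Encode: multiply by 2^b, round to nearest, keep the low w bits. */
uint32_t to_fixed(double value, unsigned w, unsigned b)
{
    double scaled = value * (double)(1ull << b);
    int64_t raw = (int64_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    return (uint32_t)((uint64_t)raw & ((1ull << w) - 1u));
}

/* Decode: reinterpret the low w bits as signed and divide by 2^b. */
double from_fixed(uint32_t raw, unsigned w, unsigned b)
{
    int64_t v = (int64_t)(raw & ((1ull << w) - 1u));
    if (v & (1ll << (w - 1)))        /* sign bit set: negative value */
        v -= 1ll << w;
    return (double)v / (double)(1ull << b);
}
```

With these helpers, to_fixed(15.75, 8, 3) gives 0x7E (01111110) and to_fixed(15.75, 32, 16) gives 0x000FC000, as in the examples.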

Floating-point representation
A floating-point number is typically expressed in scientific notation with a fraction (F) (mantissa) and an exponent (E) of a certain radix (r), in the form
N = F × r^E.
Decimal numbers use radix 10 (F × 10^E), while binary numbers use radix 2 (F × 2^E).
• Examples: N = 48
r = 10: N = 4.8 × 10^1: F = 4.8; E = 1
r = 2: N = 48 = 1.5 × 2^5: F = 1.5; E = 5

IEEE-754 32-bit Single-Precision Floating-Point Numbers
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 8 bits represent the exponent (E) in excess-127 (bias 127) form.
The remaining 23 bits represent the fraction (F).
Example: N = −48 = −1.5 × 2^5, so S = 1, E = 5 + 127 = 132 = 10000100 and F = .10000000000000000000000 (0.5):
N = 1 10000100 10000000000000000000000
= C2400000h
The value (N) is calculated as follows:

Normalized form: for 1 ≤ E ≤ 254, N = (−1)^S × 1.F × 2^(E−127).
Denormalized form: for E = 0, N = (−1)^S × 0.F × 2^(−126).
For E = 255, N represents special values such as ±INF (infinity) when F = 0, and NaN (not a number) when F ≠ 0.
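The fields of −48 can be checked by reading the raw bits of a C float. This is a sketch with hypothetical names; it assumes float is the IEEE-754 32-bit format (the C67x uses IEEE-754 single and double precision, as do virtually all current platforms).

```c
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE-754 single-precision number.
   Assumption: float is the IEEE-754 32-bit format. */
typedef struct {
    unsigned sign;      /* S: 1 bit */
    unsigned exponent;  /* E: 8 bits, excess-127 */
    uint32_t fraction;  /* F: 23 bits */
} Float32Fields;

/* Raw bit pattern of f; memcpy avoids pointer-aliasing problems. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Split the bit pattern into S, E and F. */
Float32Fields float_fields(float f)
{
    uint32_t u = float_bits(f);
    Float32Fields r;
    r.sign = (unsigned)(u >> 31);
    r.exponent = (unsigned)((u >> 23) & 0xFFu);
    r.fraction = u & 0x7FFFFFu;
    return r;
}
```

For −48.0f this yields the pattern C2400000h, with S = 1, E = 132 and F = 400000h (the binary fraction .1000…).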
IEEE-754 64-bit Double-Precision Floating-Point Numbers
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 11 bits represent the exponent (E) in excess-1023 (bias 1023) form.
The remaining 52 bits represent the fraction (F).
The value (N) is calculated as follows:

Normalized form: for 1 ≤ E ≤ 2046, N = (−1)^S × 1.F × 2^(E−1023).
Denormalized form: for E = 0, N = (−1)^S × 0.F × 2^(−1022).
For E = 2047, N represents special values such as ±INF (infinity) when F = 0, and NaN (not a number) when F ≠ 0.
Big Endian vs. Little Endian
Modern computers store one byte of data in each memory address or location, i.e., byte-addressable memory. A 32-bit integer is, therefore, stored in 4 memory addresses.
The term "Endian" refers to the order of storing bytes in computer memory. In the "Big Endian" scheme, the most significant byte is stored first, at the lowest memory address ("big end in first"), while "Little Endian" stores the least significant byte at the lowest memory address.
Examples:
The 32-bit integer 12345678H (305419896 in decimal) is stored as:
– in big endian: 12H 34H 56H 78H
– in little endian: 78H 56H 34H 12H.
A 16-bit integer stored as the bytes 00H 01H (in increasing address order) is interpreted as 0001H in big endian, and as 0100H in little endian.
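The byte orders above can be made explicit in C by assembling integers from individual bytes. The helper names are hypothetical; host_is_little_endian() simply inspects the byte stored at the lowest address of a 32-bit variable.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the least significant byte at the lowest address. */
int host_is_little_endian(void)
{
    uint32_t x = 0x12345678u;
    uint8_t first;
    memcpy(&first, &x, 1);          /* byte at the lowest address */
    return first == 0x78;
}

/* Hypothetical helpers: rebuild an integer from bytes p[0], p[1], ...
   (p[0] at the lowest address) using an explicit byte order. */
uint16_t load_u16(const uint8_t p[2], int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

uint32_t load_u32(const uint8_t p[4], int big_endian)
{
    if (big_endian)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

Reading the bytes 00H 01H with load_u16 gives 0001H in big-endian order and 0100H in little-endian order, as stated above. Building values byte by byte like this keeps code independent of the host's own byte order.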

2. Digital Signal Processing
Analog & digital signals
Analog signal: continuous function V of a continuous variable t (time, space, etc.): V(t).
Digital (sampled) signal: discrete function V_k of a discrete sampling variable t_k, with k an integer: V_k = V(t_k).
Sampling operation: uniform (periodic) sampling, t_k = k·T_s.

Sampling frequency: f_s = 1/T_s

Digital vs analog processing
Digital Signal Processing (DSPing)
Advantages:
– More flexible.
– Often easier system upgrade.
– Data easily stored in memory.
– Better control over accuracy requirements.
– Reproducibility.
– Linear phase (achievable exactly, e.g. with FIR filters).
– No drift with time and temperature.
Limitations:
– A/D and signal-processor speed: wide-band signals are still difficult to treat (real-time systems).
– Finite word-length effects.
A digital signal processing scheme
Analog filter: to limit the frequency range of analog signals prior to the sampling process
and attenuate aliasing distortion (Antialias filter).
Analog-to-digital conversion (ADC) unit: to sample and convert band-limited signal into
the digital signal, which is discrete both in time and in amplitude.
Digital signal (DS) processor: processes the digital data according to DSP rules such as
lowpass, highpass, and bandpass digital filtering, or other algorithms for different
applications.
Digital-to-analog conversion (DAC) unit: converts the processed digital signal to an analog output signal, which is continuous in time and discrete in amplitude (held voltage levels).
Reconstruction (anti-image) filter: to smooth the DAC output voltage levels back to the
analog signal for real-world applications.
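The amplitude quantization performed by the ADC, and the discrete output levels of the DAC, can be modeled with a simple uniform quantizer. This C sketch is an idealized, hypothetical model (no particular converter, no timing); it assumes an input range [x_min, x_max] split into 2^n − 1 equal steps, 1 ≤ n_bits ≤ 30, and x_max > x_min.

```c
/* Hypothetical model of an ideal n-bit ADC/DAC pair (not a specific device).
   The input range [x_min, x_max] is split into 2^n - 1 equal steps.
   Assumptions: 1 <= n_bits <= 30 and x_max > x_min. */

/* ADC: clip the input to the range, then round to the nearest code. */
int adc_quantize(double x, double x_min, double x_max, int n_bits)
{
    int max_code = (1 << n_bits) - 1;
    double step = (x_max - x_min) / max_code;     /* quantization step */
    double t = (x - x_min) / step;
    if (t < 0.0) t = 0.0;                         /* clip below the range */
    if (t > max_code) t = max_code;               /* clip above the range */
    return (int)(t + 0.5);                        /* round to nearest code */
}

/* DAC: map a code back to its discrete output level. */
double dac_level(int code, double x_min, double x_max, int n_bits)
{
    int max_code = (1 << n_bits) - 1;
    return x_min + code * (x_max - x_min) / max_code;
}
```

With a 3-bit converter over [0, 7], an input of 3.4 maps to code 3 and 3.6 to code 4; out-of-range inputs are clipped to codes 0 and 7. The difference between the input and dac_level(code, …) is the quantization error, at most half a step inside the range.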

Digital signal processing
Digital filtering: the DSP block operates as a simple digital lowpass filter.

Signal spectral (frequency) analysis (spectrum).

Interference cancellation in electrocardiography.

Typical DSP applications
Communication systems: modulation/demodulation, channel equalization, echo cancellation
Consumer electronics: perceptual coding of audio and video on DVDs, speech synthesis, speech recognition
Music: synthetic instruments, audio effects, noise reduction
Medical diagnostics: magnetic-resonance and ultrasonic imaging, computer tomography, ECG, EEG, MEG, AED, audiology
Geophysics: seismology, oil exploration
Astronomy: VLBI, speckle interferometry
Experimental physics: sensor-data evaluation
Aviation: radar, radio navigation
Security: steganography, digital watermarking, biometric identification, surveillance systems, signals intelligence, electronic warfare
Engineering: control systems, feature extraction for pattern recognition
Signal Sampling
Sampling process: sample-and-hold circuit architectures:
– Open-loop architecture
– Closed-loop architecture with follower output
Signal Sampling
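The simplified sampling process multiplies the continuous signal x(t) by the pulse train p(t), giving the sampled signal x_s(t):
$$p(t) = \sum_{n=-\infty}^{+\infty} \delta(t - nT_s), \qquad x_s(t) = x(t)\,p(t) = \sum_{n=-\infty}^{+\infty} x(nT_s)\,\delta(t - nT_s)$$
Spectral analysis:
$$X_s(f) = X(f) * P(f) = \frac{1}{T_s} \sum_{n=-\infty}^{+\infty} X(f - n f_s)$$
The sampled signal spectrum is the sum of the scaled original spectrum and copies of its shifted versions, called replicas.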
Signal Sampling
Original baseband spectrum X(f), with highest frequency f_max, and sampled signal spectrum X_s(f):
– f_s > 2f_max: the baseband spectrum and its replicas are separated.
– f_s = 2f_max: the baseband spectrum and its replicas are just connected.
– f_s < 2f_max: the baseband spectrum and its replicas overlap (aliasing).

Signal Sampling
To obtain exact reconstruction of the original signal spectrum by applying a lowpass reconstruction filter, the following condition must be satisfied:
$$f_s \ge 2 f_{max}$$
Shannon sampling theorem:
For a uniformly sampled DSP system, an analog signal can be perfectly recovered as long as the sampling rate is at least twice as large as the highest-frequency component of the analog signal to be sampled.
Half of the sampling frequency, f_s/2, is usually called the Nyquist frequency (Nyquist limit), or folding frequency.
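The sampling condition and the folding about f_s/2 can be checked numerically. The C sketch below uses hypothetical helper names; it assumes frequencies in Hz, f ≥ 0 and f_s > 0, and considers a single real sinusoid.

```c
/* Hypothetical helpers for checking a sampling rate.
   Assumptions: frequencies in Hz, f >= 0, fs > 0. */

/* Shannon condition from the text: fs >= 2 * f_max. */
int nyquist_ok(double f_max, double fs)
{
    return fs >= 2.0 * f_max;
}

/* Apparent frequency of a real sinusoid at f after sampling at fs:
   reduce f modulo fs, then fold it about the Nyquist frequency fs/2. */
double alias_frequency(double f, double fs)
{
    double r = f - fs * (double)(long long)(f / fs);   /* r in [0, fs) */
    return (r > fs / 2.0) ? fs - r : r;
}
```

For instance, with f_s = 8000 Hz a 5000 Hz tone violates the condition and appears at 3000 Hz after sampling, while a 9000 Hz tone appears at 1000 Hz.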

Signal Reconstruction
Recover analog signal from its sampled signal version.
Condition: the spectrum of the sampled signal y_s(t) contains the same spectral content as the original spectrum X(f) (no overlapping replicas), so that a lowpass reconstruction filter can recover it.

Digital Systems
A discrete-time system is a device or algorithm that operates on an input sequence according to some computational procedure.
It may be:
– A general purpose computer
– A microprocessor
– dedicated hardware
– A combination of all these

Digital filters
Digital Filter: numerical procedure or algorithm that transforms a given sequence of numbers into a second sequence that has some more desirable properties.
x(n) → [digital filter] → y(n)
The linear time-invariant digital filter can then be described by a linear difference equation:
Finite Impulse Response (FIR) Filters:
$$y(n) = \sum_{k=0}^{M} b_k\, x(n-k), \qquad H(z) = \sum_{k=0}^{M} b_k\, z^{-k}$$
Infinite Impulse Response (IIR) Filters:
$$y(n) = \sum_{k=0}^{M} b_k\, x(n-k) - \sum_{k=1}^{N} a_k\, y(n-k), \qquad H(z) = \frac{\sum_{k=0}^{M} b_k\, z^{-k}}{1 + \sum_{k=1}^{N} a_k\, z^{-k}}$$

Digital Hardware implementation
TRANSVERSAL IMPLEMENTATION OF AN FIR FILTER (Tapped Delay Line)
For an FIR filter of order N (N+1 coefficients b_0 to b_N):
Requirements: N memory locations for storing previous inputs.

Complexity: N+1 multiplications and N additions per output sample.
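A transversal FIR filter maps directly onto a loop over a delay line. This C sketch uses hypothetical names and assumes zero initial conditions (the delay array starts at zero); it performs the N+1 multiplications and N additions counted above.

```c
/* Hypothetical transversal (tapped-delay-line) FIR filter of order N:
   y(n) = b[0]x(n) + b[1]x(n-1) + ... + b[N]x(n-N).
   delay[] holds the N previous inputs x(n-1) ... x(n-N). */
typedef struct {
    int order;          /* N */
    const float *b;     /* N+1 coefficients */
    float *delay;       /* N memory locations, initially zero */
} FirFilter;

float fir_process(FirFilter *f, float x)
{
    float y = f->b[0] * x;                     /* first multiplication */
    for (int k = 1; k <= f->order; ++k)
        y += f->b[k] * f->delay[k - 1];        /* N multiply-adds */

    /* shift the delay line by one sample; the oldest value is dropped */
    for (int k = f->order - 1; k > 0; --k)
        f->delay[k] = f->delay[k - 1];
    if (f->order > 0)
        f->delay[0] = x;
    return y;
}
```

The shifting loop copies N−1 values per sample; this data movement is exactly what circular addressing avoids on a DSP.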

TRANSPOSED FIR FILTER IMPLEMENTATION

IIR FILTER DIRECT FORM 1

IIR FILTER DIRECT FORM 2
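A second-order IIR section in direct form 2 can be sketched in C as follows. Direct form 2 keeps a single delay line w(n) shared by the feedback and feedforward parts, so a second-order section needs two memory locations instead of four in direct form 1. The names are hypothetical, and the denominator convention 1 + a1 z^-1 + a2 z^-2 matches the IIR transfer function above.

```c
/* Hypothetical second-order IIR section (biquad), direct form 2:
   w(n) = x(n) - a1 w(n-1) - a2 w(n-2)
   y(n) = b0 w(n) + b1 w(n-1) + b2 w(n-2)
   H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2) */
typedef struct {
    float b0, b1, b2;   /* numerator (feedforward) coefficients */
    float a1, a2;       /* denominator (feedback) coefficients, a0 = 1 */
    float w1, w2;       /* state: w(n-1), w(n-2), initially zero */
} Biquad;

float biquad_df2(Biquad *s, float x)
{
    float w0 = x - s->a1 * s->w1 - s->a2 * s->w2;          /* feedback part */
    float y  = s->b0 * w0 + s->b1 * s->w1 + s->b2 * s->w2; /* feedforward part */
    s->w2 = s->w1;      /* shift the shared delay line */
    s->w1 = w0;
    return y;
}
```

For example, b0 = 1, a1 = −0.5 and all other coefficients zero give y(n) = x(n) + 0.5 y(n−1), whose impulse response is 1, 0.5, 0.25, …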

Parallel multiplier/accumulator cell FIR filter implementation

Discrete Fourier Transform (DFT)
• The DFT provides uniformly spaced samples of the Discrete-Time Fourier Transform (DTFT).
• DFT definition (and inverse DFT):
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi nk}{N}}, \qquad x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\, e^{j\frac{2\pi nk}{N}}$$
• Requires:
– Complex multiplications: N^2
– Complex additions: N(N−1)
– Real multiplications: 4N^2
– Real additions: 2N(2N−1)

Discrete Fourier Transform (DFT)
• Total computation complexity (complex operations):
T = N^2 + N(N−1) = 2N^2 − N ≡ O(N^2)
Example: if each operation requires 1 μs:
N = 1000: T ≈ 2 000 000 operations ≈ 2 s
N = 5000: T ≈ 50 000 000 operations ≈ 50 s
Although the DFT is a straightforward technique for obtaining the frequency response of a sequence, it requires a large number of complex operations (additions and multiplications).

Faster DFT computation?
Take advantage of the symmetry and periodicity of the complex exponential (let $W_N = e^{-j2\pi/N}$):
– Symmetry: $W_N^{k[N-n]} = W_N^{-kn} = (W_N^{kn})^*$
– Periodicity: $W_N^{kn} = W_N^{k[n+N]} = W_N^{[k+N]n}$
– Recursion property: $W_N^2 = W_{N/2}$
Note that two length-N/2 DFTs take less computation than one length-N DFT: $2(N/2)^2 = N^2/2 < N^2$.
Algorithms that exploit computational savings are
collectively called Fast Fourier Transforms.

Decimation-in-Time Algorithm
FFT: Fast Fourier Transform (Cooley and Tukey, 1965)

Consider expressing the DFT with even and odd input samples:
$$\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\,W_N^{nk} = \sum_{n\ \text{even}} x[n]\,W_N^{nk} + \sum_{n\ \text{odd}} x[n]\,W_N^{nk} \\
&= \sum_{r=0}^{N/2-1} x[2r]\,(W_N^2)^{rk} + W_N^k \sum_{r=0}^{N/2-1} x[2r+1]\,(W_N^2)^{rk} \\
&= \sum_{r=0}^{N/2-1} x[2r]\,W_{N/2}^{rk} + W_N^k \sum_{r=0}^{N/2-1} x[2r+1]\,W_{N/2}^{rk}
\end{aligned}$$

FFT Algorithm
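The result is the sum of two N/2-length DFTs:
$$X[k] = \underbrace{G[k]}_{N/2\ \text{DFT of even samples}} + W_N^k \cdot \underbrace{H[k]}_{N/2\ \text{DFT of odd samples}}$$
Then repeat the decomposition of N/2 to N/4 DFTs, etc.

The cross feed of G[k] and H[k] in the flow diagram is called a “butterfly”, due to its shape. Since $W_N^{r+N/2} = -W_N^r$, the butterfly can be simplified to a single multiplication by $W_N^r$ followed by a branch weighted by −1:
$$X[k] = G[k] + W_N^k H[k], \qquad X[k+N/2] = G[k] - W_N^k H[k], \qquad 0 \le k < N/2.$$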

FFT Algorithm
For N=8 :

Detail of “Butterfly”

FFT Algorithm
Repeat the same process: divide each N/2-point DFT into

- two N/4-point DFTs
- combine their outputs

FFT Algorithm
• After two steps of decimation in time:

FFT Algorithm
Flow graph for 8-point decimation in time:
The flow-graph consists of 3 stages

First stage computes the four 2-point DFTs
Second stage computes the two 4-point DFTs
Last stage computes the desired 8-point DFT

FFT Algorithm
How much computation?
For N = 2^M points:
Total number of stages: M = log2 N
Butterflies per stage: N/2, i.e. (N/2) log2 N butterflies in total
Each butterfly: 1 complex multiplication and 2 complex additions.
Total computational complexity (complex operations):
T = (N/2) log2 N + N log2 N = (3N/2) log2 N ≡ O(N log2 N)
Algorithm | Complex multiplications | Complex additions
DFT | O(N^2) | O(N(N−1))
FFT | O((N/2) log2 N) | O(N log2 N)

FFT Algorithm

Computation on DSP
Input and Output data

– Real data in X memory
– Imaginary data in Y memory
Coefficients (“twiddle” factors)
– cos (real) values in X memory
– sin (imag) values in Y memory
Inverse computed with exponent sign change
and 1/N scaling

Kernels for Digital Signal Processing
Filtering, convolution: MAC (multiplication-accumulation)
Adaptation: MAD (multiplication-addition)
FFT: complex multiplication
Viterbi decoding: ACS (Add-Compare-Select)
Motion estimation: SAD (Sum of Absolute Differences)

3. DSP architectures and features

General Architectures
Accumulator architecture
Load-store architecture
Memory-register architecture
(Figure labels: register file, on-chip memory)

Harvard architecture
VON NEUMANN Architecture:
• unified external memory for program and data
• all operands in registers
HARVARD Architecture (IBM in 1944 at Harvard University):
• separate program and data memories
• operands also in memory
• concurrent access to
  • the instruction word
  • one or several data words
Example: MPYF3 *(AR0)++, *(AR1)++, R0
– the instruction is fetched from program memory;
– the first operand is read from data memory (address in address register AR0);
– the second operand is read from data memory (address in address register AR1);
– the result is stored in data register R0, and both address registers are then incremented.
Classic DSP characteristics
Explicit parallelism
– Harvard architecture for concurrent data access
– concurrent operations on data and addresses
Optimized control flow and background processing
– zero-overhead loops
– DMA controllers
Special addressing modes
– distinction of address, data and modifier registers
– versatile address computation for indirect addressing
Specialized instructions
– single-cycle hardware multiplier
– multiply accumulate instruction (MAC)

Specialized addressing modes
many DSPs distinguish address registers from data registers

Additional ALUs for address computations
– useful for indirect addressing (register points to operand in memory)
ADDF3 *(AR0)++, R1, R1
– operations on address registers in parallel with operations on data
registers, no extra cycles
– behavior depends on instruction and contents of special purpose
registers (modifier registers)
Typical address update functions
– increment/decrement by 1 (AR0++, AR0--)
– increment/decrement by constant specified in modifier register
(AR0 += MR0, AR0 -= MR5)
– circular addressing (AR0 += 1 if AR0 < upper limit, else AR0 = base address),
– bit-reverse addressing, …

Circular addressing
Goal: implementation of ring buffers in linear address space
– implementation variants
copy data with data access, or
use circular addressing (don’t copy data, wrap pointers)
– supported by addressing modes
data access and move operations
increment operators that wrap around at buffer boundaries
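Circular addressing can be emulated in C with an index that wraps at the end of the buffer, so that no data is copied when a new sample arrives. This sketch uses hypothetical names and a hypothetical buffer length of 8; it uses the buffer as an FIR delay line with zero initial conditions.

```c
/* Hypothetical ring buffer used as an FIR delay line (names are illustrative).
   Instead of shifting data, only the index moves and wraps at the end of
   the buffer, as circular addressing does in hardware. */
#define RING_LEN 8u   /* assumption: buffer length; with a power of two,
                         "% RING_LEN" can also be done with "& (RING_LEN - 1)" */

typedef struct {
    float data[RING_LEN];
    unsigned head;          /* index of the newest sample */
} RingBuffer;

/* Store a new sample: circular increment of the index, no data copying. */
void ring_push(RingBuffer *rb, float x)
{
    rb->head = (rb->head + 1u) % RING_LEN;
    rb->data[rb->head] = x;
}

/* Sample written `delay` steps ago (delay = 0 is the newest).
   Assumption: delay < RING_LEN. */
float ring_get(const RingBuffer *rb, unsigned delay)
{
    return rb->data[(rb->head + RING_LEN - delay) % RING_LEN];
}

/* y(n) = sum_{k=0}^{n_taps-1} b[k] x(n-k); assumes n_taps <= RING_LEN. */
float ring_fir(const RingBuffer *rb, const float *b, unsigned n_taps)
{
    float y = 0.0f;
    for (unsigned k = 0; k < n_taps; ++k)
        y += b[k] * ring_get(rb, k);
    return y;
}
```

After pushing the samples 1, 2, …, 10 into an 8-entry buffer, ring_get(&rb, 0) returns 10 and ring_get(&rb, 7) returns 3; the older values have been overwritten in place.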

Bit-reverse addressing
Goal: accelerate the FFT operation
– very important DSP operation
– transforms signals between time and frequency representations
– basic operation in many DSP algorithms
Bit-reverse addressing: mirror the address bits (bit reverse). Another method to compute the sequence: add N/2 with reverse-carry arithmetic.
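Both methods on the slide, mirroring the address bits and adding N/2 with reverse-carry arithmetic, can be sketched in C. The function names are hypothetical, and N is assumed to be a power of 2.

```c
/* Hypothetical helper: reverse the lowest `bits` bits of index i
   (mirror the address bits). Example for N = 8 (3 bits): 001b -> 100b. */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);   /* move the LSB of i into r */
        i >>= 1;
    }
    return r;
}

/* Hypothetical helper: reverse-carry increment, i.e. add N/2 to r with the
   carry propagating from the MSB toward the LSB (N a power of 2).
   Starting from 0, successive calls give the bit-reversed sequence. */
unsigned reverse_carry_next(unsigned r, unsigned N)
{
    unsigned bit = N >> 1;
    while (r & bit) {              /* bit is 1: clear it, carry to the right */
        r &= ~bit;
        bit >>= 1;
    }
    return r | bit;
}
```

For N = 8, both methods produce the order 0, 4, 2, 6, 1, 5, 3, 7, which is the input ordering required by an in-place decimation-in-time FFT.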

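Zero-overhead loops
Goal: reduce the overhead of executing loops.
– General-purpose processors: initialize the loop counter, execute the loop body, check the loop exit condition, then branch to the loop start or exit the loop; the check and branch cost extra instructions in every iteration.
– Digital signal processors: initialize the loop counter and execute the loop body; the exit check and the branch back are handled by dedicated loop hardware, at no extra cycle cost.
Example (TMS320C3x-like assembler): add the first 100 values in array a and store the result in R1.
    LDI   @a, AR0              ; AR0 = start address of array a
    LDF   0.0, R1              ; R1 = 0.0 (floating-point load)
    RPTS  99                   ; repeat the next instruction 99 + 1 = 100 times
    ADDF3 *(AR0)++, R1, R1     ; R1 = R1 + *AR0, then AR0 = AR0 + 1
    …
RPTS N repeats the next instruction N+1 times.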

DSP Architecture
The DSP can fetch the program instruction and data in parallel, at the same time.
The multiplier and accumulator (MAC) is used for the digital filtering operation.
The shift unit is used for the scaling operation in fixed-point implementations when the processor performs digital filtering.
ARITHMETIC LOGIC UNIT (ALU) FEATURES
■ Add, Subtract, Negate, Increment, Decrement, Absolute Value, AND, OR, Exclusive
OR, NOT
■ Bitwise Operators, Constant Operators
■ Multi-Precision Math Capabilities
■ Divide Primitives
■ Saturation Mode for Overflow Support
■ Background Registers for Single-Cycle Context Switch
■ Example Instructions:
◆ IF EQ AR = AX0 + AY0;
◆ AF = MR1 XOR AY1;
◆ AR = TGLBIT 7 OF AX1;

MULTIPLY-ACCUMULATOR (MAC) FEATURES
■ Single-Cycle Multiply, Multiply-Add, Multiply-Subtract
■ 40-Bit Accumulator for Overflow Protection (219x Adds Second 40-Bit Accumulator)
■ Saturation Instruction Performs Single Cycle Overflow Cleanup
■ Background Registers for Single-Cycle Context Switch
■ Example MAC Instructions:

◆ MR = MX0 * MY0(US);
◆ IF MV SAT MR;
◆ MR = MR - AR * MY1(SS);
◆ MR = MR + MX1 * MY0(RND);
◆ IF LT MR = MX0 * MX0(UU);

Hardware Multiply/Accumulate (MAC) Unit
Specialized data path for DSP
MAC instructions:
    [IF cond] MR|MF = xop * yop          ; Multiply
    [IF cond] MR|MF = MR + xop * yop     ; Multiply/Accumulate
    [IF cond] MR|MF = MR – xop * yop     ; Multiply/Subtract
    [IF cond] MR|MF = MR                 ; Transfer MR
    [IF cond] MR|MF = 0                  ; Clear
    IF MV SAT MR                         ; Conditional MR Saturation
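The effect of a wide accumulator followed by saturation can be modeled in C. This is a hypothetical, simplified model (not cycle-accurate and not any vendor's API): 16-bit Q15 operands, an int64_t standing in for a 40-bit accumulator such as MR, and clipping of the final Q15 result, similar in spirit to the conditional saturation instruction.

```c
#include <stdint.h>

/* Hypothetical model of a fixed-point MAC loop with saturation.
   int64_t stands in for a 40-bit accumulator: with 8 guard bits above a
   32-bit product, at least 256 full-scale Q30 products can be summed
   without overflow. */
int16_t q15_dot_sat(const int16_t *x, const int16_t *y, int n)
{
    int64_t acc = 0;                                  /* MR = 0 (clear) */
    for (int i = 0; i < n; ++i)
        acc += (int32_t)x[i] * (int32_t)y[i];         /* MR = MR + xop * yop (Q30) */

    acc /= 32768;                                     /* Q30 -> Q15, truncates toward zero */
    if (acc > INT16_MAX) acc = INT16_MAX;             /* saturate instead of wrapping */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

For example, 0.5 × 0.5 + 0.5 × 0.5 (16384 in Q15) gives 16384 (0.5), while a sum that exceeds the Q15 range is clipped to 32767 instead of wrapping around to a negative value.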
Hardware Multiply/Accumulate (MAC) Unit
Two types of parallel MAC: componentwise accumulation and across-component accumulation.

ADSP-2100 Family DSP Microcomputers
16-Bit Fixed-Point DSP Microprocessors with On-Chip Memory
Enhanced Harvard Architecture for Three-Bus
Performance: Instruction Bus & Dual Data Buses
Independent Computation Units: ALU, Multiplier/Accumulator, and Shifter
Single-Cycle Instruction Execution & Multifunction Instructions
On-Chip Program Memory RAM or ROM & Data Memory RAM
Integrated I/O Peripherals: Serial Ports, Timer, Host Interface Port (ADSP-2111 Only)
25 MIPS, 40 ns Maximum Instruction Rate
Separate On-Chip Buses for Program and Data Memory
Program Memory Stores Both Instructions and Data
Dual Data Address Generators with Modulo and
Bit-Reverse Addressing
Efficient Program Sequencing with Zero-Overhead
Looping: Single-Cycle Loop Setup
ADSP-2100 Family DSP Microcomputers

Basic architecture of the TMS320C54x family.
The fixed-point TMS320C54x families supporting 16-bit data have on-chip program memory and data memory in various sizes and configurations.
The typical TMS320C54x fixed-point DSP architecture.

The typical TMS320C3x floating-point DSP.

Block diagram of the TMS320C67x floating-point DSP.

Registers of the TMS320C67x floating-point DSP.

DSP range of applications

Data representation for different DSPs
DSP device | Word length (no. of bits) | Representation format
Texas Instruments TMS320C30 | 32 | Floating point
Texas Instruments TMS320C54x | 16 | Fixed point
Texas Instruments TMS320C62xxx | 16 | Fixed point
Texas Instruments TMS320C67xxx | 32 | Floating point
Analog Devices DSP-2110 | 16 | Fixed point
Analog Devices SHARC-21061 | 32 | Floating point
Motorola DSP56001/2/9 | 24 | Fixed point
Motorola DSP96000 | 32 | Floating point
Lucent Technologies DSP1600 | 16 | Fixed point
Lucent Technologies DSP16000 | 32 | Fixed point

DSP 320C6713

TMS320C6713
Highest-Performance Floating-Point Digital Signal Processor (DSP): TMS320C6713
– Eight 32-Bit Instructions/Cycle
– 32/64-Bit Data Word
– 225-MHz (GDP), 150-MHz (PYP) Clock Rates
– 4.4-, 6.7-ns Instruction Cycle Time
– 1800 MIPS/1350 MFLOPS, 1200 MIPS/900 MFLOPS
– Rich Peripheral Set, Optimized for Audio
– Highly Optimized C/C++ Compiler
VelociTI Advanced Very Long Instruction Word (VLIW) TMS320C67x DSP Core
– Eight Independent Functional Units:
  – Two ALUs (Fixed-Point)
  – Four ALUs (Floating- and Fixed-Point)
  – Two Multipliers (Floating- and Fixed-Point)
– Load-Store Architecture With 32 32-Bit General-Purpose Registers
– Instruction Packing Reduces Code Size
– All Instructions Conditional

TMS320C6713
L1/L2 Memory Architecture
– 4K-Byte L1P Program Cache (Direct-Mapped)
– 4K-Byte L1D Data Cache (2-Way)
– 256K-Byte L2 Memory Total: 64K-Byte L2 Unified Cache/Mapped RAM, and 192K-Byte Additional L2 Mapped RAM
Device Configuration
– Boot Mode: HPI, 8-, 16-, 32-Bit ROM Boot
– Endianness: Little Endian, Big Endian
16-Bit Host-Port Interface (HPI)
Two Multichannel Audio Serial Ports (McASPs)
Two Inter-Integrated Circuit Bus (I2C Bus) Multi-Master and Slave Interfaces
Two Multichannel Buffered Serial Ports
Two 32-Bit General-Purpose Timers
Dedicated GPIO Module With 16 Pins (External Interrupt Capable)
Flexible Phase-Locked-Loop (PLL) Based Clock Generator Module
IEEE-1149.1 (JTAG) Boundary-Scan-Compatible
Package Options:
– 208-Pin PowerPAD Plastic (Low-Profile) Quad Flatpack (PYP)
– 272-Ball, Ball Grid Array Package (GDP)
0.13-μm/6-Level Copper Metal Process
– CMOS Technology
3.3-V I/Os, 1.2-V Internal (PYP)
3.3-V I/Os, 1.26-V Internal (GDP)

TMS320C67x Block Diagram
Functional block and CPU (DSP core) diagram

TMS320C67x Block Diagram
One instruction is 32 bits. The program bus is 256 bits wide.
– Can execute up to 8 instructions per clock cycle (225 MHz → 4.4 ns clock cycle).
8 independent functional units:
– 2 multipliers
– 6 ALUs
Code is efficient if all 8 functional units are always busy.
The register files each have 16 general-purpose registers, each 32 bits wide (A0-A15, B0-B15).
Data paths are each 64 bits wide.

C6713 Functional Units
Two data paths (A & B)

Data path A
Multiply operations (.M1)
Logical and arithmetic operations (.L1)
Branch, bit manipulation, and arithmetic operations (.S1)
Loading/storing and arithmetic operations (.D1)
Data path B
Multiply operations (.M2)
Logical and arithmetic operations (.L2)
Branch, bit manipulation, and arithmetic operations (.S2)
Loading/storing and arithmetic operations (.D2)
All data (not program) transfers go through .D1 and .D2

Fetch & Execute Packets
C6713 fetches 8 instructions at a time (256 bits)

Definition: “Fetch packet” is a group of 8 instructions fetched at once.
Coincidentally, C6713 has 8 functional units.
Ideally, all 8 instructions would be executed in parallel.
Often this isn’t possible, e.g.:
3 multiplies (only two .M functional units)
Results of instruction 3 needed by instruction 4 (must wait for 3 to
complete)

Execute Packets
Definition: “Execute Packet” is a group of (8 or less) consecutive instructions
in one fetch packet that can be executed in parallel.
The C compiler sets a flag (the parallel bit, or p-bit, of each instruction) to indicate which instructions should run in parallel.
In hand-written assembly you have to do this manually using “||”.

C6713 Instruction Pipeline Overview
All instructions flow through the following steps:
1. Fetch
a) PG: Program address Generate
b) PS: Program address Send
c) PW: Program address ready Wait
d) PR: Program fetch packet Receive
2. Decode
a) DP: Instruction DisPatch
b) DC: Instruction DeCode
3. Execute
a) 10 phases labeled E1-E10
b) Fixed-point processors (C62x/C64x) have only 5 execute phases (E1-E5)
Each step (phase) takes 1 clock cycle.

Pipelining: Ideal Operation
Remarks:
• At clock cycle 11, the pipeline is “full”
• There are no holes (“bubbles”) in the pipeline in this example

Pipelining: “Actual” Operation
Remarks:
• Fetch packet n has 3 execution packets
• All subsequent fetch packets have 1 execution packet
• Notice the holes/bubbles in the pipeline caused by lack of parallelization
Execute Stage of C6713 Pipeline
C67x has 10 execute phases (floating point)
C62x/C64x have 5 execute phases (fixed point)

Different types of instructions require different numbers of these phases to
complete their execution
Anywhere between 1 and all 10 phases
Most instructions tie up their functional unit for only one phase (E1).

Execution Stage Examples (1)
– Example 1: results available after E1 (zero delay slots); functional unit free after E1 (1 functional unit latency).
– Example 2: results available after E4 (3 delay slots); functional unit free after E1 (1 functional unit latency).
– Example 3: results available after E10 (9 delay slots); functional unit free after E4 (4 functional unit latency).
Functional Latency & Delay Slots
Functional Latency: How long must we wait for the functional unit to be free?
Delay Slots: How long must we wait for the result?
General remarks:
Functional unit latency ≤ delay slots + 1 (as in the examples above)
Strange results will occur in ASM code if you don’t pay attention to delay
slots and functional unit latency
All problems can be resolved by “waiting” with NOPs
Efficient ASM code tries to keep functional units busy all of the time.
Efficient code is hard to write (and follow).

DSP TMS 320C6713
Floating-Point Digital Signal Processor Device nomenclature

C6713 DSP Starter Kit (DSK)
The TMS320C6713 DSP Starter Kit (DSK), developed jointly with Spectrum Digital, is a low-cost development platform designed to speed the development of high-precision applications based on TI's TMS320C6000 floating-point DSP generation.
Hardware Features:
– Texas Instruments TMS320C6713 DSP operating at 225 MHz
– Embedded USB JTAG controller with plug-and-play drivers, USB cable included
– TLV320AIC23 codec
– 2M x 32 on-board SDRAM
– 512K bytes of on-board Flash ROM
– 3 expansion connectors (Memory Interface, Peripheral Interface, and Host Port Interface)
– On-board IEEE 1149.1 JTAG connection for optional emulator debug
– Four 3.5 mm audio jacks (microphone, line-in, speaker, and line-out)
– 4 user-definable LEDs
– 4-position DIP switch, user-definable
– +5 V operation only, power supply included
– Size: 8.25" x 4.5" (210 x 115 mm), 0.062" thick, 6 layers
– Compatible with Spectrum Digital's DSK Wire Wrap Prototype Card
Software Features:
– TMS320C6713 DSK-specific Code Composer Studio from Texas Instruments
– Test/sample code provided to reduce coding time
– Compatible with National Instruments LabVIEW Embedded 2.0
– Compatible with JTAG emulators from Spectrum Digital
– Compatible with Windows 2000/XP



Is my DSK working?
DSK Power On Self Test

Power up DSK and watch LEDs
Power On Self Test (POST) program stored in FLASH memory automatically
executes
POST takes 10-15 seconds to complete
All DSK subsystems are automatically tested
During POST, a 1 kHz sinusoid is output from the AIC23 codec for 1 second
Listen with headphones or watch on oscilloscope
If POST is successful, all four LEDs blink 3 times and then remain on.

Interfacing with the Real World
TMS320C6713 DSK:
digital inputs = 4 DIP switches
digital outputs = 4 LEDs
ADC and DAC = AIC23 codec
