
Texas Instruments TMS320C54x DSP
Architecture and Programming
Andrew Fernandez
April 30, 2001

Why DSP?
Growing market
A dedicated ASIC is not always the best option for implementing signal processing
Flexibility

Sample Products
Sony VAIO MC-P10 Music Clip
Nokia 8210 Handset
JVC GR-DVM90 Digital Camcorder

May 2, 2001
C54 Architecture and Programming

Different Goals Generate Different Architectures
Differences from a µP (microprocessor)
  Multiply and Accumulate (MAC) is the typical operation
  Many memory accesses required
  Predictable execution time
Types of Processing
  Continuous / Real-Time
    Limited storage
    Hard constraints
  Offline
    Entire signal stored in memory
    Softer constraints

Von Neumann Architecture
Inefficient for memory-intensive operations
One memory space
Example: 20-tap FIR
  4 memory accesses per tap
  1 parallel MAC
  At least 80 cycles per output (20 taps x 4 accesses)!

Next Obvious Step: Harvard Architecture

Separate program and data memory


We can do better


Modified Harvard Architecture (C54)

Separate program and data memory


Enables parallel memory access (improves w/ DARAM)
May store coefficients in program memory (ROM)

The Popular TMS320C54x
Technology
  1.0 - 5.0 V core
  0.25 µm CMOS
  30-160 MHz
  0.21-0.52 mW/MIPS
  4.0 mW standby
  (Texas Instruments - ISSCC 1997)
Architecture (From 10,000 ft.)
  16-bit fixed-point instructions
  64K/64K data/program
  1 MAC, 1 ALU, 2 accumulators
  8 Auxiliary Registers (ARs)
  DARAM
  Compare, Select, and Store Unit (CSSU) for Viterbi
Source: www.ti.com and ISLPED 2000 Tutorial

C54 Block Diagram
Memory Access
  4 internal bus pairs
  8 Auxiliary Registers (AR0-AR7)
  Address generation
    Circular buffers
    Increment/decrement
Number Crunching
  40-bit accumulators (A and B)
  40-bit barrel shifter
  Temporary register
Dedicated Support
  CSSU (Viterbi)
  Bit reverse (FFT)

C54x Memory, Buses, and Pipeline
[Figure: internal memory and the external memory interface (address and data lines to external memory) connected by four address/data bus pairs]
  Program A/D bus (P)
  Data read A/D bus (D)
  Data read A/D bus (C)
  Data write A/D bus (E)

Pipeline Phases
  P - generate program address
  F - get opcode
  D - decode instruction
  A - generate read address
  R - read operands
  X - execute

Full pipeline:
P F D A R X
  P F D A R X
    P F D A R X
      P F D A R X
        P F D A R X
          P F D A R X
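
The staggered diagram can be reproduced with a short Python sketch. It is illustrative only: the names `pipeline_diagram` and `total_cycles` are made up here, and the model assumes an ideal pipeline in which every phase takes one cycle and no instruction ever stalls.

```python
PHASES = ["P", "F", "D", "A", "R", "X"]  # the six phases listed on the slide


def pipeline_diagram(n_instructions):
    """Return one text row per instruction; row i starts i cycles later."""
    rows = []
    for i in range(n_instructions):
        # Two characters per cycle ("P ", "F ", ...), so indent by 2 * i.
        rows.append("  " * i + " ".join(PHASES))
    return rows


def total_cycles(n_instructions):
    """Cycles to finish n instructions when one instruction enters per cycle."""
    return len(PHASES) + n_instructions - 1


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(6)))
    print("cycles for 6 instructions:", total_cycles(6))
```

Under these assumptions, the first instruction needs six cycles and every additional instruction adds one, which is why a full pipeline approaches one instruction per cycle.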


C541 Memory Maps
(each region starts at the address shown and runs up to the next listed address)

Program, PAGE 0 (64K)
  0000  RAM? (OVLY bit)
  1400  External memory
  9000  Internal or external memory
  FF80  VECTORS (through FFFF)

Data, PAGE 1 (64K)
  0000  MMR / RAM
  1400  External memory
  E000  External memory or internal ROM (DROM bit), through FFFF

I/O, PAGE 2 (64K)
  0000  I/O memory (through FFFF)

Our Generic Data Memory

  0000  MMR
  0060  SPRAM
  0080  DARAM block a (through 03FF)
  0400  SARAM (through 147F)
  1480  External memory (through FFFF)

Ground Zero: Programming
Characteristics of DSP routines
  Short
  Repeated very often
  Time/performance critical
  Assembly!
High-Level Language (C)
  Speed in development and reuse
  Lower development cost
Low-Level Language (Assembly)
  High performance
  Lower product cost

Shorthand Notation

Term   What it means
Smem   16-bit single data memory operand
Xmem   16-bit dual data memory operand used in dual-operand instructions
       and some single-operand instructions. Read through D bus.
Ymem   16-bit dual data memory operand used in dual-operand instructions.
       Read through C bus.
lk     16-bit long constant
dmad   16-bit immediate data memory address (0 - 65,535)
pmad   16-bit immediate program memory address (0 - 65,535).
       This includes extended program memory devices.
src    Source accumulator (A or B)
dst    Destination accumulator (A or B)
PA     16-bit port (I/O) immediate address (0 - 65,535)

C54 Data Addressing
The C54x uses 5 basic data addressing modes:
  Indirect   Uses 16-bit registers as pointers
  Direct     Random access from a specified base address
  Absolute   Specifies the entire 16-bit address
  Immediate  Instruction contains the data operand
  MMR        Accesses memory-mapped registers

Indirect Addressing Options

Option               Syntax        Action                                     Affected by
No modification      *ARn          no modification to ARn
Increment/Decrement  *ARn+         post-increment by 1
                     *ARn-         post-decrement by 1
Indexed              *ARn+0        post-increment by AR0                      AR0
                     *ARn-0        post-decrement by AR0                      AR0
Circular             *ARn+%        post-increment by 1, circular              BK
                     *ARn-%        post-decrement by 1, circular              BK
                     *ARn+0%       post-increment by AR0, circular            BK, AR0
                     *ARn-0%       post-decrement by AR0, circular            BK, AR0
Bit-reversed         *ARn+0B       post-increment ARn by AR0, reverse carry   AR0 (= FFT size/2)
                     *ARn-0B       post-decrement ARn by AR0, reverse carry   AR0 (= FFT size/2)
Pre-modify           *ARn(lk)      *(ARn+lk), ARn unchanged
                     *+ARn(lk)     *(ARn+lk), ARn changed
                     *+ARn(lk)%    *(ARn+lk), ARn changed, circular           BK
                     *+ARn         pre-increment by 1, during write only
Absolute             *(lk)         16-bit lk is used as an absolute address
                                   (see Absolute Addressing)
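
The circular and bit-reversed rows are the least obvious, so here is a small Python model of both. It is a sketch rather than a description of the silicon: the function names are hypothetical, the circular model assumes the buffer starts on a power-of-two boundary larger than BK and that the step never exceeds BK, and BK = 0 is treated as "no wrap".

```python
def circular_step(ar, step, bk):
    """Post-modify 16-bit pointer `ar` by `step` inside a circular buffer of bk words."""
    if bk == 0:
        return (ar + step) & 0xFFFF       # assumption: BK = 0 means no wrap
    n = bk.bit_length()                   # smallest n with 2**n > bk
    mask = (1 << n) - 1
    base = ar & ~mask & 0xFFFF            # buffer start, on a 2**n boundary
    index = (ar & mask) + step            # assumes abs(step) <= bk
    if index >= bk:
        index -= bk                       # stepped past the end
    elif index < 0:
        index += bk                       # stepped before the start
    return base | index


def reverse_bits16(value):
    """Mirror the 16 bits of value (bit 0 <-> bit 15, and so on)."""
    result = 0
    for _ in range(16):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_step(ar, ar0):
    """Add ar0 to ar with the carry running from MSB to LSB (reverse carry)."""
    total = reverse_bits16(ar) + reverse_bits16(ar0)
    return reverse_bits16(total & 0xFFFF)


def bit_reversed_sequence(start, fft_size):
    """Addresses visited by *ARn+0B when AR0 = fft_size / 2."""
    ar, visited = start, []
    for _ in range(fft_size):
        visited.append(ar)
        ar = bit_reversed_step(ar, fft_size // 2)
    return visited
```

For an 8-point FFT starting at address 0, `bit_reversed_sequence(0, 8)` walks the buffer in the order 0, 4, 2, 6, 1, 5, 3, 7. With BK = 20 and a buffer at 0400h, stepping forward from 0413h returns to 0400h.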

Indirect Addressing - *
        LD      *AR1+,A
        STL     A,*AR2+
        ;...

Indirect addressing allows sequential access to arrays
8 address registers (AR0-AR7) can be used as 16-bit pointers to data
ARs can optionally be modified
How do we initialize the ARs?

MMR and Immediate Addressing
start:  STM     #tbl,AR1
        STM     #x,AR2

STM (STore to Memory-mapped register) stores an immediate value to the specified MMR or SPRAM address.
STM writes the value to the register in the access phase of the pipeline to avoid latencies (more later).
#tbl is the 16-bit address of the assembly variable tbl.
Immediate operands, like #tbl, are located in program memory as part of the opcode.
Encoding: word 1 = STM to AR1, word 2 = #tbl (16 bits); 2 words, 2 cycles

Addressable range:
  0000h  MMRs
  0060h  SPRAM (through 007Fh)

Immediate Addressing (Cont.) - #
Short immediate instructions are 1 word, 1 cycle:
        LD      #k5,ASM
        LD      #k8,dst         ;A or B
        LD      #k9,DP
        RPT     #k8
        FRAME   #k8
All other immediate constants are 16 bits and require 2 words, 2 cycles.

Direct Addressing - @
Instruction:  opcode | 7-bit offset
Address:      9-bit DP | 7-bit offset   (16 bits)

Direct addressing allows random, single-cycle access to 128 locations positively offset from a base address.
The direct 16-bit address is formed by concatenating the base address (DP) with the 7-bit offset contained in the instruction.
How is the Data Page (DP) initialized?

Generating Direct Addresses
        LD      #x,DP
        LD      @x+1,A
        ADD     @x,A
        ADD     @x+2,A

The first instruction loads the upper 9 bits of address x into DP (located in ST0) in a single cycle.

16-bit address of x   0000 0000 1 000 0101 = 85h
LD  #x,DP             DP = 0000 0000 1 (data page 1, base address = 80h)
LD  @x+1,A            0000 0000 1 000 0110 = 86h
ADD @x,A              0000 0000 1 000 0101 = 85h
ADD @x+2,A            0000 0000 1 000 0111 = 87h
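
The concatenation of DP and offset can be checked with a few lines of Python. This is a minimal sketch with hypothetical helper names (`data_page`, `direct_address`, `resolve`); it models only the DP-based form described on these slides.

```python
def data_page(address):
    """Upper 9 bits of a 16-bit data address: the value LD #x,DP loads."""
    return (address >> 7) & 0x1FF


def direct_address(dp, offset):
    """Concatenate the 9-bit DP with the 7-bit offset from the instruction."""
    return ((dp & 0x1FF) << 7) | (offset & 0x7F)


def resolve(dp, symbol_address):
    """Address reached by @symbol: only the low 7 bits of the symbol are encoded."""
    return direct_address(dp, symbol_address & 0x7F)


if __name__ == "__main__":
    x = 0x85                      # address of x from the slide
    dp = data_page(x)             # data page 1, base address 80h
    for name, addr in (("@x+1", x + 1), ("@x", x), ("@x+2", x + 2)):
        print(name, hex(resolve(dp, addr)))
```

In this model `resolve(0, 0x85)` yields 05h rather than 85h, which illustrates why DP has to be loaded with the page of the variable before `@` is used.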

Absolute Addressing - *( )
        STL     A,*(y)

Guarantees access to any location in the memory map by supplying the entire 16-bit address
Uses the indirect hardware to generate the address, hence the asterisk (*)
Always a MINIMUM of 2 words, 2 cycles

Dual Operand Instructions (X,Y)
Require less code
Execute faster
Dual-operand addressing allows only certain pointers and modes:
  Pointers: AR2, AR3, AR4, AR5
  Modes: *ARn, *ARn+, *ARn-, *ARn+0%
  Modifiers: BK + AR0
Since the only indexed mode offered is circular, a regular index is only available if BK is set to 0 or made very large, e.g., FFFFh.

Example Program: FIR Filter
[Figure: tapped delay line; inputs x0, x1, x2, x3, ... are separated by z^-1 delays, weighted by a0, a1, a2, a3, ... and summed into y0]

$y_0 = \sum_{n=0}^{19} a_n x_n$
y0 = a0*x0 + a1*x1 + a2*x2 + ... + a19*x19

20-tap FIR implementation
Our goal is to compute one output (y0)
First, let's set up the link.cmd file and memory sections...
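
As a reference for the assembly that follows, the sum can be written directly in Python. This is a plain floating/integer model with no fixed-point effects; `fir_output` is a name chosen here, and the input samples in the example are hypothetical.

```python
def fir_output(a, x):
    """Compute one FIR output: y0 = sum of a[n] * x[n] over all taps."""
    if len(a) != len(x):
        raise ValueError("need one input sample per coefficient")
    acc = 0
    for a_n, x_n in zip(a, x):
        acc += a_n * x_n          # one multiply-accumulate per tap
    return acc


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5] * 4       # 20 coefficients (illustrative values)
    x = [1] * 20                  # hypothetical input samples
    print(fir_output(a, x))
```

Each loop iteration corresponds to one MAC, so a 20-tap filter needs 20 multiply-accumulates per output sample.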

Coding Environment

Overview
  ROM:     code, init_a[20]
  InRAM:   x[20]
  OutRAM:  y[1]
  SPRAM:   a[20]

Link.cmd
lab1.obj
-o lab1.out
-m lab1.map
MEMORY {
    PAGE 1: /* Data Memory */
        SPRAM:  org=00060h len=0020h
        InRAM:  org=00400h len=0400h
        OutRAM: org=00800h len=0400h
    PAGE 0: /* Program Memory */
        ROM:    org=0F000h len=0F80h
}
SECTIONS {
    code   :> ROM    PAGE 0
    init   :> ROM    PAGE 0
    input  :> InRAM  PAGE 1
    output :> OutRAM PAGE 1
    coeff  :> SPRAM  PAGE 1
}

FIR.asm
x       .usect  "input",20
a       .usect  "coeff",20
y       .usect  "output",1
        .sect   "init"
init_a  .int    1,2,3,4,5
        .int    1,2,3,4,5
        .int    1,2,3,4,5
        .int    1,2,3,4,5
        .mmregs
        .sect   "code"

Processing Loop
FIR.asm
fir:
math:   MAC     *AR2+,*AR3+,A
done:

Two methods may be used to find y0:
1. Multiply, then add
        MPY     *AR2+,*AR3+,B
        ADD     B,A
2. Multiply/Accumulate
        MAC     *AR2+,*AR3+,A

Dual-operand instructions must use:
  AR2, AR3, AR4, AR5
  Modifiers: none, +, -, +0%

Initialize Pointers
FIR.asm
fir:    STM     #a,AR2
        STM     #x,AR3
math:   MAC     *AR2+,*AR3+,A
done:

Coefficients (AR2): a0, a1, a2, ...
Input data (AR3):   x0, x1, x2, ...

STM
  Stores #value to the MMR early in the pipeline to avoid latencies
  2 words, 2 cycles

Load Accumulator
FIR.asm
fir:    STM     #a,AR2
        STM     #x,AR3
        LD      #0,A
math:   MAC     *AR2+,*AR3+,A
done:

We must first initialize A using a load instruction.
LD source, [leftshift,] dst
  source:    constant or memory location
  leftshift: none, T[5:0] (use TS), or constant (-16 to +16)
             Ex: LD @x,16,A
  dst:       A, B, T, DP, ASM

Accumulator A:  G (bits 39-32) | H (bits 31-16) | L (bits 15-0)

LD:
  Loads dst[15:0] by default
  May be 1 or 2 cycles

Store Result
FIR.asm
fir:    STM     #a,AR2
        STM     #x,AR3
        LD      #0,A
math:   MAC     *AR2+,*AR3+,A
        STL     A,*(y)
done:

Memory is 16 bits wide, so we must specify which part of the result to store
STL/STH source, [leftshift,] dst
  source:    accumulators A, B
  leftshift: none, ASM, or constant (-16 to 15)
             Ex: STL B,-8,*AR5-
  dst:       any memory location
STL/STH may be 1 or 2 cycles

Accumulator A:  G (bits 39-32) | H (bits 31-16) | L (bits 15-0)
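
The G/H/L split and the effect of the optional shift can be modeled in Python as follows. This is a sketch under stated assumptions: the helper names are hypothetical, negative shifts are modeled as arithmetic (sign-preserving) right shifts, and status-bit effects on sign extension are ignored.

```python
MASK40 = (1 << 40) - 1


def to_acc40(value):
    """Wrap a Python int into a 40-bit two's-complement accumulator pattern."""
    return value & MASK40


def signed40(acc):
    """Interpret a 40-bit pattern as a signed integer."""
    return acc - (1 << 40) if acc & (1 << 39) else acc


def fields(acc):
    """Split the accumulator into (guard, high, low) = bits 39-32, 31-16, 15-0."""
    return (acc >> 32) & 0xFF, (acc >> 16) & 0xFFFF, acc & 0xFFFF


def _shifted(acc, shift):
    value = signed40(acc)
    # Assumption: a negative shift is an arithmetic right shift.
    return value << shift if shift >= 0 else value >> -shift


def stl(acc, shift=0):
    """Model STL src,shift,dst: shift, then keep bits 15-0."""
    return _shifted(acc, shift) & 0xFFFF


def sth(acc, shift=0):
    """Model STH src,shift,dst: shift, then keep bits 31-16."""
    return (_shifted(acc, shift) >> 16) & 0xFFFF
```

For the pattern 12_3456_789Ah, `stl` returns 789Ah, `sth` returns 3456h, and `stl(acc, -8)` returns 5678h, the 16 bits that sit one byte higher in the accumulator.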

Streamline Loops
FIR.asm
fir:    STM     #a,AR2
        STM     #x,AR3
        LD      #0,A
        RPT     #(20-1)
math:   MAC     *AR2+,*AR3+,A
        STL     A,*(y)
done:

Execute the next instruction n+1 times:
  1. RPT #n
  2. RPT Smem
  3. RPTZ src,#n
RPT: 1 or 2 cycles
RPTZ: Clears the ACC before repeating. Always 2 words, 2 cycles

Execute the next block of instructions n+1 times:
        STM     #n,BRC
        then...
        RPTB    done-1

Copy Coefficients
FIR.asm
fir:    STM     #a,AR2
        RPT     #(20-1)
        MVPD    #init_a,*AR2+
        STM     #a,AR2
        STM     #x,AR3
        LD      #0,A
        RPT     #(20-1)
math:   MAC     *AR2+,*AR3+,A
        STL     A,*(y)
done:

Copy values from one memory location to another:
        MVPD    #pmad, Smem
  PC points into init_a (1, 2, 3, ...); AR2 points into a
  PC = PC+1 every access

Move instructions:
  Prog <-> Data   MVPD, MVDP, READA, WRITA
  MMR  <-> Data   MVMD, MVDM
  Data <-> Data   MVKD, MVDK, MVDD
  MMR  <-> MMR    MVMM
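
The complete routine can be followed step by step in a small Python model. It is a behavioral sketch only: memory is represented by Python lists, the pointer registers by list indices, the function names are invented for this example, and pipeline timing and overflow are ignored.

```python
def mvpd_copy(program_table, count):
    """Model STM #a,AR2 / RPT #(count-1) / MVPD #init_a,*AR2+."""
    a = []
    source = 0                          # program-memory pointer, +1 per access
    for _ in range(count):              # RPT #n repeats the next instruction n+1 times
        a.append(program_table[source])  # write through *AR2+
        source += 1
    return a


def rpt_mac(a, x, repeat_count):
    """Model LD #0,A / RPT #repeat_count / MAC *AR2+,*AR3+,A."""
    acc = 0                             # LD #0,A
    ar2 = ar3 = 0                       # STM #a,AR2 and STM #x,AR3
    for _ in range(repeat_count + 1):
        acc += a[ar2] * x[ar3]          # MAC
        ar2 += 1                        # *AR2+
        ar3 += 1                        # *AR3+
    return acc


def fir_program(init_a, x, taps=20):
    """Run the whole routine and return the word stored by STL A,*(y)."""
    a = mvpd_copy(init_a, taps)
    acc = rpt_mac(a, x, taps - 1)
    return acc & 0xFFFF                 # STL keeps bits 15-0


if __name__ == "__main__":
    init_a = [1, 2, 3, 4, 5] * 4        # table from the "init" section
    x = list(range(20))                 # hypothetical input samples
    print(fir_program(init_a, x))
```

Note that the repeat count is `taps - 1` in both loops, mirroring `RPT #(20-1)`.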

Program Flow
FIR.asm
fir:    STM     #a,AR2
        RPT     #(20-1)
        MVPD    #init_a,*AR2+
        STM     #a,AR2
        STM     #x,AR3
        LD      #0,A
        RPT     #(20-1)
math:   MAC     *AR2+,*AR3+,A
        STL     A,*(y)
done:   RET                     ; - or -   B done

Implementing a subroutine requires:
  CALL  fir             2w, 4c
  RET                   1w, 4c

Other program flow instructions:
  B     next            2w, 4c
  BACC  src             1w, 6c
  CALA  src             1w, 6c

Conditional program flow:
  BC    next,cnd, ...   2w, 3c/5c
  CC    next,cnd, ...   2w, 3c/5c
  RC    cnd, ...        1w, 3c/5c

Conditions: 3 max w/ restrictions, ANDed. Ex: CC fir, AEQ, AOV
  A/B: EQ, NEQ, LEQ, GEQ, LT, GT, OV, NOV
  TC, NTC, C, NC, BIO, NBIO

Fixed Point Processing
Q-point notation marks the placement of the binary point in a fixed-width number
40-bit accumulators used to prevent overflow (usually)
Fractional multiplication to retain most of the data in the product
Saturation and rounding
Sign extension

Handling Accumulator Overflow

Accumulator A or B:  Guard (bits 39-32) | High (bits 31-16) | Low (bits 15-0)

Guard bits increase dynamic range from +/-1 to +/-128
Use guard bits (allow at least 128 signed summations)
In a non-gain system temporary overflow is permitted. The output is guaranteed to remain bounded by the input.
In a system with gain, the output is not guaranteed to remain bounded (i.e., the result is larger than 32 bits).
How do you handle a result larger than 32 bits?
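
One common answer is saturation, which the Fixed Point Processing slide lists among the available tools. The Python sketch below shows the idea; the function names are hypothetical, and it models only the arithmetic, not the status bits or instructions that enable saturation on the device.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def fits_in_40_bits(value):
    """True if value is representable in a 40-bit signed accumulator."""
    return -2**39 <= value < 2**39


def saturate32(value):
    """Clamp an accumulator value to the 32-bit signed range."""
    return max(INT32_MIN, min(INT32_MAX, value))


if __name__ == "__main__":
    # 128 products of two full-scale negative 16-bit values.
    total = sum((-32768) * (-32768) for _ in range(128))
    print(fits_in_40_bits(total))     # the guard bits absorb the growth
    print(hex(saturate32(total)))     # clamped before a 32-bit store
```

The sum of 128 full-scale products is 2^37: it no longer fits in 32 bits, but it is still held exactly in the 40-bit accumulator, so the clamp can be applied once at the end instead of after every addition.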

Fractional Multiplication

    .9     value
  x .9     times value
    .81    yields double-size result
    .8     result to be stored

We assume F*F < 1
Notice that most of the information is retained
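
The same idea in 16-bit Q15 arithmetic can be sketched in Python. The helper names `to_q15` and `q15_mul` are chosen for this example; the model truncates rather than rounds and assumes the product of the two inputs stays below 1, as the slide does.

```python
def to_q15(value):
    """Quantize a fraction in [-1, 1) to a 16-bit Q15 integer (truncating)."""
    return int(value * 32768)


def q15_mul(a, b):
    """Multiply two Q15 numbers and keep a 16-bit Q15 result."""
    product = a * b                # double-size result (Q30, two sign bits)
    return (product << 1) >> 16    # drop the redundant sign bit, keep the upper 16 bits


if __name__ == "__main__":
    result = q15_mul(to_q15(0.9), to_q15(0.9))
    print(result / 32768)          # close to 0.81
```

Only the low half of the double-size product is discarded, so the stored value differs from the exact product by less than one part in 2^15.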

Fractional Multiplication (Cont.)

Fractional model
      0100
    x 1101
  --------
  00000100
  0000000
  000100
  11100
  --------
  11110100

ACC:  1111 0100
mem:  1110   (bit weights: -1, 1/2, 1/4, 1/8)
Store 1.110 (-1/4) to memory

How is the redundant sign bit eliminated?
        STH     A,1,*AR0        ;MANUAL
  - OR -
        SSBX    FRCT            ;AUTO
        STH     A,*AR0
FRCT shifts multiply results left by 1

The tools do not support fractions
To store 0.707 use:
  32767 = 7FFFh = ~1
  a0 .int 32768*707/1000
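
The 4-bit example and the coefficient expression can be verified with a short Python sketch. The function names are hypothetical, and the coefficient helper assumes the assembler evaluates `32768*707/1000` with integer division.

```python
def signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def frac_mul4(a_bits, b_bits):
    """Multiply two 4-bit fractions; return (8-bit ACC pattern, 4-bit stored word)."""
    product = signed(a_bits, 4) * signed(b_bits, 4)  # double-size result
    acc = product & 0xFF
    stored = ((product << 1) >> 4) & 0xF             # shift left by 1, keep the upper 4 bits
    return acc, stored


def coeff_q15(numerator, denominator):
    """Integer form of numerator/denominator scaled by 32768 (assumes integer division)."""
    return 32768 * numerator // denominator


if __name__ == "__main__":
    acc, stored = frac_mul4(0b0100, 0b1101)          # 1/2 times -3/8
    print(format(acc, "08b"), format(stored, "04b"))
    print(coeff_q15(707, 1000))
```

The exact product is -3/16; after the shift and truncation to 4 bits, the stored word 1110 represents -1/4, matching the slide.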

Implementing Circular Buffering
math:   MAC     *AR2+,*AR3+0%,A

Coefficients (AR2):  a0, a1, a2, ...
Input buffer (AR3):  x[0], x[1], x[2], ..., x[n]   (BK spans the buffer)

Circular addressing is modulo
First, define the buffer size using BK:
        STM     #(N+1),BK
The % modifier indicates circular addressing and is available for all ARs
Why was +0% used?
Because we are forced to use +0%, how do we make it look like +%?
        STM     #1,AR0

Circular Buffer Alignment

Circular buffer: x[0], x[1], x[2], ..., x[19] (BK = 20), aligned to 32

Circular buffers must be aligned on the next 2^n boundary greater than BK.
On what boundary should a block size of 20 be aligned?
        .usect  "coeff",20

How? Use the align argument in the linker command file:
SECTIONS{
    coeff :> DARAM align(32) PAGE 1
}

The linker will attempt to fill unused memory locations
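
The alignment rule is easy to compute. The following Python sketch uses made-up helper names and simply applies the rule stated above: the smallest power of two strictly greater than BK.

```python
def circular_alignment(bk):
    """Smallest power of two strictly greater than the buffer size bk."""
    return 1 << bk.bit_length()


def is_aligned(address, bk):
    """True if `address` is a valid start for a circular buffer of bk words."""
    return address % circular_alignment(bk) == 0


if __name__ == "__main__":
    print(circular_alignment(20))      # boundary for a 20-word buffer
    print(is_aligned(0x0060, 20))      # candidate start address 0060h
    print(is_aligned(0x0070, 20))      # candidate start address 0070h
```

For a block size of 20 the boundary is 32, which matches the `align(32)` in the linker fragment. An address such as 0060h (a multiple of 32) qualifies, while 0070h does not.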

Pipeline Issues
Most 'C54x code requires no special attention
Some MMR writes require care (MMR reads are not a problem)
Latency requirements are resolved via the latency tables
Step through code and use NOPs to resolve conflicts
Example: FIR setup code

Analysis: typical C54x system code
  C code: no problem
  ASM code
    CALU operations: no problem
    MMR writes
      Early writes (early: the write occurs at least 6 cycles prior to a read)
      All other MMR writes: use latency tables

References
[1] TMS320C54x User's Guide, available from the Texas Instruments Literature Response Center.
[2] TMS320C54x DSP Design Workshop, Texas Instruments Technical Training.
[3] S. W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, San Diego: California Technical Publishing, 1999.
[4] Ingrid Verbauwhede, Dave Garrett, Low-Power DSPs for Wireless Communications, ISLPED 2000.
