
unit-5 DSP Processors

UNIT V

DIGITAL SIGNAL PROCESSORS

Digital Signal Processors: Introduction to programmable DSP processors – Von Neumann architecture – Harvard
architecture – VLIW architecture – MAC unit – pipelining – Special addressing modes in P-DSPs – On-chip
peripherals, P-DSPs with RISC and CISC – Architecture and addressing modes of TMS320C50 and TMS320C6X.

Digital signal processing (DSP) is the technique of performing mathematical operations on signals in the digital
domain. Digital Signal Processors (DSPs) are microprocessors with the following characteristics:

a) Real-time digital signal processing capabilities. DSPs typically have to process data in real time, i.e.,
the correctness of the operation depends heavily on the time at which the data processing is completed.
b) High throughput. DSPs can sustain processing of high-speed streaming data, such as audio and
multimedia data processing.
c) Deterministic operation. The execution time of DSP programs can be predicted accurately, thus
guaranteeing a repeatable, desired performance.
d) Re-programmability by software. Different system behaviour can be obtained by re-coding the algorithm
executed by the DSP instead of by modifying the hardware.

DSPs appeared on the market in the early 1980s. Over the last 15 years they have been the key enabling technology
for many electronic products in fields such as communication systems, multimedia, automotive, instrumentation
and military. Fig. 1 gives an overview of the evolution of DSP features together with the first year of marketing for
some DSP families.

Fig. 1. Evolution of DSP features from their early days until now.

Table 1 gives an overview of some of these fields and of the corresponding typical DSP applications.

DSP vs. General Purpose MPU


 DSPs tend to run a single program, not many programs.
o Hence operating systems are much simpler: there is no virtual memory or memory protection, ...
 DSPs sometimes run hard real-time applications.
o You must account for anything that could happen in a time slot.
o All possible interrupts or exceptions must be accounted for, and their collective time subtracted
from the time interval.
o Therefore, exceptions are BAD!
 DSPs process an effectively unbounded, continuous data stream.
 The “MIPS/MFLOPS” of DSPs is the speed of the Multiply-Accumulate (MAC) operation.
o DSPs are judged by whether they can keep the multipliers busy 100% of the time.
 The "SPEC" of DSPs is four algorithms:
o Infinite Impulse Response (IIR) filters
o Finite Impulse Response (FIR) filters
o FFT, and
o convolvers
 In DSPs, algorithms are king!
o Binary compatibility is not an issue.
 Software is not (yet) king in DSPs.
o People still write in assembly language for a product to minimize the die area for ROM in the DSP
chip.
TYPES OF DSP PROCESSORS
 DSP Multiprocessors on a die
o TMS320C80
o TMS320C6000
 32-BIT FLOATING POINT
o TI TMS320C4X
o MOTOROLA 96000
o AT&T DSP32C
o ANALOG DEVICES ADSP21000
 16-BIT FIXED POINT
o TI TMS320C2X
o MOTOROLA 56000
o AT&T DSP16
o ANALOG DEVICES ADSP2100
DSP vendors
• Analog Devices (ADI),
• Freescale (formerly Motorola),
• Texas Instruments (TI),
• Renesas,
• Microchip and VeriSilicon

I. Introduction to programmable DSP processors

Programmable digital signal processors (PDSPs) are general-purpose microprocessors designed specifically for
digital signal processing (DSP) applications. They contain special instructions and special architectural support so
that computation-intensive DSP algorithms execute more efficiently. PDSPs are designed mainly for embedded DSP
applications. As such, the user may never realize the existence of a PDSP in an information appliance. Important
applications of PDSPs include

 modem
 hard drive controller
 cellular phone data pump
 set-top box, etc.
PDSPs fall between the general-purpose microprocessor and the custom-designed, dedicated chip set. The former
has the advantage of ease of programming and development. However, it often suffers from
disappointing performance for DSP applications due to overheads incurred in both the architecture and the
instruction set. Dedicated chip sets, on the other hand, lack the flexibility of programming. In addition, developing a
dedicated chip may delay the time to market more than coding a program for a programmable device.

1.1 Basic Architectural Features

A programmable DSP device should provide instructions similar to those of a conventional microprocessor. The
instruction set of a typical DSP device should include the following:

a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY, etc.
b. Logical operations such as AND, OR, NOT, XOR, etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation

In addition to the above provisions, the architecture should also include:

a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)

1.2 DSP Computational Building Blocks

Each computational block of the DSP should be optimized for functionality and speed. At the same time, the
design should be sufficiently general that it can be easily integrated with other blocks to implement overall DSP
systems.
o Multipliers
o Parallel Multipliers
o Multipliers for Signed Numbers
o Speed
o Bus Widths
o Shifters
o Barrel Shifters
1.3 Multiply and Accumulate Unit
o Overflow and Underflow
 Shifters
 Guard bits
 Saturation Logic
1.4 Arithmetic and Logic Unit
o Status Flags
o Overflow Management
o Register File
1.5 Bus Architecture and Memory
o On-chip Memories
 Speed

 Size
o Organization of On-chip Memories
1.6 Data Addressing Capabilities
o Immediate Addressing Mode
o Register Addressing Mode
o Direct Addressing Mode
o Indirect Addressing Mode
1.7 Special Addressing Modes
o Circular Addressing Mode
o Bit Reversed Addressing Mode
1.8 Address Generation Unit
1.9 Programmability and program Execution
o Program Control
o Program Sequencer

PDSP core architecture


1. Introduction
2. Fast data access
– High-bandwidth memory architectures
– Specialized addressing modes
– Direct Memory Access (DMA) controller
3. Fast computation
– MAC-centred
– Instruction pipelining
– Parallel architectures
4. Numerical fidelity
5. Fast-execution control

1. Introduction
DSP architecture has been shaped by the requirements of predictable and accurate real-time digital signal
processing. An example is the Finite Impulse Response (FIR) filter, with the corresponding mathematical equation
(1), where y is the filter output, x is the input data and a is a vector of filter coefficients. Depending on the
application, there might be just a few filter coefficients or many hundreds or more.

$$ y(n) = \sum_{k=0}^{N-1} a_k \, x(n-k) \tag{1} $$

Here N denotes the number of filter coefficients.
As shown in Eq. (1), the main component of a filter algorithm is the ‘multiply and accumulate’
operation, typically referred to as MAC. Coefficient data have to be retrieved from memory and the whole
operation must be executed in a predictable and fast way, so as to sustain a high throughput rate. Finally, high
accuracy should typically be guaranteed. These requirements are common to many other algorithms performed in
digital signal processing, such as Infinite Impulse Response (IIR) filters and Fourier Transforms. Table 2 shows a
selection of processing requirements together with the main DSP hardware features satisfying them.

Table 2. Main requirements and corresponding DSP hardware implementations for predictable and accurate
real-time digital signal processing.

Processing requirement       Hardware implementations satisfying the requirement
1. Fast data access          High-bandwidth memory architectures; specialized addressing modes;
                             Direct Memory Access (DMA)
2. Fast computation          MAC-centred; pipelining; parallel architectures (VLIW, SIMD)
3. Numerical fidelity        Wide accumulator registers, guard bits, etc.
4. Fast execution control    Hardware-assisted, zero-overhead loops, shadow registers, etc.
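
The multiply-and-accumulate structure of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not DSP code: the function name `fir_filter` and the choice to treat input samples before the start of the data as zero are assumptions made for the example.

```python
def fir_filter(x, a):
    """Direct-form FIR filter: y[n] = sum over k of a[k] * x[n - k].

    Samples with a negative index (before the start of x) are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator, cleared before each output sample
        for k in range(len(a)):
            if n - k >= 0:
                acc += a[k] * x[n - k]  # one multiply-accumulate (MAC) per tap
        y.append(acc)
    return y


# A 3-tap filter with unit coefficients applied to a short ramp.
print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # prints [1, 3, 6, 9]
```

Each output sample costs one MAC per coefficient, which is why the hardware features of Table 2 concentrate on making that single operation, and the data fetches that feed it, as fast as possible.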

2. Fast data access


 Fast data access refers to the need to transfer data to / from memory or DSP peripherals, as well as to
retrieve instructions from memory.
 The hardware implementations considered are:
a. High-bandwidth memory architectures
b. Specialized addressing modes
c. Direct Memory Access (DMA) controller
2.1. High-bandwidth memory architectures

• Traditional general-purpose microprocessors are based upon the Von Neumann architecture, in which
program and data share a single memory and bus.
• Disadvantage:
– only one memory access per instruction cycle is possible

• DSPs are typically based upon the Harvard architecture, which has separate program and data memories
and buses.
• This allows fetching program instructions and data at the same time, thus providing better performance at the
price of increased hardware complexity and cost.

Super-Harvard architecture

• The Harvard architecture can be improved by adding to the DSP core a small bank of fast memory, called
‘instruction cache’, and allowing data to be stored in the program memory.
• The last-executed program instructions are relocated at run time in the instruction cache.

• A more recent improvement of the Harvard architecture is the presence of a ‘data cache’, namely a fast memory
located close to the DSP core which is dynamically loaded with data.

• In the TI TMS320C6713, for example, the L1 cache comprises 8 kbyte of memory divided into 4 kbyte of
program cache and 4 kbyte of data cache.
• The L2 cache comprises 256 kbyte of memory divided into 192 kbyte of mapped SRAM and 64
kbyte of dual cache memory.
• The latter can be configured as mapped memory, cache or a combination of the two.
• Advantages:
• having the cache memory very close to the DSP core allows clocking it at high speed, as
routing wire delays are short.
• cache memories improve the average system performance.
• Drawback: cache hits are not fully predictable.
• A cache miss can be caused, for instance, by a change of program flow due to branch instructions.

DSP hierarchical memory architecture



• A hierarchical memory allows one to take advantage of both the speed and the capacity of different
memory types.
– Registers are banks of very fast internal memory, typically with single-cycle access time. They are
a precious DSP resource used for temporary storage of coefficients and intermediate processing
values.
– The L1 cache is typically high-speed static RAM built from five or six transistors per bit cell. The
amount of L1 cache available thus depends directly on the available chip space.
– An L2 cache typically needs fewer transistors per bit and hence can be present in larger
quantities inside the DSP. Recent years have also seen the integration of DRAM memory blocks
into the DSP chip, thus guaranteeing larger internal memories with relatively short access times.
– Level 3 (L3) memory is rarely present in DSPs, while external memory is typically
available. This is often a large memory with long access times.

2.2. Specialized addressing modes


• DSPs include specialized addressing modes and corresponding hardware support to allow a rapid
access to instruction operands through rapid generation of their location in memory.
Program sequencer and address generator units location within a generic DSP core

• Program Sequencer block


– manages program structure and program flow by supplying addresses to memory for instruction
fetches.
• Address generator blocks
– control the address generation for specialized addressing modes such as indexed addressing,
circular buffers, and bit-reversed addressing.
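
To illustrate what the address generators compute, the sketch below models circular and bit-reversed addressing in Python. The helper names `circular_next` and `bit_reverse` are hypothetical and exist only for this example. On a DSP the same addresses are produced by dedicated hardware, in parallel with the arithmetic units.

```python
def circular_next(index, step, length):
    """Advance an index inside a circular buffer holding `length` entries."""
    return (index + step) % length


def bit_reverse(index, num_bits):
    """Reverse the lowest `num_bits` bits of `index`."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # move the current LSB into the result
        index >>= 1
    return result


# Walk a 4-entry circular buffer: the index wraps from 3 back to 0.
idx = 0
visited = []
for _ in range(6):
    visited.append(idx)
    idx = circular_next(idx, 1, 4)
print(visited)  # prints [0, 1, 2, 3, 0, 1]

# Bit-reversed access order for an 8-point radix-2 FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # prints [0, 4, 2, 6, 1, 5, 3, 7]
```

Circular addressing suits the delay line of an FIR filter, where the oldest sample is overwritten by the newest. Bit-reversed addressing supplies the reordered data access that radix-2 FFTs require.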

2.3. Direct Memory Access (DMA)


• The DMA controller is a second processor working in parallel with the DSP core and dedicated to
transferring information between two memory areas or between peripherals and memory.

• In doing so the DMA controller frees the DSP core for other processing tasks.

3. Fast computation
• MAC-centred
• Pipelining
• Parallel architectures (VLIW, SIMD)

3.1. MAC-centred
 The basic DSP arithmetic processing blocks are
a) many registers;
b) one or more multipliers;
c) one or more Arithmetic Logic Units (ALUs);
d) one or more shifters.
 These blocks work in parallel during the same clock cycle thus optimizing MAC as well as other
arithmetic operations.

Basic DSP arithmetic processing blocks

a) Registers:

– these are banks of very fast memory used to store intermediate processing data. Very often they
are wider than the normal DSP word width, so as to provide a higher resolution during the
processing.
b) Multiplier:
– it can carry out single-cycle multiplications and very often it includes very wide accumulator
registers to reduce round-off or truncation errors.
– As a consequence, truncation and round-off errors will happen only at the end of the data
processing, when the data is stored onto memory.
– Sometimes an adder is integrated in the multiplier unit.
c) ALU:
- it carries out arithmetic and logical operations.
d) Shifters:
- they shift the input value by one or more bits, left or right. A shifter that can shift by several bits in a
single operation is called a barrel shifter and is especially useful in the implementation of floating-point
add and subtract operations.

3.2. Instruction pipelining


• Pipelining consists of dividing the execution of each instruction into stages and executing the stages of
different instructions in parallel.
• The net result is an increased instruction throughput.

The three basic pipelining stages and corresponding actions

1. Fetch. The DSP calculates the address of the next instruction to execute and retrieves the op-code, i.e., the binary
word containing the operands and the operation to be carried out on them.
2. Decode. The op-code is interpreted and sent to the corresponding functional unit, and the operands are
retrieved.
3. Execute. The instruction is executed and the results are written to the registers.

Instruction execution and processing time gain of a pipelined CPU (plot b) with respect to
a non-pipelined one (plot a)

• A pipeline is called fully-loaded if all stages are executed at the same time; this corresponds to the
maximum possible instruction throughput.
• The depth of the pipeline, i.e., the number of stages into which an instruction is divided, can vary from one
processor to another.
• Generally speaking, a deeper pipeline allows the processor to execute faster, hence many processors
subdivide pipeline stages into smaller steps, each one executed in one clock cycle.
• The smaller the step, the faster the processor clock speed can be.
• An example of deep pipeline is the TI TMS320C6713 DSP, which includes four fetch stages, two decode
stages, and up to ten execution stages.

• Drawback:
– Hardware and programming complexity required
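
The throughput gain of a fully-loaded pipeline can be estimated with a simple cycle count. The sketch below is an idealized model: it assumes every stage takes exactly one clock cycle and that there are no stalls or branches. The function names are chosen only for this example.

```python
def cycles_non_pipelined(num_instructions, num_stages):
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """The first instruction fills the pipeline; after that one completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# 10 instructions on the three-stage (fetch, decode, execute) pipeline.
print(cycles_non_pipelined(10, 3))  # prints 30
print(cycles_pipelined(10, 3))      # prints 12
```

For long instruction sequences the ratio of the two counts approaches the number of stages, which is the best case. Real pipelines fall short of it whenever a branch or a data dependency forces a stall.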
3.3 Parallel architectures
• DSP performance can be increased by increasing the parallelism of instruction execution.
• Parallel-enhanced DSP architectures started to appear on the market in the mid 1990s and were based on
o Very Long Instruction Word (VLIW), instruction-level parallelism
o Single-Instruction Multiple-Data (SIMD), data-level parallelism
o or a combination of both.

TI TMS320C6xxx family VLIW architecture

• VLIW architectures are based upon instruction level parallelism, i.e., many instructions are issued at the
same time and are executed in parallel by multiple execution units.
• As a consequence, DSPs based on this architecture are also called ‘multi-issue’ DSP.
• This is an innovative architecture that was first used in the TI TMS320C62xx DSP family.
• In this family, eight 32-bit instructions are packed together into a 256-bit-wide instruction word, which is
fed to eight separate execution units.
• Characteristics of VLIW architectures include simple and regular instruction sets.
• Instruction scheduling is done at compile-time and not at run-time so as to guarantee a deterministic
behaviour.
• Adv:
• it can increase the DSP performance for a wide range of algorithms.
• Additionally, the architecture is potentially scalable, i.e., more execution units could be added to
allow a higher number of instructions to be executed in parallel.
• Disadv:
• high memory use and power consumption required by this architecture.
• From a programmer’s viewpoint, writing assembly code for VLIW architecture is very complex
and the optimization is often better left to the compiler.

SIMD architecture
o only one instruction is issued at a time but the same operation specified by the instruction is performed on
multiple data sets.

• Two 32-bit input registers provide four 16-bit data inputs.
• They are processed in parallel by two separate execution units that carry out the same operation.
• Advantage:
– SIMD can be combined with other architectures. An example is the ADI TigerSHARC DSP, which
has both VLIW and SIMD characteristics.
• Drawbacks:
– not useful for algorithms that process data serially or that contain tight feedback loops.
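
The SIMD arrangement described above, two 32-bit registers each holding two 16-bit values, can be imitated in Python. This is only a sketch of the idea: the helper names are hypothetical, and each 16-bit lane simply wraps around modulo 2^16, as a plain hardware adder without saturation would.

```python
MASK16 = 0xFFFF


def pack16x2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16x2(word):
    """Split a 32-bit word into its (high, low) 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16


def simd_add16x2(a, b):
    """Add the two 16-bit lanes of a and b independently (one 'instruction')."""
    a_hi, a_lo = unpack16x2(a)
    b_hi, b_lo = unpack16x2(b)
    # Each lane is masked separately, so a carry never crosses into the other lane.
    return pack16x2(a_hi + b_hi, a_lo + b_lo)


result = simd_add16x2(pack16x2(1, 2), pack16x2(10, 20))
print(unpack16x2(result))  # prints (11, 22)
```

A single call performs two additions, which is the data-level parallelism SIMD provides. The benefit disappears when each result depends on the previous one, as in the tight feedback loops mentioned above.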

4. Numerical fidelity
• Arithmetic operations such as additions and multiplications are the heart of DSP systems.
• It is thus essential that the numerical fidelity be maximized, i.e., that errors due to the finite number of bits
used in the number representation and in the arithmetic operations be minimized.
• DSPs have many ways to achieve this, ranging from the numeric representation to dedicated hardware
features.

Using Number representation


• fixed point and
• floating point.
Using Hardware
• large accumulator registers, used to hold intermediate and final results of arithmetic operations.
• These registers are several bits (at least four) wider than the normal registers in order to prevent overflow
as much as possible during accumulation operations.
• The extra bits are called guard bits and allow one to retain a higher precision in intermediate computation
steps.
• Flags to indicate that an overflow/underflow has happened are also available.
• These flags are often connected to interrupts, thus allowing exception-handling routines to be called.
Using saturated arithmetic
• a result that would overflow is clamped (saturated) to the maximum, or minimum, value that can be
represented, so as to avoid wrap-around phenomena.
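
The effect of wrap-around, saturation and guard bits can be shown with 16-bit signed values. The sketch below is an illustration under stated assumptions: a 16-bit word is assumed, the function names are hypothetical, and Python's unbounded integers stand in for an accumulator with enough guard bits.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def wrap16(value):
    """Two's-complement wrap-around to 16 bits (no overflow protection)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def saturate16(value):
    """Clamp to the representable 16-bit range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, value))


def accumulate_saturating_each_step(values):
    """Narrow accumulator: every partial sum is saturated to 16 bits."""
    acc = 0
    for v in values:
        acc = saturate16(acc + v)
    return acc


def accumulate_with_guard_bits(values):
    """Wide accumulator: partial sums may exceed 16 bits; saturate only at the end."""
    return saturate16(sum(values))


print(wrap16(30000 + 10000))      # prints -25536 (wrap-around changes the sign)
print(saturate16(30000 + 10000))  # prints 32767

samples = [30000, 10000, -20000]  # the true sum, 20000, fits in 16 bits
print(accumulate_saturating_each_step(samples))  # prints 12767
print(accumulate_with_guard_bits(samples))       # prints 20000
```

The last two lines show why guard bits matter: the intermediate sum overflows even though the final result does not, so an accumulator without headroom loses precision that a wider one retains.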

5. Fast-execution control
• Two important examples show how DSPs can execute control instructions quickly.
• The first example is the zero-overhead hardware loop and refers to the program flow control in loops.
• The second example refers to how DSPs react to interrupts.
1. zero-overhead hardware loop

 Looping is a critical feature in many digital signal processing algorithms.


 An important DSP feature is the implementation by hardware of looping constructs, referred to as
‘zero-overhead hardware loop’.
 This allows DSP programmers to initialize loops by setting a counter and defining the loop bounds,
without spending any software overhead to update and test loop counters or branching back to the
beginning of the loop.
2. Reaction to interrupts
When an interrupt is received and if the interrupt has a sufficiently high priority, the DSP must carry out
the following actions:
a) stop its current activity;
b) save the information related to the interrupted activity (called the context) onto the DSP stack;
c) start servicing the interrupt.
The context of the interrupted activity is restored once the interrupt service routine (ISR) has been executed, and
the previous activity then continues.

Specialized Peripherals for DSPs


• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
On-chip peripherals are often designed for “background” operation, even when the core is powered down.

1.3 TMS320C5x Key Features


Key features of the ’C5x DSPs are listed below. Where a feature is exclusive
to a particular device, the device’s name is enclosed within parentheses and
noted after that feature.

- Compatibility: Source-code compatible with ’C1x, ’C2x, and ’C2xx devices

- Speed: 20-/25-/35-/50-ns single-cycle fixed-point instruction execution
time (50/40/28.6/20 MIPS)

- Power
   o 3.3-V and 5-V static CMOS technology with two power-down modes
   o Power consumption control with IDLE1 and IDLE2 instructions for
     power-down modes

- Memory
   o 224K-word × 16-bit maximum addressable external memory space
     (64K-word program, 64K-word data, 64K-word I/O, and 32K-word
     global memory)
   o 1056-word × 16-bit dual-access on-chip data RAM
   o 9K-word × 16-bit single-access on-chip program/data RAM (’C50)
   o 2K-word × 16-bit single-access on-chip boot ROM (’C50, ’C57S)
   o 1K-word × 16-bit single-access on-chip program/data RAM (’C51)
   o 8K-word × 16-bit single-access on-chip program ROM (’C51)
   o 4K-word × 16-bit single-access on-chip program ROM (’C52)
   o 3K-word × 16-bit single-access on-chip program/data RAM (’C53, ’C53S)
   o 16K-word × 16-bit single-access on-chip program ROM (’C53, ’C53S)
   o 6K-word × 16-bit single-access on-chip program/data RAM (’LC56,
     ’C57S, ’LC57)
   o 32K-word × 16-bit single-access on-chip program ROM (’LC56, ’LC57)


- Central processing unit (CPU)
   o Central arithmetic logic unit (CALU) consisting of the following:
      – 32-bit arithmetic logic unit (ALU), 32-bit accumulator (ACC), and
        32-bit accumulator buffer (ACCB)
      – 16-bit × 16-bit parallel multiplier with a 32-bit product capability
      – 0- to 16-bit left and right data barrel-shifters and a 64-bit
        incremental data shifter
   o 16-bit parallel logic unit (PLU)
   o Dedicated auxiliary register arithmetic unit (ARAU) for indirect
     addressing
   o Eight auxiliary registers

- Program control
   o 8-level hardware stack
   o 4-deep pipelined operation for delayed branch, call, and return
     instructions
   o Eleven shadow registers for storing strategic CPU-controlled registers
     during an interrupt service routine (ISR)
   o Extended hold operation for concurrent external direct memory
     access (DMA) of external memory or on-chip RAM
   o Two indirectly addressed circular buffers for circular addressing

- Instruction set
   o Single-cycle multiply/accumulate instructions
   o Single-instruction repeat and block repeat operations
   o Block memory move instructions for better program and data management
   o Memory-mapped register load and store instructions
   o Conditional branch and call instructions
   o Delayed execution of branch and call instructions
   o Fast return from interrupt instructions
   o Index-addressing mode
   o Bit-reversed index-addressing mode for radix-2 fast-Fourier transforms
     (FFTs)


- On-chip peripherals
   o 64K parallel I/O ports (16 I/O ports are memory-mapped)
   o Sixteen software-programmable wait-state generators for program,
     data, and I/O memory spaces
   o Interval timer with period, control, and counter registers for software
     stop, start, and reset
   o Phase-locked loop (PLL) clock generator with internal oscillator or
     external clock source
   o Multiple PLL clocking option (x1, x2, x3, x4, x5, x9, depending on the
     device)
   o Full-duplex synchronous serial port interface for direct communication
     between the ’C5x and another serial device
   o Time-division multiplexed (TDM) serial port (’C50, ’C51, ’C53)
   o Buffered serial port (BSP) (’LC56, ’C57S, ’LC57)
   o 8-bit parallel host port interface (HPI) (’C57, ’C57S)

- Test/Emulation
   o On-chip scan-based emulation logic
   o IEEE JTAG Standard 1149.1 boundary scan logic (’C50, ’C51, ’C53,
     ’C57S)

- Packages
   o 100-pin quad flat-pack (QFP) package (’C52)
   o 100-pin thin quad flat-pack (TQFP) package (’C51, ’C52, ’C53S,
     ’LC56)
   o 128-pin TQFP package (’LC57)
   o 132-pin bumpered quad flat-pack (BQFP) package (’C50, ’C51, ’C53)
   o 144-pin TQFP package (’C57S)


Chapter 2

Architectural Overview

This chapter provides an overview of the architectural structure of the ’C5x,
which consists of the buses, on-chip memory, central processing unit (CPU),
and on-chip peripherals.

The ’C5x uses an advanced, modified Harvard-type architecture based on the
’C25 architecture and maximizes processing power with separate buses for
program memory and data memory. The instruction set supports data transfers
between the two memory spaces. Figure 2–1 shows a functional block
diagram of the ’C5x.

All ’C5x DSPs have the same CPU structure; however, they have different
on-chip memory configurations and on-chip peripherals.

Topic

2.1 Bus Structure
2.2 Central Processing Unit (CPU)
2.3 On-Chip Memory
2.4 On-Chip Peripherals
2.5 Test/Emulation


Figure 2–1. ’C5x Functional Block Diagram

[Block diagram not reproduced. It shows the program bus and the data bus connecting three groups of blocks:
the on-chip memory (program ROM, data/program SARAM, data DARAM blocks B1 (512 × 16) and B2 (32 × 16),
and data/program DARAM block B0 (512 × 16)); the CPU (program controller, memory-mapped registers, CALU,
PLU, and ARAU); and the peripherals (serial ports, TDM serial port, buffered serial port, timer, host port
interface, and test/emulation).]

On-chip memory sizes shown in the figure, in words:

Device    Program ROM    Data/program SARAM
’C50      2K             9K
’C51      8K             1K
’C52      4K             none
’C53      16K            3K
’LC56     32K            6K
’C57S     2K             6K
’LC57     32K            6K

2.1 Bus Structure


Separate program and data buses allow simultaneous access to program
instructions and data, providing a high degree of parallelism. For example,
while data is multiplied, a previous product can be loaded into, added to, or
subtracted from the accumulator and, at the same time, a new address can be
generated. Such parallelism supports a powerful set of arithmetic, logic, and
bit-manipulation operations that can all be performed in a single machine
cycle. In addition, the ’C5x includes the control mechanisms to manage
interrupts, repeated operations, and function calling.

The ’C5x architecture is built around four major buses:

- Program bus (PB)
- Program address bus (PAB)
- Data read bus (DB)
- Data read address bus (DAB)

The PAB provides addresses to program memory space for both reads and
writes. The PB also carries the instruction code and immediate operands from
program memory space to the CPU. The DB interconnects various elements
of the CPU to data memory space. The program and data buses can work
together to transfer data from on-chip data memory and internal or external
program memory to the multiplier for single-cycle multiply/accumulate
operations.


2.2 Central Processing Unit (CPU)


The ’C5x CPU consists of these elements:

- Central arithmetic logic unit (CALU)


- Parallel logic unit (PLU)
- Auxiliary register arithmetic unit (ARAU)
- Memory-mapped registers
- Program controller

The ’C5x CPU maintains source-code compatibility with the ’C1x and ’C2x
generations while achieving high performance and greater versatility.
Improvements include a 32-bit accumulator buffer, additional scaling
capabilities, and a host of new instructions. The instruction set exploits the
additional hardware features and is flexible in a wide range of applications.
Data management has been improved through the use of new block move
instructions and memory-mapped register instructions. See Chapter 3, Central
Processing Unit (CPU).

2.2.1 Central Arithmetic Logic Unit (CALU)


The CPU uses the CALU to perform 2s-complement arithmetic. The CALU
consists of these elements:

- 16-bit × 16-bit multiplier
- 32-bit arithmetic logic unit (ALU)
- 32-bit accumulator (ACC)
- 32-bit accumulator buffer (ACCB)
- Additional shifters at the outputs of both the accumulator and the product
register (PREG)

For information on the CALU, see Section 3.2, Central Arithmetic Logic Unit
(CALU), on page 3-7.

2.2.2 Parallel Logic Unit (PLU)


The CPU includes an independent PLU, which operates separately from, but
in parallel with, the ALU. The PLU performs Boolean operations or the bit
manipulations required of high-speed controllers. The PLU can set, clear, test, or
toggle bits in a status register, control register, or any data memory location.
The PLU provides a direct logic operation path to data memory values without
affecting the contents of the ACC or PREG. Results of a PLU function are
written back to the original data memory location. For information on the PLU, see
Section 3.3, Parallel Logic Unit (PLU), on page 3-15.


2.2.3 Auxiliary Register Arithmetic Unit (ARAU)


The CPU includes an unsigned 16-bit arithmetic logic unit that calculates
indirect addresses by using inputs from the auxiliary registers (ARs), index
register (INDX), and auxiliary register compare register (ARCR). The ARAU
can autoindex the current AR while the data memory location is being
addressed and can index either by 1 or by the contents of the INDX. As a
result, accessing data does not require the CALU for address manipulation;
therefore, the CALU is free for other operations in parallel. For information on
the ARAU, see Section 3.4, Auxiliary Register Arithmetic Unit (ARAU), on
page 3-17.

2.2.4 Memory-Mapped Registers


The ’C5x has 96 registers mapped into page 0 of the data memory space. All
’C5x DSPs have 28 CPU registers and 16 input/output (I/O) port registers but
have different numbers of peripheral and reserved registers (see Chapter 4,
Memory). Since the memory-mapped registers are a component of the data
memory space, they can be written to and read from in the same way as any
other data memory location. The memory-mapped registers are used for
indirect data address pointers, temporary storage, CPU status and control, or
integer arithmetic processing through the ARAU. For information on registers, see
Section 3.5, Summary of Registers, on page 3-21.

2.2.5 Program Controller


The program controller contains logic circuitry that decodes the operational
instructions, manages the CPU pipeline, stores the status of CPU operations,
and decodes the conditional operations. Parallelism of architecture lets the
’C5x perform three concurrent memory operations in any given machine cycle:
fetch an instruction, read an operand, and write an operand. See Chapter 4,
Program Control, and Chapter 7, Pipeline. The program controller consists of
these elements:

- Program counter
- Status and control registers
- Hardware stack
- Address generation logic
- Instruction register


2.3 On-Chip Memory


The ’C5x architecture contains a considerable amount of on-chip memory to
aid in system performance and integration:

- Program read-only memory (ROM)


- Data/program dual-access RAM (DARAM)
- Data/program single-access RAM (SARAM)

The ’C5x has a total address range of 224K words × 16 bits. The memory
space is divided into four individually selectable memory segments: 64K-word
program memory space, 64K-word local data memory space, 64K-word input/
output ports, and 32K-word global data memory space. For information on the
memory organization, see Chapter 8, Memory.
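
The 224K-word figure is simply the sum of the four segments, as the short check below confirms. The dictionary keys are descriptive labels chosen for this example, and K is taken as 1024 words.

```python
K = 1024  # one K-word = 1024 words

# The four individually selectable segments of the address range, in words.
segments = {
    "program memory": 64 * K,
    "local data memory": 64 * K,
    "I/O ports": 64 * K,
    "global data memory": 32 * K,
}

total_words = sum(segments.values())
total_bytes = total_words * 2  # each word is 16 bits, i.e. 2 bytes

print(total_words // K)      # prints 224
print(total_bytes // 1024)   # prints 448
```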

2.3.1 Program ROM


All ’C5x DSPs carry a 16-bit on-chip maskable programmable ROM (see
Table 1–1 for sizes). The ’C50 and ’C57S DSPs have boot loader code resident
in the on-chip ROM; all other ’C5x DSPs offer the boot loader code as an
option. This memory is used for booting program code from slower external
ROM or EPROM to fast on-chip or external RAM. Once the custom program
has been booted into RAM, the boot ROM space can be removed from program
memory space by setting the MP/MC bit in the processor mode status
register (PMST). The on-chip ROM is selected at reset by driving the MP/MC
pin low. If the on-chip ROM is not selected, the ’C5x devices start execution
from off-chip memory. For information on the program ROM, see Section 8.2,
Program Memory, on page 8-7.

The on-chip ROM may be configured with or without boot loader code. However,
the on-chip ROM is intended for your specific program. Once the program
is in its final form, you can submit the ROM code to Texas Instruments for
implementation into your device. For details on how to submit code to Texas
Instruments to program your ROM, see Appendix F, Submitting ROM Codes
to TI.

2.3.2 Data/Program Dual-Access RAM


All ’C5x DSPs carry a 1056-word × 16-bit on-chip dual-access RAM (DARAM).
The DARAM is divided into three individually selectable memory blocks:
512-word data or program DARAM block B0, 512-word data DARAM block B1,
and 32-word data DARAM block B2. The DARAM is primarily intended to store
data values but, when needed, can be used to store programs as well. DARAM
blocks B1 and B2 are always configured as data memory; however, DARAM
block B0 can be configured by software as data or program memory. The
DARAM can be configured in one of two ways:

- All 1056 words × 16 bits configured as data memory
- 544 words × 16 bits configured as data memory and 512 words × 16 bits
  configured as program memory

DARAM improves the operational speed of the ’C5x CPU. The CPU operates
with a 4-deep pipeline. In this pipeline, the CPU reads data on the third stage
and writes data on the fourth stage. Hence, for a given instruction sequence,
the second instruction could be reading data at the same time the first instruc-
tion is writing data. The dual data buses (DB and DAB) allow the CPU to read
from and write to DARAM in the same machine cycle. For information on
DARAM, see Section 8.3, Local Data Memory, on page 8-15.

2.3.3 Data/Program Single-Access RAM


All ’C5x DSPs except the ’C52 carry a 16-bit on-chip single-access RAM
(SARAM) of various sizes (see Table 1–1). Code can be booted from an off-
chip ROM and then executed at full speed, once it is loaded into the on-chip
SARAM. The SARAM can be configured by software in one of three ways:
- All SARAM configured as data memory
- All SARAM configured as program memory
- SARAM configured as both data memory and program memory

The SARAM is divided into 1K- and/or 2K-word blocks contiguous in address
memory space. All ’C5x CPUs support parallel accesses to these SARAM
blocks. However, one SARAM block can be accessed only once per machine
cycle. In other words, the CPU can read from or write to one SARAM block
while accessing another SARAM block. When the CPU requests multiple
accesses, the SARAM schedules the accesses by providing a not-ready
condition to the CPU and executing the multiple accesses one cycle at a time.
SARAM supports more flexible address mapping than DARAM because
SARAM can be mapped to both program and data memory space simulta-
neously. However, because of simultaneous program and data mapping, an
instruction fetch and data fetch that could be performed in one machine cycle
with DARAM may take two machine cycles with SARAM. For information on
SARAM, see Section 8.3, Local Data Memory, on page 8-15.

2.3.4 On-Chip Memory Protection


The ’C5x DSPs have a maskable option that protects the contents of on-chip
memories. When the related bit is set, no externally originating instruction can
access the on-chip memory spaces. For information on the protection feature,
see subsection 8.2.4, Program Memory Protection Feature, on page 8-14.


2.4 On-Chip Peripherals

All ’C5x DSPs have the same CPU structure; however, they have different on-
chip peripherals connected to their CPUs. The ’C5x DSP on-chip peripherals
available are:

- Clock generator
- Hardware timer
- Software-programmable wait-state generators
- Parallel I/O ports
- Host port interface (HPI)
- Serial port
- Buffered serial port (BSP)
- Time-division multiplexed (TDM) serial port
- User-maskable interrupts

2.4.1 Clock Generator

The clock generator consists of an internal oscillator and a phase-locked loop
(PLL) circuit. The clock generator can be driven internally by a crystal
resonator circuit or driven externally by a clock source. The PLL circuit can
generate an internal CPU clock by multiplying the clock source by a specific
factor, so you can use a clock source with a lower frequency than that of the
CPU. For information, see Section 9.2, Clock Generator, on page 9-7.

2.4.2 Hardware Timer

A 16-bit hardware timer with a 4-bit prescaler is available. This programmable
timer clocks at a rate that is between 1/2 and 1/32 of the machine cycle rate
(CLKOUT1), depending upon the timer’s divide-down ratio. The timer can be
stopped, restarted, reset, or disabled by specific status bits. For information,
see Section 9.3, Timer, on page 9-9.

2.4.3 Software-Programmable Wait-State Generators

Software-programmable wait-state logic is incorporated in ’C5x DSPs, allowing
wait-state generation without any external hardware for interfacing with
slower off-chip memory and I/O devices. This feature consists of multiple
wait-state generating circuits. Each circuit is user-programmable to operate in
different wait states for off-chip memory accesses. For information, see
Section 9.4, Software-Programmable Wait-State Generators, on page 9-13.


2.4.4 Parallel I/O Ports

A total of 64K I/O ports are available; sixteen of these ports are
memory-mapped in data memory space. Each of the I/O ports can be addressed
by the IN or the OUT instruction. The memory-mapped I/O ports can
be accessed with any instruction that reads from or writes to data memory. The
IS signal indicates a read or write operation through an I/O port. The ’C5x can
easily interface with external I/O devices through the I/O ports while requiring
minimal off-chip address decoding circuits. For information, see Section 9.6,
Parallel I/O Ports, on page 9-22.

Table 2–1 lists the number and type of parallel ports available in ’C5x DSPs
with various package types.

2.4.5 Host Port Interface (HPI)

The HPI available on the ’C57S and ’LC57 is an 8-bit parallel I/O port that pro-
vides an interface to a host processor. Information is exchanged between the
DSP and the host processor through on-chip memory that is accessible to both
the host processor and the ’C57. For information, see Section 9.10, Host Port
Interface, on page 9-87.

Table 2–1. Number of Serial/Parallel Ports Available in Different ’C5x Package Types

TMS320          Package   High-Speed    TDM           Buffered      Host Port
Device          ID†       Serial Port   Serial Port   Serial Port   (Parallel)
--------------  --------  ------------  ------------  ------------  ----------
’C50/’LC50      PQ        1             1             –             –
’C51/’LC51      PQ/PZ     1             1             –             –
’C52/’LC52      PJ/PZ     1             –             –             –
’C53/’LC53      PQ        1             1             –             –
’C53S/’LC53S    PZ        2             –             –             –
’LC56           PZ        1             –             1             –
’C57S/’LC57S    PGE       1             –             1             1
’LC57           PBK       1             –             1             1

† PGE is a 20 × 20 × 1.4 mm thin quad flat-pack (TQFP) package
  PJ is a 14 × 20 × 2.7 mm quad flat-pack (QFP) package
  PQ is a 20 × 20 × 3.8 mm bumpered quad flat-pack (BQFP) package
  PZ and PBK are a 14 × 14 × 1.4 mm thin quad flat-pack (TQFP) package


2.4.6 Serial Port


Three different kinds of serial ports are available: a general-purpose serial
port, a time-division multiplexed (TDM) serial port, and a buffered serial port
(BSP). Each ’C5x contains at least one general-purpose, high-speed synchro-
nous, full-duplexed serial port interface that provides direct communication
with serial devices such as codecs, serial analog-to-digital (A/D) converters,
and other serial systems. The serial port is capable of operating at up to one-
fourth the machine cycle rate (CLKOUT1). The serial port transmitter and re-
ceiver are double-buffered and individually controlled by maskable external in-
terrupt signals. Data is framed either as bytes or as words.
Table 2–1 lists the number and type of serial ports available in ’C5x DSPs with
various package types. For information on serial ports, see Section 9.7, Serial
Port Interface, on page 9-23.

2.4.7 Buffered Serial Port (BSP)


The BSP available on the ’C56 and ’C57 devices is a full-duplexed, double-
buffered serial port and an autobuffering unit (ABU). The BSP provides flexibil-
ity on the data stream length. The ABU supports high-speed data transfer and
reduces interrupt latencies.
Table 2–1 lists the number and type of serial ports available in ’C5x DSPs with
various package types. For information, see Section 9.8, Buffered Serial Port
(BSP) Interface, on page 9-53.

2.4.8 TDM Serial Port


The TDM serial port available on the ’C50, ’C51, and ’C53 devices is a full-
duplexed serial port that can be configured by software either for synchronous
operations or for time-division multiplexed operations. The TDM serial port is
commonly used in multiprocessor applications.
Table 2–1 lists the number and type of serial ports available in ’C5x DSPs with
various package types. For information, see Section 9.9, Time-Division Multi-
plexed (TDM) Serial Port Interface, on page 9-74.

2.4.9 User-Maskable Interrupts


Four external interrupt lines (INT1–INT4) and five internal interrupts, a timer
interrupt and four serial port interrupts, are user maskable. When an interrupt
service routine (ISR) is executed, the contents of the program counter are
saved on an 8-level hardware stack, and the contents of eleven specific CPU
registers are automatically saved (shadowed) on a 1-level-deep stack. When
a return from interrupt instruction is executed, the CPU registers’ contents are
restored. For information, see Section 4.8, Interrupts, on page 4-36.


3.2 Central Arithmetic Logic Unit (CALU)


The CALU components, shown in Figure 3–2, consist of the following:

- 16-bit × 16-bit parallel multiplier
- 32-bit 2s-complement arithmetic logic unit (ALU)
- 32-bit accumulator (ACC)
- 32-bit accumulator buffer (ACCB)
- 0-, 1-, or 4-bit left or 6-bit right shifter
- 0- to 16-bit left barrel shifter
- 0- to 16-bit right barrel shifter
- 0- to 7-bit left barrel shifter

3.2.1 Multiplier, Product Register (PREG), and Temporary Register 0 (TREG0)


The 16-bit × 16-bit hardware multiplier can compute a signed or an unsigned
32-bit product in a single machine cycle. All multiply instructions except the
multiply unsigned (MPYU) instruction perform a signed multiply operation in
the multiplier. That is, two numbers being multiplied are treated as
2s-complement numbers, and the result is a 32-bit 2s-complement number.

One input to the multiplier is from memory-mapped temporary register 0
(TREG0), and the other input is from the data bus or the program bus. The
32-bit result from the multiplier is stored in the PREG and is available to the
ALU. The ALU uses the 16-bit words taken from data memory or derived from
an immediate instruction, or the ALU uses the 32-bit result stored in the PREG
to perform arithmetic operations. The ALU can also perform Boolean
operations. The 32-bit result from the ALU is stored in the ACC; the ACC also
supplies the second input to the ALU. Instructions are provided for storing the
high- and low-order accumulator words in memory. The shifters (p-scaler,
prescaler, and postscaler) make it possible for the CALU to perform numerical
scaling, bit extraction, extended-precision arithmetic, and overflow prevention.
These shifters are connected to the output of the PREG and the ACC.

The four product shift modes (PM) at the PREG output are useful for perform-
ing multiply/accumulate operations and fractional arithmetic and for justifying
fractional products. The PM field of status register ST1 specifies the PM shift
mode of the p-scaler:

- If PM = 00₂, the PREG 32-bit output is not shifted when transferred into the
  ALU or stored.

- If PM = 01₂, the PREG output is left-shifted 1 bit when transferred into the
  ALU or stored, and the LSB is zero filled. This shift mode compensates for
  the extra sign bit gained when multiplying two 16-bit 2s-complement
  numbers.


Figure 3–2. Central Arithmetic Logic Unit

[Block diagram not reproduced. Blocks shown: Data Bus, Program Bus, TREG0,
TREG1(5), Multiplier, PREG(32), P-SCALER (–6, 0, 1, 4), PRESCALER SFL(0–16),
PRESCALER SFR(0–16), ALU(32), carry bit C(1), ST1, ACCH, ACCL, ACCB(32),
POSTSCALER (0–7).]

Note: All registers and data lines are 16 bits wide unless otherwise specified.

- If PM = 10₂, the PREG output is left-shifted 4 bits when transferred into the
  ALU or stored, and the 4 LSBs are zero filled. This shift mode is used in
  conjunction with the MPY instruction with a short immediate value (13 bits
  or less) to eliminate the four extra sign bits gained when multiplying a
  16-bit number times a 13-bit number.

- If PM = 11₂, the PREG output is right-shifted 6 bits, sign extended, when
  transferred into the ALU or stored, and the 6 LSBs are lost. This shift mode
  enables the execution of up to 128 consecutive multiply/accumulates
  without the possibility of overflow. Note that the product is always sign
  extended, regardless of the value of the sign extension mode (SXM) bit in ST1.
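
The four PM modes can be summarized with a small behavioral model. The sketch below is illustrative only: the function names `p_scaler` and `to_signed32` are hypothetical, and the model covers only the shift applied between the PREG and the ALU, not instruction timing.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a 2s-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def p_scaler(preg, pm):
    """Return the 32-bit value seen by the ALU for a given PREG and PM field."""
    if pm == 0b00:                      # no shift
        return preg & MASK32
    if pm == 0b01:                      # left 1, LSB zero filled
        return (preg << 1) & MASK32
    if pm == 0b10:                      # left 4, 4 LSBs zero filled
        return (preg << 4) & MASK32
    if pm == 0b11:                      # right 6, always sign extended
        return (to_signed32(preg) >> 6) & MASK32
    raise ValueError("PM is a 2-bit field")


# Fractional (Q15) example: 0.5 x 0.5 gives a Q30 product with two sign bits.
product = 0x4000 * 0x4000              # 0x10000000
q31 = p_scaler(product, 0b01)          # shifted into Q31 format
print(f"{q31:08X}")
```

With PM = 01₂ the redundant sign bit is removed, so the high word of the result can be stored directly as a Q15 fraction.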


The PM shifts also occur when the PREG contents are stored to data memory.
The PREG contents remain unchanged during the shifts.

The LT (load TREG0) instruction loads TREG0, from the data bus, with the first
operand; the MPY instruction provides the second operand for multiplication
operations. To perform a multiplication with a short or long immediate operand,
use the MPY instruction with an immediate operand. A product can be obtained
every two cycles except when a long immediate operand is used.

Four multiply/accumulate instructions (MAC, MACD, MADD, and MADS) fully
utilize the computational bandwidth of the multiplier, which allows both
operands to be processed simultaneously. The data for these operations can be
transferred to the multiplier each cycle via the program and data buses. When
any of the four multiply/accumulate instructions are used with the RPT or
RPTZ instruction, the instruction becomes a single-cycle multiply/accumulate
function. In these repeated instructions, the coefficient addresses are
generated by the PC while the data addresses are generated by the ARAU. This
allows the RPT instruction to sequentially access the values from the
coefficient table and step through the data in any of the indirect addressing
modes. The RPTZ instruction also clears the ACC and the PREG to initialize the
multiply/accumulate operation.

For example, consider multiplying the row of one matrix times the column of
a second matrix: there are 10 × 10 matrices, MTRX1 points to the beginning
of the first matrix, INDX = 10, and the current AR points to the beginning of the
second matrix:

    RPTZ  #9          ;For i = 0, i < 10, i++
    MAC   MTRX1,*0+   ;PREG = DATA(MTRX1 + i) x DATA[MTRX2 + (i x INDX)]
                      ;ACC += PREG.
    APAC              ;ACC += PREG.
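
The accumulation order in the repeated MAC can be easy to misread, because each MAC adds the previous product before forming the new one, and the final APAC adds the last product. The following Python sketch models that behavior under simplifying assumptions: the function name `rptz_mac` is hypothetical, memories are plain Python lists, and 32-bit wraparound is not modeled.

```python
def rptz_mac(coeffs, data, start, indx, count):
    """Model RPTZ #(count-1) / MAC coeffs,*0+ / APAC and return the final ACC."""
    acc = 0
    preg = 0                           # RPTZ clears both ACC and PREG
    ar = start                         # current AR points into the data matrix
    for i in range(count):
        acc += preg                    # MAC first adds the previous product
        preg = coeffs[i] * data[ar]    # then forms the new product
        ar += indx                     # *0+ : post-add INDX to the AR
    acc += preg                        # APAC adds the last product
    return acc


# Two 10 x 10 matrices stored row by row in flat lists.
m1 = list(range(100))
m2 = [(3 * k + 1) % 7 for k in range(100)]

# Row 2 of m1 times column 4 of m2.
result = rptz_mac(m1[20:30], m2, start=4, indx=10, count=10)
print(result)
```

Stepping the data pointer by INDX = 10 is what walks down a column of a row-major 10 × 10 matrix while the coefficient pointer walks along a row.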

The MAC and MACD instructions obtain their coefficient pointer from a long
immediate address and are, therefore, 2-word instructions. The MADS and
MADD instructions obtain their coefficient pointer from the BMAR and are,
therefore, 1-word instructions. When you use the BMAR as a source to the co-
efficient table, one block of code can support multiple applications, and you
can change the long immediate address without modifying executable code.
The MACD and MADD instructions include a data move (DMOV) operation
that, in conjunction with the fetch of the data multiplicand, writes the data value
to the next higher data address.


The MACD and MADD instructions, when repeated, support filter constructs
(weighted running averages) so that as the sum-of-products operation is ex-
ecuted, the sample data is shifted in memory to make room for the next sample
and to throw away the oldest sample. Circular addressing with MAC and
MADS instructions can also be used to support filter implementation.

In the next example, the current AR points to the oldest of the samples; BMAR
points to the coefficient table. In addition to initiating the repeat operation, the
RPTZ instruction also clears the ACC and the PREG. In this example, the PC
is stored in a temporary register while the repeated operation is executed.
Next, the PC is loaded with the value stored in BMAR. The program bus is used
to address the coefficients and, as the MADD instruction is repeatedly ex-
ecuted, the PC increments to step through the coefficient table. The ARAU
generates the address of the sample data.

Indirect addressing with decrement steps through the sample data, starting
with the oldest data. As the data is fetched, it is also written to the next higher
location in data memory. This operation aligns the data for the next execution
of the filter by moving the oldest sample out past the end of the sample array
and making room for the new sample at the beginning of the sample array. The
previous product of the PREG is added to the ACC, while the two fetched
values are multiplied and the new product value is loaded into the PREG. Note
that the DMOV portion of the MACD and MADD instructions does not function
with external data memory addresses.

    RPTZ  #9    ;ACC = PREG = 0. For I = 9 TO 0 Do
    MADD  *–    ;SUM AI x XI. XI+1 = XI.
    APAC        ;FINAL SUM.
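
The combined multiply/accumulate and data move can be sketched in Python as follows. This is a simplified model, not the device's implementation: `madd_fir` is a hypothetical name, the sample buffer is a list with one spare slot above the oldest sample, and overflow is ignored.

```python
def madd_fir(coeffs, mem, oldest):
    """Model RPTZ / MADD *- / APAC over a sample buffer held in mem.

    coeffs[0] pairs with the oldest sample because the coefficient pointer
    increments while the data pointer decrements. mem is modified in place:
    every fetched sample is copied to the next higher address (DMOV).
    """
    acc = 0
    preg = 0                           # cleared by RPTZ
    ar = oldest
    for coeff in coeffs:
        acc += preg                    # add the previous product
        preg = coeff * mem[ar]         # multiply coefficient by sample
        mem[ar + 1] = mem[ar]          # data move: shift sample up one address
        ar -= 1                        # *- : post-decrement the AR
    acc += preg                        # APAC
    return acc


samples = [1, 2, 3, 0]                 # newest at index 0, oldest at index 2
output = madd_fir([10, 20, 30], samples, oldest=2)
print(output, samples)
```

After the call, index 0 still holds the previous newest sample and is ready to be overwritten by the next input, which is why no separate shifting loop is needed.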

The MPYU instruction performs an unsigned multiplication that facilitates
extended-precision arithmetic operations. The unsigned contents of TREG0 are
multiplied by the unsigned contents of the addressed data memory location;
the result is placed in PREG. This allows operands larger than 16 bits to be
broken down into 16-bit words and processed separately to generate products
larger than 32 bits. The square/add (SQRA) and square/subtract (SQRS)
instructions pass the same value to both inputs of the multiplier for squaring a
data memory value.
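
As an illustration of how 16-bit partial products combine into a wider result, the sketch below builds an unsigned 32 × 32-bit product from four 16 × 16-bit multiplies. It is a sketch under stated assumptions: `mpyu` and `mul32x32_unsigned` are hypothetical helper names, and only unsigned operands are handled (signed operands would need a signed multiply for the high words, which is not modeled here).

```python
def mpyu(a16, b16):
    """Unsigned 16 x 16 -> 32-bit multiply, as the MPYU instruction performs."""
    if not (0 <= a16 <= 0xFFFF and 0 <= b16 <= 0xFFFF):
        raise ValueError("operands must be unsigned 16-bit words")
    return a16 * b16


def mul32x32_unsigned(x, y):
    """Build a 64-bit product from four 16-bit partial products."""
    xh, xl = (x >> 16) & 0xFFFF, x & 0xFFFF
    yh, yl = (y >> 16) & 0xFFFF, y & 0xFFFF
    low = mpyu(xl, yl)                       # weight 2**0
    mid = mpyu(xh, yl) + mpyu(xl, yh)        # weight 2**16
    high = mpyu(xh, yh)                      # weight 2**32
    return (high << 32) + (mid << 16) + low


print(hex(mul32x32_unsigned(0x12345678, 0x9ABCDEF0)))
```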

After the multiplication of two 16-bit numbers, this 32-bit product is loaded into
PREG. The product from the PREG can be transferred to the ALU or to data
memory via the store product high (SPH) and store product low (SPL) instruc-
tions.


3.2.2 Arithmetic Logic Unit (ALU) and Accumulators

The 32-bit general-purpose ALU and ACC implement a wide range of arithme-
tic and logical functions, the majority of which execute in a single clock cycle.
Once an operation is performed in the ALU, the result is transferred to the
ACC, where additional operations, such as shifting, can occur. Data that is in-
put to the ALU can be scaled by the prescaler.

The following steps occur in the implementation of a typical ALU instruction:

1) Data is fetched from memory on the data bus,

2) Data is passed through the prescaler and the ALU, where the arithmetic
is performed, and

3) The result is moved into the ACC.

The ALU operates on 16-bit words taken from data memory or derived from
immediate instructions. In addition to the usual arithmetic instructions, the ALU
can perform Boolean operations, thereby facilitating the bit manipulation
ability required of a high-speed controller. One input to the ALU is always
supplied by the ACC. The other input can be transferred from the PREG of the
multiplier, the ACCB, or the output of the prescaler (that has been read from
data memory or from the ACC). After the ALU has performed the arithmetic or
logical operation, the result is stored in the ACC. For the following example,
assume that ACC = 0, PREG = 0022 2200h, PM = 00₂, and ACCB = 0033 3300h:

    LACC  #01111h,8   ;ACC = 00111100h. Load ACC from prescaling
                      ;shifter
    APAC              ;ACC = 00333300h. Add to ACC the
                      ;product register.
    ADDB              ;ACC = 00666600h. Add to ACC the
                      ;accumulator buffer.

The 32-bit ACC can be split into two 16-bit segments (ACCH and ACCL) for
storage in data memory (see Figure 3–2). A postscaler at the output of the
ACC provides a left shift of 0 to 7 places. This shift is performed while the data
is being transferred to the data bus for storage. The contents of the ACC re-
main unchanged. When the postscaler is used on the high word of the ACC
(bits 16 – 31), the MSBs are lost and the LSBs are filled with bits shifted in from
the low word (bits 0 – 15). When the postscaler is used on the low word, the
LSBs are zero filled. For the following example, assume that
ACC = FF23 4567h:

    SACL  TEMP1,7   ;TEMP1 = B380h ACC = FF234567h.
    SACH  TEMP2,7   ;TEMP2 = 91A2h ACC = FF234567h.
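
The two stored values follow directly from the postscaler rules described above. A minimal Python sketch (the function names `sacl` and `sach` simply mirror the mnemonics and are not an actual API) reproduces them:

```python
def sacl(acc, shift):
    """Store low word: shift the low 16 bits left, zero filling the LSBs."""
    if not 0 <= shift <= 7:
        raise ValueError("postscaler shift is 0 to 7")
    return ((acc & 0xFFFF) << shift) & 0xFFFF


def sach(acc, shift):
    """Store high word: shift the whole ACC left, then take bits 16-31."""
    if not 0 <= shift <= 7:
        raise ValueError("postscaler shift is 0 to 7")
    return (((acc & 0xFFFFFFFF) << shift) >> 16) & 0xFFFF


acc = 0xFF234567
print(f"{sacl(acc, 7):04X} {sach(acc, 7):04X}")   # B380 91A2
```

Neither function changes `acc`, matching the statement that the contents of the ACC remain unchanged by the store.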


The ’C5x supports floating-point operations for applications requiring a large
dynamic range. By performing left shifts, the NORM (normalization) instruction
normalizes fixed-point numbers contained in the ACC. The four LSBs of
TREG1 define a variable shift through the prescaler for the add to/load
to/subtract from accumulator with shift specified by TREG1 (ADDT/LACT/SUBT)
instructions. These instructions denormalize a number (convert it from
floating-point to fixed-point) and also execute an automatic gain control (AGC)
going into a filter.

The single-cycle 1-bit to 16-bit right shift of the ACC can efficiently align its con-
tents. This shift, coupled with the 32-bit temporary buffer on the ACC, en-
hances the effectiveness of the CALU in extended-precision arithmetic. The
ACCB provides a temporary storage place for a fast save of the ACC. The
ACCB can also be used as an input to the ALU. The minimum or maximum
value in a string of numbers can be found by comparing the contents of the
ACCB with the contents of the ACC. The minimum or maximum value is placed
in both registers, and, if the condition is met, the carry bit (C) is set. The
minimum and maximum functions are executed by the CRLT and CRGT
instructions, respectively. These operations are signed arithmetic operations.
In the next example, assume that ACC = 1234 5678h and ACCB = 7654 3210h
before each instruction executes:

    CRLT    ;ACC = ACCB = 1234 5678h. C = 1.
    CRGT    ;ACC = ACCB = 7654 3210h. C = 0.
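
The compare instructions can be modeled as below. This is a hedged sketch: the function names are hypothetical, each call returns a tuple `(acc, accb, c)`, and the treatment of equal operands (C = 0) is an assumption that the text above does not state. Each instruction in the example is read as starting from the same initial register values.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a 2s-complement number."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def crlt(acc, accb):
    """Keep the smaller value in both registers; C = 1 if ACC < ACCB."""
    if to_signed32(acc) < to_signed32(accb):
        return acc, acc, 1
    return accb, accb, 0


def crgt(acc, accb):
    """Keep the larger value in both registers; C = 1 if ACC > ACCB."""
    if to_signed32(acc) > to_signed32(accb):
        return acc, acc, 1
    return accb, accb, 0


print(crlt(0x12345678, 0x76543210))
print(crgt(0x12345678, 0x76543210))
```

Because the comparison is signed, FFFF FFFFh (that is, -1) is treated as smaller than 0000 0001h.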

The ACC overflow saturation mode can be enabled by setting and disabled by
clearing the overflow mode (OVM) bit of ST0. When the ACC is in the overflow
saturation mode and an overflow occurs, the overflow flag is set and the ACC
is loaded with either the most positive or the most negative value represent-
able in the ACC, depending upon the direction of the overflow. The value of
the ACC upon saturation is 7FFF FFFFh (positive) or 8000 0000h (negative).
If the OVM bit is cleared and an overflow occurs, the overflowed results are
loaded into the ACC without modification. Note that logical operations cannot
result in overflow.
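
The effect of the OVM bit on an addition can be sketched as follows. The function name `alu_add` is hypothetical, and reporting the overflow flag as set in both modes is an assumption of this model; the text above states it explicitly only for saturation mode.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a 2s-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def alu_add(acc, operand, ovm):
    """Add operand to acc; return (new_acc, overflow_flag)."""
    total = to_signed32(acc) + to_signed32(operand)
    if -0x80000000 <= total <= 0x7FFFFFFF:
        return total & MASK32, 0               # no overflow
    if ovm:                                    # saturate toward the overflow
        return (0x7FFFFFFF if total > 0 else 0x80000000), 1
    return total & MASK32, 1                   # wrap around unmodified


print([f"{v:08X}" for v, _ in (alu_add(0x7FFFFFFF, 1, 1),
                               alu_add(0x7FFFFFFF, 1, 0))])
```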

The ’C5x can execute a variety of branch instructions that depend on the status
of the ALU and the ACC. For example, execution of the instruction BCND can
depend on a variety of conditions in the ALU and the ACC. The BACC instruc-
tion allows branching to an address stored in the ACC. The bit test instructions
(BITT and BIT) facilitate branching on the condition of a specified bit in data
memory.


The ACC has an associated carry bit that is set or cleared, depending on vari-
ous operations within the ’C5x. The carry bit allows more efficient computation
of extended-precision products and additions or subtractions; it is also useful
in overflow management. The carry bit is affected by most arithmetic instruc-
tions as well as the single-bit shift and rotate instructions. The carry bit is not
affected by loading the ACC, logical operations, or other nonarithmetic or con-
trol instructions. Examples of carry bit operations are shown in Figure 3–3.

Figure 3–3. Examples of Carry Bit Operations

    C   MSB       LSB                 C   MSB       LSB
    X   FFFF FFFF      ACC            X   0000 0000      ACC
      +         1                       –         1
    1   0000 0000                     0   FFFF FFFF

    C   MSB       LSB                 C   MSB       LSB
    X   7FFF FFFF      ACC            X   8000 0001      ACC
      +         1      (OVM = 0)        –         2      (OVM = 0)
    0   8000 0000                     1   7FFF FFFF

    C   MSB       LSB                 C   MSB       LSB
    1   0000 0000      ACC            0   FFFF FFFF      ACC
      +         0      (ADDC)           –         1      (SUBB)
    0   0000 0001                     1   FFFF FFFD

The value added to or subtracted from the ACC can come from the prescaler,
ACCB, or PREG. The carry bit is set if the result of an addition or accumulation
process generates a carry; it is cleared if the result of a subtraction generates
a borrow. Otherwise, it is cleared after an addition or set after a subtraction.

The add to ACC with carry (ADDC) and add ACCB to ACC with carry (ADCB)
instructions use the previous value of carry in their addition operation. The
subtract from ACC with borrow (SUBB) and subtract ACCB from ACC with bor-
row (SBBB) instructions use the logical inversion of the previous value of carry.
The one exception to the operation of the carry bit is in the use of ADD with
a shift count of 16 (add to ACCH) and SUB with a shift count of 16 (subtract
from ACCH). These instructions can generate a carry or a borrow, but they will
not clear a carry or borrow, as is normally the case if a carry or borrow is not
generated. This feature is useful for extended-precision arithmetic.
Two conditional operands, C and NC, are provided for branching, calling, re-
turning, and conditionally executing according to the status of the carry bit. The
CLRC, LST #1, and SETC instructions can be used to load the carry bit. The
carry bit is set on a reset.
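
The carry rules above can be checked with a short model. The sketch assumes OVM = 0 and does not model the shift-count-of-16 exception; `add_with_carry` and `sub_with_borrow` are hypothetical names, and each returns `(new_acc, carry)`.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(acc, operand, carry_in=0):
    """ADD (carry_in=0) or ADDC (carry_in = previous C)."""
    total = (acc & MASK32) + (operand & MASK32) + carry_in
    carry = 1 if total > MASK32 else 0         # set on carry out, else cleared
    return total & MASK32, carry


def sub_with_borrow(acc, operand, carry_in=1):
    """SUB (carry_in=1) or SUBB (uses the inverse of the previous C)."""
    borrow_in = 1 - carry_in
    diff = (acc & MASK32) - (operand & MASK32) - borrow_in
    carry = 0 if diff < 0 else 1               # cleared on borrow, else set
    return diff & MASK32, carry


print(add_with_carry(0xFFFFFFFF, 1))           # wraps to zero with carry
print(sub_with_borrow(0x00000000, 1))          # borrows, so carry is cleared
print(add_with_carry(0x00000000, 0, carry_in=1))
print(sub_with_borrow(0xFFFFFFFF, 1, carry_in=0))
```

Chaining `add_with_carry` on the low and then the high words of a 64-bit value is the usual way such a carry supports extended-precision addition.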

The 1-bit shift to the left (SFL) or right (SFR) and the rotate to the left (ROL)
or right (ROR) instructions shift or rotate the contents of the ACC through the
carry bit. The SXM bit affects the definition of the shift accumulator right (SFR)
instruction. When SXM = 1, SFR performs an arithmetic right shift, maintaining
the sign of the ACC data. When SXM = 0, SFR performs a logical shift, shifting
out the LSBs and shifting in a 0 for the MSB. The shift accumulator left (SFL)
instruction is not affected by the SXM bit and behaves the same in both cases,
shifting out the MSB and shifting in a 0. The RPT and RPTZ instructions can
be used with the shift and rotate instructions for multiple-bit shifts.
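
A behavioral sketch of the four single-bit instructions is given below. It assumes that the bit shifted out of the ACC is the one that lands in the carry bit; the function names mirror the mnemonics for readability and each returns `(new_acc, carry)`.

```python
MASK32 = 0xFFFFFFFF


def sfl(acc):
    """Shift left: MSB goes to C, 0 enters the LSB (SXM has no effect)."""
    acc &= MASK32
    return (acc << 1) & MASK32, acc >> 31


def sfr(acc, sxm):
    """Shift right: LSB goes to C; the MSB is kept if SXM = 1, else 0."""
    acc &= MASK32
    msb = acc & 0x80000000 if sxm else 0
    return (acc >> 1) | msb, acc & 1


def rol(acc, carry):
    """Rotate left through carry: C enters the LSB, MSB goes to C."""
    acc &= MASK32
    return ((acc << 1) & MASK32) | carry, acc >> 31


def ror(acc, carry):
    """Rotate right through carry: C enters the MSB, LSB goes to C."""
    acc &= MASK32
    return (acc >> 1) | (carry << 31), acc & 1


for sxm in (0, 1):
    value, c = sfr(0x80000001, sxm)
    print(f"SXM={sxm}: ACC={value:08X} C={c}")
```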

The SFLB, SFRB, RORB, and ROLB instructions can shift or rotate the 65-bit
combination of the ACC, ACCB, and carry bit as described above.

The ACC can also be shifted 0–31 bits right in two instruction cycles or 1–16
bits right in one cycle. The bits shifted out are lost, and the bits shifted in are
either 0s or copies of the original sign bit, depending on the value of the SXM
bit. A shift count of 1 to 16 is embedded in the instruction word of the BSAR
instruction. For example, let ACC = 1234 5678h:
    BSAR  7   ;ACC = 0024 68ACh.

The right shift can also be controlled via TREG1. The SATL instruction shifts
the ACC by 0–15 bits, as defined by bits 0–3 of TREG1. The SATH instruction
shifts the ACC 16 bits to the right if bit 4 of TREG1 is a 1. The following code
sequence executes a 0- to 31-bit right shift of the ACC, depending on the shift
count stored at SHIFT. For example, consider the value stored at
SHIFT = 01Bh and ACC = 1234 5678h:
    LMMR  TREG1,SHIFT   ;TREG1 = shift count 0 – 31. TREG1 = 1B
    SATH                ;If shift count > 15, then ACC >> 16
                        ;ACC = 00001234
    SATL                ;ACC >> shift count. ACC = 0000 0002
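
The SATH/SATL pair splits a 0- to 31-bit shift into a conditional 16-bit step and a 0- to 15-bit step. The Python sketch below models that split; the function names are hypothetical, and the `sxm` parameter selects between zero fill and sign fill as described for right shifts of the ACC.

```python
MASK32 = 0xFFFFFFFF


def shift_right(acc, count, sxm):
    """Right shift with sign fill (sxm=1) or zero fill (sxm=0)."""
    acc &= MASK32
    if sxm and acc & 0x80000000:
        return ((acc - (1 << 32)) >> count) & MASK32
    return acc >> count


def sath(acc, treg1, sxm=0):
    """Shift right 16 bits only if bit 4 of TREG1 is 1."""
    return shift_right(acc, 16, sxm) if treg1 & 0x10 else acc & MASK32


def satl(acc, treg1, sxm=0):
    """Shift right by the count in bits 0-3 of TREG1."""
    return shift_right(acc, treg1 & 0x0F, sxm)


treg1 = 0x1B                         # 27 = 16 + 11
acc = sath(0x12345678, treg1)        # 0x00001234
acc = satl(acc, treg1)               # 0x00000002
print(f"{acc:08X}")
```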

3.2.3 Scaling Shifters and Temporary Register 1 (TREG1)


The prescaler has a 16-bit input connected to the data bus and a 32-bit output
connected to the ALU (see Figure 3–2). The prescaler produces a left shift of
0 to 16 bits on the input data. The shift count is specified by a constant em-
bedded in the instruction word or by the value in TREG1. The LSBs of the out-
put are filled with 0s; the MSBs can be filled with 0s or sign-extended, depend-
ing upon the value of the SXM bit of ST1.
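
A minimal model of the prescaler is shown below. The name `prescale` is hypothetical; the function takes a 16-bit input word, a shift count of 0 to 16, and the SXM bit, and returns the 32-bit value presented to the ALU.

```python
MASK32 = 0xFFFFFFFF


def prescale(word16, shift, sxm):
    """Left-shift a 16-bit word into 32 bits, sign extending if sxm is 1."""
    if not 0 <= shift <= 16:
        raise ValueError("prescaler shift is 0 to 16")
    word16 &= 0xFFFF
    if sxm and word16 & 0x8000:
        value = word16 - 0x10000       # treat the input as negative
    else:
        value = word16                 # MSBs are filled with 0s
    return (value << shift) & MASK32   # LSBs are filled with 0s


print(f"{prescale(0x8000, 4, 1):08X}")   # sign extended
print(f"{prescale(0x8000, 4, 0):08X}")   # zero extended
```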

The p-scaler and postscaler make it possible for the CALU to perform numeri-
cal scaling, bit extraction, extended-precision arithmetic, and overflow preven-
tion. These shifters are connected to the output of the PREG and the ACC (see
Figure 3–2 on page 3-8).


3.3 Parallel Logic Unit (PLU)


The parallel logic unit (PLU) can directly set, clear, test, or toggle multiple bits
in a control/status register or any data memory location. The PLU provides a
direct logic operation path to data memory values without affecting the con-
tents of the ACC or the PREG (see Figure 3–4).
The PLU executes a read-modify-write operation on data stored in data space.
First, one operand is fetched from data memory space, and the second is
fetched from a long immediate on the program bus or from the dynamic bit ma-
nipulation register (DBMR). Then, the PLU executes a logical operation on the
two operands as defined by the instruction. The result is written to the same
data memory location from which the first operand was fetched.

Figure 3–4. Parallel Logic Unit Block Diagram

[Block diagram not reproduced. Blocks shown: Data Bus, DBMR, MUX, Program
Bus, PLU.]

Note: All registers and data lines are 16 bits wide unless otherwise specified.

The PLU makes it possible to directly manipulate bits in any location in data
memory space by ANDing, ORing, exclusive-ORing, or loading a 16-bit long
immediate value to a data location. For example, to use AR1 for circular buffer
1 and AR2 for circular buffer 2 but not enable the circular buffers, initialize the
circular buffer control register (CBCR) by executing the following code:
    SPLK  #021h,CBCR   ;Store peripheral long immediate
                       ;(DP = 0).

Next, enable circular buffers 1 and 2 by executing the code:

    OPL   #088h,CBCR   ;Set bit 7 and bit 3 in CBCR.


To test for individual bits in a specific register or data word, use the BIT instruc-
tion; however, to test for a pattern of bits, use the compare parallel long imme-
diate (CPL) instruction. If the data value is equal to the long immediate value,
then the test/control (TC) bit in ST1 is set. The TC bit is set if the result of any
PLU instruction is 0.

The set, clear, and toggle functions can be executed with a 16-bit dynamic reg-
ister value instead of the long immediate value. This is done with the following
three instructions: AND DBMR register to data (APL), OR DBMR register to
data (OPL), and exclusive-OR DBMR register to data (XPL).

The TC bit is also set by the APL, OPL, and XPL instructions if the result of the
PLU operation (value written back into data memory) is 0. This allows bits to
be tested and cleared simultaneously. For example,
    APL   #0FF00h,TEMP        ;Clear low byte and check for
                              ;bits set in high byte.
    BCND  HIGH_BITS_SET,NTC   ;If bits active in high byte,
                              ;then branch.

or

    XPL   #1,TEMP             ;Toggle bit 0.
    BCND  BIT_SET,TC          ;If bit was set, branch. If not,
                              ;bit set now.

In the first example, the low byte of a flag word is cleared while the high byte
is checked for any active flags (bits = 1). If none of the flags in the high byte
is set, then the resulting APL operation yields a 0 to TEMP and the TC bit is
set. If any of the flags in the high byte are set, then the resulting APL operation
yields a nonzero value to TEMP and the TC bit is cleared. Therefore, the
conditional branch (BCND) following the APL instruction branches if any of the
bits in the high byte are nonzero. The second example tests and toggles a flag
held in bit 0 of TEMP. If the flag is low, the flag is set high; if the flag is high,
the flag is cleared and the branch is taken. This relies on the other bits of TEMP
being 0, because TC is set only when the whole word written back is 0. The PLU
instructions can operate anywhere in data address space, so they can operate
with flags stored in RAM locations as well as in control registers for both
on- and off-chip peripherals. The PLU instructions are listed in
Table 6–6 on page 6-14.
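
The read-modify-write behaviour described above can be sketched in Python. This is a behavioural model only, not TI code: the function name plu and the sample TEMP value 12A5h are illustrative assumptions, and only the AND/OR/XOR result and the "TC = 1 when the result is 0" rule are taken from the text.

```python
MASK16 = 0xFFFF

def plu(op, mask, word):
    """Return (new_word, tc) for one PLU operation on a 16-bit data word."""
    if op == "APL":
        result = word & mask
    elif op == "OPL":
        result = word | mask
    elif op == "XPL":
        result = word ^ mask
    else:
        raise ValueError(f"unsupported PLU operation: {op}")
    result &= MASK16
    tc = 1 if result == 0 else 0      # TC is set when the value written back is 0
    return result, tc

# CBCR setup from the text: select AR1 and AR2 first, then set both enable bits.
cbcr = 0x0021                         # value stored by SPLK #021h,CBCR
cbcr, _ = plu("OPL", 0x0088, cbcr)    # OPL #088h,CBCR sets bit 7 and bit 3

# First flag example: clear the low byte, test the high byte.
temp, tc = plu("APL", 0xFF00, 0x12A5) # 12A5h is an arbitrary sample value
branch_taken = (tc == 0)              # BCND HIGH_BITS_SET,NTC branches when TC = 0
```

With the sample value, the high byte is nonzero, so TC is cleared and the NTC branch would be taken.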

3.4 Auxiliary Register Arithmetic Unit (ARAU)

The auxiliary register file contains eight memory-mapped auxiliary registers
(AR0–AR7), which can be used for indirect addressing of the data memory or
for temporary data storage. Indirect auxiliary register addressing (see
Figure 3–5) allows placement of the data memory address of an instruction
operand into one of the ARs. The ARs are pointed to by a 3-bit auxiliary register
pointer (ARP) that is loaded with a value from 0–7, designating AR0–AR7,
respectively. The ARs and the ARP can be loaded from data memory, the ACC,
or the PREG, or by an immediate operand defined in the instruction. The
contents of the ARs can be stored in data memory or used as inputs to the CALU.
The memory-mapped ARs reside in data page 0, as described in subsection
8.3.2, Local Data Memory Address Map, on page 8-17.

The auxiliary register file (AR0–AR7) is connected to the auxiliary register
arithmetic unit (ARAU), shown in Figure 3–6. The ARAU can autoindex the
current AR while the data memory location is being addressed; it indexes
either by ±1 or by the contents of the index register (INDX). As a result, the
CALU is not needed for address manipulation when tables of information are
accessed; it is free for other operations in parallel. For more advanced address
manipulation, such as multidimensional array addressing, the CALU can
directly read from or write to the ARs.

Figure 3–5. Indirect Auxiliary Register Addressing Example

Auxiliary register pointer (in ST0): ARP = 011 (selects AR3)

Auxiliary Register File
    AR0   0537h
    AR1   5150h
    AR2   0E9FCh
    AR3   0FF3Ah   (current AR)
    AR4   103Bh
    AR5   26B1h
    AR6   0008h
    AR7   843Dh

Data Memory Map (locations 0000h through 0FFFFh)
    Location 0FF3Ah contains 3121h

Figure 3–6. Auxiliary Register Arithmetic Unit


[Block diagram not reproduced. Labels recoverable from the figure: address
lines A15–A0; program control; IREG; AR0–AR7; ARP(3) in ST0; ARB(3) in ST1;
CBCR(8), CBSR1, CBSR2, CBER1, CBER2; INDX; DRB; ARCR; ARAU; multiplexers
(MUX); program bus; 16-bit data bus; SARAM; DARAM B0, B1, and B2.]

Notes: All registers and data lines are 16-bits wide unless otherwise specified.

The ARAU updates the ARs during the decode phase (second stage)
of the pipeline, while the CALU writes during the execution phase
(fourth stage). Therefore, the two instructions that immediately follow
the CALU write to an AR should not use the same AR for address
generation. See Chapter 7, Pipeline, for more details.

As shown in Figure 3–6, the INDX, auxiliary register compare register
(ARCR), or eight LSBs of the instruction register (IREG) can be used as one
of the inputs to the ARAU. The other input is provided by the contents of the
current AR pointed to by ARP. Table 3–2 defines the functions of the ARAU.

Table 3–2. Auxiliary Register Arithmetic Unit Functions

Function
    Description

Current AR + INDX → Current AR
    Index the current AR by adding an unsigned 16-bit integer contained in
    INDX. Example: ADD *0+

Current AR – INDX → Current AR
    Index the current AR by subtracting an unsigned 16-bit integer contained
    in INDX. Example: ADD *0–

Current AR + 1 → Current AR
    Increment the current AR by 1. Example: ADD *+

Current AR – 1 → Current AR
    Decrement the current AR by 1. Example: ADD *–

Current AR → Current AR
    Do not modify the current AR. Example: ADD *

Current AR + IR(7–0) → Current AR
    Add an 8-bit immediate value to the current AR. Example: ADRK #55h

Current AR – IR(7–0) → Current AR
    Subtract an 8-bit immediate value from the current AR. Example: SBRK #55h

Current AR + rc(INDX) → Current AR
    Bit-reversed indexing; add INDX with reversed-carry (rc) propagation.
    Example: ADD *BR0+

Current AR – rc(INDX) → Current AR
    Bit-reversed indexing; subtract INDX with reversed-carry (rc) propagation.
    Example: ADD *BR0–

If (Current AR) = (ARCR), then TC = 1
If (Current AR) < (ARCR), then TC = 1
If (Current AR) > (ARCR), then TC = 1
If (Current AR) ≠ (ARCR), then TC = 1
    Compare the current AR to ARCR and, if the condition is true, then set the
    TC bit of the status register ST1. If false, then clear the TC bit.
    Example: CMPR 3

If (Current AR) = (CBER), then Current AR = CBSR
    If the current AR is at the end of the circular buffer, reload the start
    address. The test for this condition is performed before the execution of
    the AR modification. Example: ADD *+

The INDX can be added to or subtracted from the current AR on any AR update
cycle. The INDX can be used to increment or decrement the address in steps
larger than 1; this is useful for operations such as addressing down a matrix
column. The ARCR limits blocks of data and supports logical comparisons
between the current AR and ARCR in conjunction with the CMPR instruction.
Note that the ’C2x uses AR0 for both of these functions. After reset, you can
use the load auxiliary register (LAR) instruction to load AR0; if the enable
extra index register (NDX) bit in the PMST is cleared (its state after reset),
LAR also loads INDX and ARCR to maintain compatibility with the ’C2x.
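
As an illustration of the ARCR comparison, the following Python sketch models the TC result of CMPR. It is a hedged model, not TI code: the function name cmpr is hypothetical, and the mapping of the CMPR operand (0 to 3) onto the four conditions is assumed to follow the row order of Table 3–2 (=, <, >, ≠), which is consistent with the table's "CMPR 3" example but is not stated in this section.

```python
import operator

# Assumed operand encoding, following the row order of Table 3-2:
# 0: AR = ARCR, 1: AR < ARCR, 2: AR > ARCR, 3: AR != ARCR
CMPR_TESTS = {0: operator.eq, 1: operator.lt, 2: operator.gt, 3: operator.ne}

def cmpr(cm, current_ar, arcr):
    """Return the TC bit for an unsigned 16-bit compare of the current AR with ARCR."""
    test = CMPR_TESTS[cm]
    return int(test(current_ar & 0xFFFF, arcr & 0xFFFF))

# Example: the AR has not yet reached a block limit held in ARCR.
tc = cmpr(3, 0x0300, 0x0310)   # "not equal" test
```

The compare is modelled as unsigned because the ARAU implements 16-bit unsigned arithmetic.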

Because the ARs are memory-mapped, the CALU can act directly upon them
and use more advanced indirect addressing techniques. For example, the
multiplier can calculate the addresses of 3-dimensional matrices. After a
CALU load of the AR, there is, however, a 2-instruction-cycle delay before the
ARs can be used for address generation. The INDX and ARCR are accessible
via the CALU, regardless of the condition of the NDX bit (that is, SAMM ARCR
writes only to the ARCR).

The ARAU can serve as an additional general-purpose arithmetic unit because
the auxiliary register file can directly communicate with data memory.
The ARAU implements 16-bit unsigned arithmetic, whereas the CALU implements
32-bit 2s-complement arithmetic. The BANZ and BANZD instructions
permit the ARs to be used as loop counters.

The 3-bit auxiliary register pointer buffer (ARB), shown in Figure 3–6, stores
the ARP on subroutine calls when the automatic context switch feature of the
’C5x is not used.

Two circular buffers can operate at a given time and are controlled via the
circular buffer control register (CBCR). Upon reset (rising edge of RS), both
circular buffers are disabled. To define a circular buffer, load CBSR1 or CBSR2
with the start address of the buffer and CBER1 or CBER2 with the end address;
then load the AR to be used with the circular buffer with an address between
the start and end addresses. Finally, load CBCR with the appropriate AR
number and set the enable (CENB1 or CENB2) bit.

Do not use the same AR to access both circular buffers or unexpected
results will occur.

As the address is stepping through the circular buffer, the AR value is
compared against the value contained in CBER prior to the update to the AR value.
If the current AR value and the CBER are equal and an AR modification occurs,
the value contained in CBSR is automatically loaded into the AR. If the values
in the CBER and the AR are not equal, the AR is modified as specified.

Circular buffers can be used with either increment- or decrement-type
updates. If increment is used, then the value in CBER must be larger than the
value in CBSR. If decrement is used, the value in CBER must be smaller than
the value in CBSR. The other indirect addressing modes can be used; however,
the ARAU tests only for the condition current AR = CBER. The ARAU does
not detect an AR update that steps over the value contained in CBER. See
Section 5.6, Circular Addressing, on page 5-21 for more details.
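
The wrap rule can be sketched in Python as follows. This is a simplified model under stated assumptions: the names circular_step and walk and the buffer addresses are illustrative, and the model covers a single AR with its circular buffer already enabled. It shows that the AR is compared with CBER before the update, and that a step that jumps over CBER is not detected.

```python
def circular_step(ar, step, cbsr, cber):
    """One AR update with circular addressing enabled for this AR.

    The end-of-buffer test uses the AR value before modification."""
    if ar == cber:
        return cbsr                  # reload the start address
    return (ar + step) & 0xFFFF      # otherwise modify the AR as specified

def walk(ar, step, cbsr, cber, count):
    """Return the addresses used by `count` successive accesses."""
    visited = []
    for _ in range(count):
        visited.append(ar)
        ar = circular_step(ar, step, cbsr, cber)
    return visited

# Increment-type buffer: CBER is larger than CBSR.
up = walk(0x0200, +1, cbsr=0x0200, cber=0x0203, count=6)

# Decrement-type buffer: CBER is smaller than CBSR.
down = walk(0x0203, -1, cbsr=0x0203, cber=0x0200, count=5)

# A step of 2 never lands on CBER = 0203h, so the AR leaves the buffer.
missed = walk(0x0200, +2, cbsr=0x0200, cber=0x0203, count=4)
```

The third walk illustrates why the step size and the buffer limits should be chosen so that the AR lands exactly on the CBER value.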

3.5 Summary of Registers


CPU registers (except ST0 and ST1), peripheral registers, and I/O ports
occupy data memory space.

3.5.1 Auxiliary Registers (AR0–AR7)


The eight 16-bit auxiliary registers (AR0–AR7) can be accessed by the CALU
and modified by the ARAU or the PLU. The primary function of the ARs is to
provide a 16-bit address for indirect addressing to data space. However, the
ARs can also be used as general-purpose registers or counters. Section 5.2,
Indirect Addressing, on page 5-4 describes how the ARs are used in indirect
addressing. Use of ARs is described in Section 3.4 on page 3-17.

3.5.2 Auxiliary Register Compare Register (ARCR)


The 16-bit ARCR is used for address boundary comparison. The CMPR
instruction compares the ARCR to the selected AR and places the result of the
compare in the TC bit of ST1. Section 5.2, Indirect Addressing, on page 5-4
describes how the ARCR can be used in memory management. See also
Section 3.4 on page 3-17.

3.5.3 Block Move Address Register (BMAR)


The 16-bit BMAR holds an address value to be used with block moves and
multiply/accumulate operations. This register provides the 16-bit address for
an indirect-addressed second operand. See Section 5.4, Dedicated-Register
Addressing, on page 5-17.

3.5.4 Block Repeat Registers (RPTC, BRCR, PASR, PAER)


The 16-bit repeat counter register (RPTC) holds the repeat count in a repeat
single-instruction operation and is loaded by the RPT and RPTZ instructions.
See Section 4.6, Single Instruction Repeat Function, on page 4-22.

Although the RPTC is a memory-mapped register, you should avoid
writing to this register. Writing to this register can cause undesired
results.

The 16-bit block repeat counter register (BRCR) holds the count value for the
block repeat feature. This value is loaded before a block repeat operation is
initiated. The value can be changed while a block repeat is in progress;
however, take care to avoid infinite loops. The block repeat program address
start register (PASR) indicates the 16-bit address where the repeated block of
code starts. The block repeat program address end register (PAER) indicates the
16-bit address where the repeated block of code ends. The PASR and PAER
are loaded by the RPTB instruction. Block repeats are described in Section
4.7, Block Repeat Function, on page 4-31.

3.5.5 Buffered Serial Port Registers (ARR, AXR, BKR, BKX, SPCE)

The buffered serial port (BSP) is available on ’C56 and ’C57 devices. The BSP
comprises a full-duplex, double-buffered serial port interface and an
autobuffering unit (ABU). The BSP has a 2K-word buffer, which resides in the ’C5x
internal memory. Five registers control and operate the BSP. The 16-bit BSP
control extension register (SPCE) contains the mode control and status bits
of the BSP. The 11-bit BSP address receive register (ARR) and 11-bit BSP
receive buffer size register (BKR) support address generation for writing to the
data receive register (DRR) in the ’C5x internal memory. The 11-bit BSP
address transmit register (AXR) and 11-bit BSP transmit buffer size register
(BKX) support address generation for reading a word from the ’C5x internal
memory to the data transmit register (DXR). The BSP is described in Section
9.8, Buffered Serial Port (BSP) Interface, on page 9-53.

3.5.6 Circular Buffer Registers (CBSR1, CBER1, CBSR2, CBER2, CBCR)


The ’C5x devices support two concurrent circular buffers operating in
conjunction with user-specified auxiliary registers. Two 16-bit circular buffer
start registers (CBSR1 and CBSR2) indicate the address where the circular buffer
starts. Two 16-bit circular buffer end registers (CBER1 and CBER2) indicate
the address where the circular buffer ends. The 16-bit circular buffer control
register (CBCR) controls the operation of these circular buffers and identifies
the auxiliary registers to be used. Section 5.6, Circular Addressing, on page
5-21 describes how circular buffers can be used in memory management.
Section 3.4 on page 3-17 describes how circular buffer registers are used in
addressing. See also subsection 4.4.1, Circular Buffer Control Register
(CBCR), on page 4-6.

3.5.7 Dynamic Bit Manipulation Register (DBMR)


The 16-bit DBMR is used in conjunction with the PLU as a dynamic
(execution-time programmable) mask register. The DBMR is described in
Section 3.3 on page 3-15.

3.5.8 Global Memory Allocation Register (GREG)


The 16-bit GREG allocates parts of the local data space as global memory and
defines how much of the local data space is overlaid by global data
space. See Section 8.4, Global Data Memory, on page 8-20.

3.5.9 Host Port Interface Registers (HPIC, HPIA)


The 8-bit wide parallel host port interface (HPI) is available on the ’C57 device.
The HPI interfaces a host processor to the ’C57 device. The HPI control
register (HPIC) holds the control word. The host processor addresses HPI memory
via the HPI address register (HPIA). See Section 9.10, Host Port Interface
(’C57S and ’LC57 only), on page 9-87.

3.5.10 Index Register (INDX)


The 16-bit INDX is used by the ARAU as a step value (addition or subtraction
by more than 1) to modify the address in the ARs during indirect addressing.
For example, when the ARAU steps across a row of a matrix, the indirect
address is incremented by 1. However, when the ARAU steps down a column,
the address is incremented by the dimension of the matrix. The ARAU can add
or subtract the value stored in the INDX from the current AR as part of the
indirect address operation. INDX can also map the dimension of the address block
used for bit-reversal addressing. Section 5.2, Indirect Addressing, on page 5-4
describes how the INDX can be used in memory management. See also
Section 3.4 on page 3-17.
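
The matrix example can be illustrated with a short Python sketch. It assumes a matrix stored row by row in data memory, so that "the dimension of the matrix" is the row length; the function name column_addresses and the base address 0300h are illustrative, not taken from the manual.

```python
def column_addresses(base, rows, cols, column):
    """Addresses touched when stepping down one column of a row-major
    rows x cols matrix, post-adding INDX (= cols) after each access."""
    indx = cols                          # value that would be held in INDX
    ar = (base + column) & 0xFFFF        # AR starts at the column's first element
    addresses = []
    for _ in range(rows):
        addresses.append(ar)             # the access uses the current AR
        ar = (ar + indx) & 0xFFFF        # then the ARAU adds INDX
    return addresses

# Column 1 of a 3 x 4 matrix stored at 0300h.
col = column_addresses(0x0300, rows=3, cols=4, column=1)
```

Stepping across a row would instead use an increment of 1, with INDX left unused.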

3.5.11 I/O Space (PA0–PA15)


The I/O space makes it possible to address 16 locations (50h–5Fh) of I/O
space via the addressing modes of the local data space. This means that these
locations can be read directly into the CALU or written from the ACC. It also
means that these locations can be acted upon by the PLU or addressed via
the memory-mapped addressing mode. The locations can also be addressed
with the IN and OUT instructions.

3.5.12 Instruction Register (IREG)


The 16-bit IREG holds the opcode of the instruction being executed. The IREG
is used during program control.

3.5.13 Interrupt Registers (IMR, IFR)


The 16-bit interrupt mask register (IMR) individually masks specific interrupts
at required times. The 16-bit interrupt flag register (IFR) indicates the current
status of the interrupts. The status of the interrupts is updated regardless of
the IMR and INTM bit in the ST0. Interrupts are described in Section 4.8,
Interrupts, on page 4-36.

3.5.14 Processor Mode Status Register (PMST)


The 16-bit PMST contains status and control information for the ’C5x device.
Subsection 8.2.1, Program Memory Configurability, on page 8-7 and
subsection 8.3.1, Local Data Memory Configurability, on page 8-15 describe how the
PMST configures memory. See also subsection 4.4.2, Processor Mode Status
Register (PMST), on page 4-7.

3.5.15 Product Register (PREG)


The 32-bit PREG holds the result of a multiply operation. The high and low
words of PREG can be accessed individually. See subsection 3.2.1 on page 3-7.

3.5.16 Serial Port Interface Registers (SPC, DRR, DXR, XSR, RSR)

Five registers control and operate the serial port interface. The 16-bit serial
port control register (SPC) contains the mode control and status bits of the
serial port. The 16-bit data receive register (DRR) holds the incoming serial data,
and the 16-bit data transmit register (DXR) holds the outgoing serial data. The
16-bit data transmit shift register (XSR) controls the shifting of the data from
the DXR to the output pin. The 16-bit data receive shift register (RSR) controls
the storing of the data from the input pin to the DRR. The serial port is
described in Section 9.7, Serial Port Interface, on page 9-23.

3.5.17 Software-Programmable Wait-State Registers (PDWSR, IOWSR, CWSR)


The software wait states are determined by three registers. These registers
serve different purposes on different devices. On most ’C5x devices the 16-bit
program/data wait-state register (PDWSR) contains the wait-state count for
the eight 16K-word blocks of program and data memory. The PDWSR is
divided into eight 2-bit wait-state fields assigned to each 16K-word block. The
I/O space is mapped into the 16-bit I/O wait-state register (IOWSR) under
control of the 5-bit wait-state control register (CWSR). The CWSR determines the
range of wait states selected. The BIG bit in the CWSR determines how the
I/O space is partitioned. If the BIG bit is cleared, the IOWSR is divided into eight
pairs of I/O ports with the 2-bit wait-state fields assigned to each pair of port
addresses. If the BIG bit is set, the I/O space is divided into eight 8K-word
blocks with each having its own 2-bit wait-state field, similar to PDWSR. For
the ’C52, ’LC56, ’C57S, and ’LC57 devices, the program, data, and I/O space
wait states are each specified by a single (3-bit) wait-state value. Each
memory space can be independently set to 0–7 wait states by a 3-bit wait-state
field in PDWSR. See Section 9.4, Software-Programmable Wait-State
Generators, on page 9-13.

3.5.18 Status Registers (ST0, ST1)


The two 16-bit status registers contain status and control bits for the CPU and
are described in subsection 4.4.3, Status Registers (ST0 and ST1), on page
4-10.

3.5.19 Temporary Registers (TREG0, TREG1, TREG2)


The 16-bit TREG0 holds one of the multiplicands of the multiplier. TREG0 can
also be loaded via the CALU with the following instructions: LT, LTA, LTD, LTP,
LTS, SQRA, SQRS, MAC, MACD, MADS, and MADD. The 5-bit TREG1 holds
a dynamic (execution-time programmable) shift count for the prescaling
shifter. The 4-bit TREG2 holds a dynamic bit address for the BITT instruction. The
TREG0 is described in subsection 3.2.1 on page 3-7.

Software compatibility can be maintained with the ’C2x by clearing the enable
multiple TREGs (TRM) bit in the PMST. This causes any ’C2x instruction that
loads TREG0 to write to all three TREGs, maintaining ’C5x object-code
compatibility with the ’C2x.

3.5.20 Timer Registers (TIM, PRD, TCR)


Three registers control and operate the timer. The timer counter register (TIM)
gives the current count of the timer. The timer period register (PRD) defines
the period for the timer. The 16-bit timer control register (TCR) controls the
operations of the timer. See Section 9.3, Timer, on page 9-9.

3.5.21 TDM Serial Port Registers (TRCV, TDXR, TSPC, TCSR, TRTA, TRAD, TRSR)

The time-division-multiplexed (TDM) serial port interface is a feature superset
of the serial port interface and supports applications that require serial
communication in a multiprocessing environment. Seven registers control and
operate the TDM serial port interface. The 16-bit TDM serial port control register
(TSPC) contains the mode control and status bits of the TDM serial port
interface. The 16-bit TDM data receive register (TRCV) holds the incoming TDM
serial data, and the 16-bit TDM data transmit register (TDXR) holds the
outgoing TDM serial data. The 16-bit TDM data receive shift register (TRSR)
controls the storing of the data, from the input pin, to the TRCV. The 16-bit TDM
channel select register (TCSR) specifies in which time slot(s) each ’C5x device
is to transmit. The 16-bit TDM receive/transmit address register (TRTA)
specifies in the eight LSBs (RA0–RA7) the receive address of the ’C5x device and
in the eight MSBs (TA0–TA7) the transmit address of the ’C5x device. The
16-bit TDM receive address register (TRAD) contains various information
regarding the status of the TDM address line (TADD). See Section 9.9,
Time-Division Multiplexed (TDM) Serial Port Interface, on page 9-74.

Chapter 5

Addressing Modes

This chapter describes each of the following addressing modes and gives the
opcode formats and some examples.

- Direct addressing
- Indirect addressing
- Immediate addressing
- Dedicated-register addressing
- Memory-mapped register addressing
- Circular addressing

Topic

5.1 Direct Addressing
5.2 Indirect Addressing
5.3 Immediate Addressing
5.4 Dedicated-Register Addressing
5.5 Memory-Mapped Register Addressing
5.6 Circular Addressing

5.1 Direct Addressing


In the direct memory addressing mode, the instruction contains the lower 7 bits of
the data memory address (dma). The 7-bit dma is concatenated with the 9 bits of
the data memory page pointer (DP) in status register 0 to form the full 16-bit data
memory address. This 16-bit data memory address is placed on an internal direct
data memory address bus (DAB). The DP points to one of 512 possible data
memory pages and the 7-bit address in the instruction points to one of 128 words
within that data memory page. You can load the DP bits by using the LDP or the
LST #0 instruction.

Figure 5–1 illustrates how the 16-bit data memory address is formed.

Figure 5–1. Direct Addressing

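[Diagram not reproduced. The 9-bit DP field of ST0 supplies bits 15–7 of the
16-bit data memory address, and the 7 LSBs of the IREG (the dma) supply
bits 6–0. The resulting address is placed on the DAB and selects one
128-word page out of 512 data pages (page 0 through page 511). Page 0
contains the memory-mapped registers and DARAM B2.]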

Note:
The DP is not initialized by reset and, therefore, is undefined after power-up.
The ’C5x development tools, however, use default values for many parameters,
including the DP. Because of this, programs that do not explicitly initialize the
DP may execute improperly, depending on whether they are executed on a
’C5x device or with a development tool. Thus, it is critical that all programs
initialize the DP in software.

Figure 5–2 illustrates the direct addressing mode. Bits 15 through 8 contain
the opcode. Bit 7, with a value of 0, defines the addressing mode as direct, and
bits 6 through 0 contain the dma.

Figure 5–2. Direct Addressing Mode


Instruction sequence:

    LDP  #019Dh
    ADD  010h, 5

Bit             15–8         7    6–0
Machine code    0010 0101    0    001 0000     (ADD opcode, direct mode, dma = 010h)

DP              1 1001 1101
DAB             1100 1110 1001 0000
Operand         Data(DAB)

Note: DAB is the 16-bit internal data memory address bus.
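
The address formation in Figure 5–2 can be checked with a short Python sketch. The function name direct_address is hypothetical; the values 019Dh (DP) and the machine code 0010 0101 0001 0000 (2510h) are the ones shown in the figure.

```python
def direct_address(dp, instruction_word):
    """Form the 16-bit data memory address used in direct addressing.

    dp               : 9-bit data page pointer from ST0
    instruction_word : 16-bit instruction word; bit 7 must be 0 (direct mode)
    """
    if instruction_word & 0x0080:
        raise ValueError("bit 7 is set: the instruction uses indirect addressing")
    dma = instruction_word & 0x007F          # lower 7 bits of the instruction
    return ((dp & 0x01FF) << 7) | dma        # DP concatenated with dma

# Values from Figure 5-2: LDP #019Dh followed by ADD 010h, 5.
dab = direct_address(0x019D, 0x2510)
```

For these inputs the result is 0CE90h, which matches the DAB bit pattern in the figure.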

5.2 Indirect Addressing


Eight 16-bit auxiliary registers (AR0–AR7) provide flexible and powerful indirect
addressing. In indirect addressing, any location in the 64K-word data memory
space can be accessed using a 16-bit address contained in an AR. Figure 5–3
shows the hardware for indirect addressing.

Figure 5–3. Indirect Addressing


[Diagram not reproduced. Labels recoverable from the figure: 16-bit data bus;
3-bit ARB and 3-bit ARP (shown with ARP = 2); auxiliary registers AR0–AR7;
ARAU with 16-bit inputs; 16-bit data address output.]

To select a specific AR, load the auxiliary register pointer (ARP) with a value
from 0 through 7, designating AR0 through AR7, respectively. The register
pointed to by the ARP is referred to as the current auxiliary register (current
AR). You can load the address into the AR using the LAR instruction and you
can change the content of the AR by the:

- ADRK instruction
- MAR instruction
- SBRK instruction
- Indirect addressing field of any instruction supporting indirect addressing.

The content of the current AR is used as the address of the data memory
operand. After the instruction uses the data value, the content of the current
AR can be incremented or decremented by the auxiliary register arithmetic unit
(ARAU), which implements unsigned 16-bit arithmetic.

The ARAU performs auxiliary register arithmetic operations in the decode
phase of the pipeline (when the instruction specifying the operation is being
decoded). This allows the address to be generated before the decode phase
of the next instruction. The content of the current AR is incremented or
decremented after it is used in the current instruction.

You can load the ARs via the data bus by using memory-mapped writes to the
ARs. The following instructions can write to the memory-mapped ARs:

    APL     OPL     SAMM    XPL
    BLDD    SACH    SMMR
    LMMR    SACL    SPLK

Be careful when using these memory-mapped loads of the ARs because, in
this case, the memory-mapped ARs are modified in the execute phase of the
pipeline. This causes a pipeline conflict if one of the next two instruction words
modifies that AR. For further information on the pipeline and possible pipeline
conflicts, see Chapter 7, Pipeline.
There are two ways to use the ARs for purposes other than referencing data
memory addresses:
- Use the ARs to support conditional branches, calls, and returns by using
the CMPR instruction. This instruction compares the content of the current
AR with the content of the auxiliary register compare register (ARCR) and
puts the result in the test/control (TC) flag bit of status register ST1.
- Use the ARs for temporary storage by using the LAR instruction to load
a value into the AR and the SAR instruction to store the AR value to a data
memory location.

5.2.1 Indirect Addressing Options


The ’C5x provides four indirect addressing options:
- No increment or decrement. The instruction uses the content of the current
AR as the data memory address, but neither increments nor decrements the
content of the current AR.
- Increment or decrement by one. The instruction uses the content of the
current AR as the data memory address and then increments or decrements
the content of the current AR by 1.
- Increment or decrement by an index amount. The value in INDX is the
index amount. The instruction uses the content of the current AR as the
data memory address and then increments or decrements the content of
the current AR by the index amount.

- Increment or decrement by an index amount using reverse carry. The
value in INDX is the index amount. The instruction uses the content of the
current AR as the data memory address and then increments or decrements
the content of the current AR by the index amount. The addition or
subtraction is done using reverse carry propagation.

The contents of the current AR are used as the address of the data memory
operand. Then, the ARAU performs the specified mathematical operation on
the indicated AR. Additionally, the ARP can be loaded with a new value. All
indexing operations are performed on the current AR in the same cycle as the
original instruction decode phase of the pipeline.

Indirect auxiliary register addressing lets you make post-access adjustments
of the current AR. The adjustment may be an increment or decrement by one
or may be based upon the contents of the INDX. To maintain compatibility with
the ’C2x devices, clear the NDX bit in the PMST. In the ’C2x architecture, the
current AR can be incremented or decremented by the value in the AR0. When
the NDX bit is cleared, every AR0 modification or LAR write also writes the
ARCR and INDX with the same value. Subsequent modifications of the current
ARs with indexed addressing will use the INDX, therefore maintaining compatibility
with existing ’C2x code. The NDX bit is cleared at reset.

The bit-reversed addressing modes (see subsection 5.2.3 on page 5-12) help
you achieve efficient I/O by resequencing the data points in a radix-2 fast
Fourier transform (FFT) program. The direction of carry propagation in the
ARAU is reversed when bit-reversed addressing is selected, and INDX is added
to or subtracted from the current AR. Normally, this addressing mode requires that
INDX first be set to a value corresponding to one-half of the array’s size, and
that the current AR be set to the base address of the data (the first data point).

The following indirect-addressing symbols are used in the ’C5x assembly language
instructions:

*       No increment or decrement. Content of the current AR is used
        as the data memory address and is neither incremented nor
        decremented.
*+      Increment by 1. Content of the current AR is used as the data
        memory address. After the memory access, the content of the current
        AR is incremented by 1.
*–      Decrement by 1. Content of the current AR is used as the data memory
        address. After the memory access, the content of the current AR is
        decremented by 1.
*0+     Increment by index amount. Content of the current AR is used as the
        data memory address. After the memory access, the content of
        INDX is added to the content of the current AR.
*0–     Decrement by index amount. Content of the current AR is used as
        the data memory address. After the memory access, the content
        of INDX is subtracted from the content of the current AR.
*BR0+   Increment by index amount, adding with reverse carry. Content
        of the current AR is used as the data memory address. After the memory
        access, the content of INDX with reverse carry propagation is added
        to the content of the current AR.
*BR0–   Decrement by index amount, subtracting with reverse carry.
        Content of the current AR is used as the data memory address. After the
        memory access, the content of INDX with reverse carry propagation
        is subtracted from the content of the current AR.
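
The post-access updates listed above can be modelled in Python. This is an illustrative sketch, not TI code: the names reverse16 and post_modify are hypothetical, the symbols are written with an ASCII minus sign, and reverse-carry arithmetic is modelled (as an assumption) by mirroring both 16-bit operands, adding or subtracting normally, and mirroring the result back. The example uses an 8-point buffer at the assumed base address 0200h with INDX set to half the array size.

```python
def reverse16(value):
    """Mirror the 16 bits of value (bit 0 <-> bit 15, bit 1 <-> bit 14, ...)."""
    return int(f"{value & 0xFFFF:016b}"[::-1], 2)

def post_modify(ar, indx, mode):
    """Return the current AR after the memory access for one addressing symbol."""
    if mode == "*":
        new = ar
    elif mode == "*+":
        new = ar + 1
    elif mode == "*-":
        new = ar - 1
    elif mode == "*0+":
        new = ar + indx
    elif mode == "*0-":
        new = ar - indx
    elif mode == "*BR0+":
        new = reverse16(reverse16(ar) + reverse16(indx))   # carry runs MSB -> LSB
    elif mode == "*BR0-":
        new = reverse16(reverse16(ar) - reverse16(indx))   # borrow runs MSB -> LSB
    else:
        raise ValueError(f"unknown addressing symbol: {mode}")
    return new & 0xFFFF                                    # unsigned 16-bit result

# Bit-reversed walk over an 8-point array: INDX = 8 / 2 = 4.
ar = 0x0200
order = []
for _ in range(8):
    order.append(ar - 0x0200)          # offset of the data point being accessed
    ar = post_modify(ar, 4, "*BR0+")
```

The offsets collected in order follow the bit-reversed index sequence used to resequence radix-2 FFT data, and after eight accesses the AR returns to the base address.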

5.2.2 Indirect Addressing Opcode Format


Indirect addressing can be used with all instructions except those with immediate
operands or with no operands. The indirect addressing format is shown in
Figure 5–4 and described in Table 5–1.

Table 5–3 on page 5-9 shows the instruction field bit values, notation, and
operation used for indirect addressing. Example 5–1 through Example 5–8
illustrate the indirect addressing formats. Example 5–9 shows an indirect
addressing routine.

Figure 5–4. Indirect Addressing Opcode Format Diagram

Bit      15–8      7    6      5      4      3    2–0
Field    Opcode    I    IDV    INC    DEC    N    NAR

Table 5–1. Indirect Addressing Opcode Format Summary


Bit    Name   Description

15–8          Opcode. This 8-bit field is the opcode for the instruction.

7      I      Addressing mode bit. This 1-bit field determines the addressing mode.
              I = 0     Direct addressing mode.
              I = 1     Indirect addressing mode.

6      IDV    Index register bit. This 1-bit field determines whether the INDX is
              used to increment or decrement the current AR. The IDV bit works in
              conjunction with the INC and DEC bits to determine the arithmetic
              operation.
              IDV = 0   The INDX is not used in the arithmetic operation. An
                        increment or decrement (if any) by 1 occurs to the
                        current AR.
              IDV = 1   The INDX is used in the arithmetic operation. An
                        increment or decrement (if any) by the contents of INDX
                        or by reverse carry propagation occurs to the current AR.

5      INC    Auxiliary register increment bit. This 1-bit field determines whether
              the current AR is incremented. The INC bit works in conjunction with
              the IDV and DEC bits to determine the arithmetic operation.
              INC = 0   The current AR is not incremented.
              INC = 1   The current AR is incremented as determined by the IDV bit.

4      DEC    Auxiliary register decrement bit. This 1-bit field determines whether
              the current AR is decremented. The DEC bit works in conjunction with
              the IDV and INC bits to determine the arithmetic operation. See
              Table 5–2 for specific arithmetic operations.
              DEC = 0   The current AR is not decremented.
              DEC = 1   The current AR is decremented as determined by the IDV bit.

3      N      Next auxiliary register indicator bit. This 1-bit field determines
              whether the instruction will change the ARP value.
              N = 0     The content of the ARP will remain unchanged.
              N = 1     The content of NAR will be loaded into the ARP, and the
                        old ARP value is loaded into the auxiliary register
                        buffer (ARB) of status register ST1.

2–0    NAR    Next auxiliary register value bits. This 3-bit field contains the
              value of the next auxiliary register. If the N bit is set, NAR is
              loaded into the ARP.


Table 5–2. Indirect Addressing Arithmetic Operations

Bit values
IDV INC DEC Arithmetic Operation Performed on Current AR
0 0 0 No operation on current AR
0 0 1 (Current AR) – 1 → current AR
0 1 0 (Current AR) + 1 → current AR
0 1 1 Reserved
1 0 0 (Current AR) – INDX [reverse carry propagation] → current AR
1 0 1 (Current AR) – INDX → current AR
1 1 0 (Current AR) + INDX → current AR
1 1 1 (Current AR) + INDX [reverse carry propagation] → current AR
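
The field layout of Table 5–1 and the bit patterns of Table 5–2 can be checked with a short sketch. The Python below is only an illustration, not TI tooling: the function name `encode_indirect`, the ASCII spellings of the mode names (a plain `-` in place of the en dash), and the use of 28h as the upper byte (taken from the `ADD ...,8` instruction words quoted in Examples 5–1 through 5–8) are all assumptions made for this example.

```python
# (IDV, INC, DEC) patterns copied from Table 5-2; mode names use ASCII '-' (an assumption).
MODES = {
    "*":     (0, 0, 0),
    "*-":    (0, 0, 1),
    "*+":    (0, 1, 0),
    "*BR0-": (1, 0, 0),
    "*0-":   (1, 0, 1),
    "*0+":   (1, 1, 0),
    "*BR0+": (1, 1, 1),
}


def encode_indirect(opcode8, mode, next_ar=None):
    """Build a 16-bit word with I = 1 (indirect) from the Table 5-1 fields.

    opcode8 : value of bits 15-8
    mode    : key of MODES
    next_ar : None leaves ARP unchanged (N = 0); 0..7 sets N = 1 and NAR
    """
    idv, inc, dec = MODES[mode]
    n = 0 if next_ar is None else 1
    nar = 0 if next_ar is None else next_ar & 0x7
    return (
        ((opcode8 & 0xFF) << 8)
        | (1 << 7)        # I bit: indirect addressing
        | (idv << 6)
        | (inc << 5)
        | (dec << 4)
        | (n << 3)
        | nar
    )


if __name__ == "__main__":
    for mode, next_ar in [("*", None), ("*+", None), ("*+", 3), ("*BR0+", None)]:
        word = encode_indirect(0x28, mode, next_ar)
        print(f"{mode:6} next_ar={next_ar}: {word:04X}h")
```

With 28h in bits 15–8, the encoder reproduces the instruction words stated in the examples, which is a useful sanity check when reading the bit diagrams.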

Table 5–3. Instruction Field Bit Values for Indirect Addressing


Instruction Field Bit Values
15–8         7  6  5  4  3  2–0        Notation      Operation
← Opcode →   1  0  0  0  0  ← NAR →    *             No operation on current AR
← Opcode →   1  0  0  0  1  ← NAR →    *, ARn        NAR → ARP
← Opcode →   1  0  0  1  0  ← NAR →    *–            (Current AR) – 1 → current AR
← Opcode →   1  0  0  1  1  ← NAR →    *–, ARn       (Current AR) – 1 → current AR, NAR → ARP
← Opcode →   1  0  1  0  0  ← NAR →    *+            (Current AR) + 1 → current AR
← Opcode →   1  0  1  0  1  ← NAR →    *+, ARn       (Current AR) + 1 → current AR, NAR → ARP
← Opcode →   1  1  0  0  0  ← NAR →    *BR0–         (Current AR) – rcINDX → current AR
← Opcode →   1  1  0  0  1  ← NAR →    *BR0–, ARn    (Current AR) – rcINDX → current AR, NAR → ARP
← Opcode →   1  1  0  1  0  ← NAR →    *0–           (Current AR) – INDX → current AR
← Opcode →   1  1  0  1  1  ← NAR →    *0–, ARn      (Current AR) – INDX → current AR, NAR → ARP
← Opcode →   1  1  1  0  0  ← NAR →    *0+           (Current AR) + INDX → current AR
← Opcode →   1  1  1  0  1  ← NAR →    *0+, ARn      (Current AR) + INDX → current AR, NAR → ARP
← Opcode →   1  1  1  1  0  ← NAR →    *BR0+         (Current AR) + rcINDX → current AR
← Opcode →   1  1  1  1  1  ← NAR →    *BR0+, ARn    (Current AR) + rcINDX → current AR, NAR → ARP

Example 5–1. Indirect Addressing With No Change to AR

ADD *,8

0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0

In Example 5–1, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is not changed. The instruction word is 2880h.

Example 5–2. Indirect Addressing With Autodecrement

ADD *–,8

0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0

In Example 5–2, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is decremented by 1. The instruction word is 2890h.

Example 5–3. Indirect Addressing With Autoincrement

ADD *+,8

0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0

In Example 5–3, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is incremented by 1. The instruction word is 28A0h.


Example 5–4. Indirect Addressing With Autoincrement and Change AR

ADD *+,8,AR3

0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1

In Example 5–4, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is incremented by 1. The auxiliary register pointer (ARP) is loaded with the
value 3 for subsequent instructions. The instruction word is 28ABh.

Example 5–5. Indirect Addressing With INDX Subtracted from AR

ADD *0–,8

0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0

In Example 5–5, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX is subtracted from the current AR. The instruction word is 28D0h.

Example 5–6. Indirect Addressing With INDX Added to AR


ADD *0+,8

0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0

In Example 5–6, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX is added to the current AR. The instruction word is 28E0h.

Example 5–7. Indirect Addressing With INDX Subtracted from AR With Reverse Carry

ADD *BR0–,8

0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0

In Example 5–7, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX with reverse carry propagation is subtracted from the current AR. The
instruction word is 28C0h.


Example 5–8. Indirect Addressing With INDX Added to AR With Reverse Carry

ADD *BR0+,8

0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0

In Example 5–8, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX with reverse carry propagation is added to the current AR. The
instruction word is 28F0h.

Example 5–9. Indirect Addressing Routine

* This routine uses indirect addressing to calculate the following equation:
*
*     10
*   –––––
*   \      X(I) x Y(I)
*   /
*   –––––
*   I = 1
*
* The routine assumes that the X values are located in on-chip RAM block B0,
* and the Y values in block B1. The efficiency of the routine is due to the
* use of indirect addressing and the repeat instruction.
*
SERIES  MAR   *,AR4       ;ARP POINTS TO ADDRESS REGISTER 4.
        SETC  CNF         ;CONFIGURE BLOCK B0 AS PROGRAM MEMORY.
        LAR   AR4,#0300h  ;POINT AT BEGINNING OF DATA MEMORY.
        RPTZ  #9          ;CLEAR ACC AND PREG; REPEAT NEXT INST. 10 TIMES
        MAC   0FF00h,*+   ;MULTIPLY AND ACCUMULATE; INCREMENT AR4.
        APAC              ;ACCUMULATE LAST PRODUCT.
        RET               ;ACCUMULATOR CONTAINS RESULT.
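
The data flow of the routine in Example 5–9 can be followed with a small behavioural model. The Python sketch below is an assumption-laden illustration, not a cycle-accurate simulator: it ignores register widths, product shift modes, and memory configuration, and the function name `series` simply mirrors the label in the listing. It models only the behaviour the listing's comments describe: RPTZ clears ACC and PREG, each repeated MAC adds the previous product to ACC before forming a new product, and APAC adds the last product.

```python
def series(x_values, y_values):
    """Sum of X(I) * Y(I), mirroring the RPTZ / MAC / APAC sequence."""
    acc = 0    # accumulator (ACC), cleared by RPTZ
    preg = 0   # product register (PREG), cleared by RPTZ
    for x, y in zip(x_values, y_values):   # the repeated MAC instruction
        acc += preg                        # add the previous product to ACC
        preg = x * y                       # form the new product
    acc += preg                            # APAC: accumulate the last product
    return acc


if __name__ == "__main__":
    xs = list(range(1, 11))   # ten hypothetical X values
    ys = list(range(1, 11))   # ten hypothetical Y values
    print(series(xs, ys))
```

The model makes it visible why the final APAC is needed: after the last MAC, the most recent product is still waiting in PREG.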

5.2.3 Bit-Reversed Addressing


In the bit-reversed addressing mode, INDX specifies one-half the size of the
FFT. The value contained in the current AR must be equal to 2^(n–1), where n is
an integer, and the FFT size is 2^n. An auxiliary register points to the physical
location of a data value. When you add INDX to the current AR using
bit-reversed addressing, addresses are generated in a bit-reversed fashion.

Assume that the auxiliary registers are eight bits long, that AR2 represents the
base address of the data in memory (0110 0000 binary), and that INDX contains the
value 0000 1000 binary. Example 5–10 shows a sequence of modifications to AR2
and the resulting values of AR2. Table 5–4 shows the relationship of the bit
pattern of the index steps and the four LSBs of AR2, which contain the
bit-reversed address.


Example 5–10. Sequence of Auxiliary Register Modifications in Bit-Reversed Addressing

*BR0+ ;AR2 = 0110 0000 (0th value)
*BR0+ ;AR2 = 0110 1000 (1st value)
*BR0+ ;AR2 = 0110 0100 (2nd value)
*BR0+ ;AR2 = 0110 1100 (3rd value)
*BR0+ ;AR2 = 0110 0010 (4th value)
*BR0+ ;AR2 = 0110 1010 (5th value)
*BR0+ ;AR2 = 0110 0110 (6th value)
*BR0+ ;AR2 = 0110 1110 (7th value)

Table 5–4. Bit-Reversed Addresses


Step Bit Pattern Bit-Reversed Pattern Bit-Reversed Step
0 0000 0000 0
1 0001 1000 8
2 0010 0100 4
3 0011 1100 12
4 0100 0010 2
5 0101 1010 10
6 0110 0110 6
7 0111 1110 14
8 1000 0001 1
9 1001 1001 9
10 1010 0101 5
11 1011 1101 13
12 1100 0011 3
13 1101 1011 11
14 1110 0111 7
15 1111 1111 15
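
Reverse carry propagation can be modelled as an ordinary addition performed on bit-reversed operands. The Python sketch below is an illustrative assumption about how to emulate the behaviour in software (the function names and the 8-bit register width, taken from the worked example, are not part of the device documentation). It regenerates the AR2 sequence of Example 5–10 and the bit-reversed steps of Table 5–4.

```python
def reverse_bits(value, width):
    """Return `value` with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def add_reverse_carry(ar, indx, width=8):
    """Add INDX to AR with the carry propagating from MSB toward LSB."""
    mask = (1 << width) - 1
    total = reverse_bits(ar & mask, width) + reverse_bits(indx & mask, width)
    return reverse_bits(total & mask, width)


def br0_plus_sequence(base, indx, count, width=8):
    """Addresses used by `count` successive *BR0+ accesses starting at `base`."""
    addresses = []
    ar = base
    for _ in range(count):
        addresses.append(ar)                       # address used for this access
        ar = add_reverse_carry(ar, indx, width)    # post-modify the AR
    return addresses


if __name__ == "__main__":
    for step, addr in enumerate(br0_plus_sequence(0b01100000, 0b00001000, 8)):
        print(f"step {step}: AR2 = {addr:08b}")
```

Extending the count to 16 and keeping only the four LSBs gives the "Bit-Reversed Step" column of Table 5–4.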


5.3 Immediate Addressing


In immediate addressing, the instruction word(s) contains the value of the
immediate operand. The ’C5x has both 1-word (8-bit, 9-bit, and 13-bit constant)
short immediate instructions and 2-word (16-bit constant) long immediate
instructions. Table 5–5 lists the instructions that support immediate addressing.

Table 5–5. Instructions That Support Immediate Addressing


Short Immediate (1-Word)
  8-Bit Constant:    ADD, ADRK, LACL, LAR, RPT, SBRK, SUB
  9-Bit Constant:    LDP
  13-Bit Constant:   MPY

Long Immediate (2-Word)
  16-Bit Constant:   ADD, AND, APL, CPL, LACC, LAR, MPY, OPL,
                     OR, RPT, RPTZ, SPLK, SUB, XOR, XPL

5.3.1 Short Immediate Addressing


In short immediate instructions, the operand is contained within the instruction
machine code. Figure 5–5 shows an example of the short immediate mode.
Note that in this example, the lower 8 bits are the operand and will be added
to the ACC by the CALU.

Figure 5–5. Short Immediate Addressing Mode

ADD #0FFh

                 15            8   7             0
Machine Code     1 0 1 1 1 0 0 0   1 1 1 1 1 1 1 1
                 (ADD opcode)      (0FFh)

Operand                            1 1 1 1 1 1 1 1


5.3.2 Long Immediate Addressing

In long immediate instructions, the operand is contained in the second word
of a two-word instruction. There are two long immediate addressing modes:

- One-operand instructions
- Two-operand instructions

5.3.2.1 Long Immediate Addressing with Single/No Data Memory Access

Figure 5–6 shows an example of long immediate addressing with no data
memory access. In Figure 5–6, the second word of the 2-word instruction is
added to the ACC by the CALU.

Figure 5–6. Long Immediate Addressing Mode — No Data Memory Access


ADD #01234h

Machine Code     1 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0     (ADD opcode)
Operand          0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0     (01234h)
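
As a rough illustration of where the constant lives in each case, the following Python sketch extracts the operand from the machine words shown in Figures 5–5 and 5–6. The function names are hypothetical, and the sketch assumes only what the figures state: a short immediate constant occupies the low bits of the single instruction word, and a long immediate constant is the whole second word.

```python
def short_immediate_operand(word, bits=8):
    """Constant held in the low `bits` bits of a 1-word instruction."""
    return word & ((1 << bits) - 1)


def long_immediate_operand(first_word, second_word):
    """Constant held in the second word of a 2-word instruction."""
    return second_word & 0xFFFF


if __name__ == "__main__":
    # ADD #0FFh, machine code from Figure 5-5
    print(f"{short_immediate_operand(0b1011100011111111):02X}h")
    # ADD #01234h, machine code and operand word from Figure 5-6
    print(f"{long_immediate_operand(0b1011111110010000, 0b0001001000110100):04X}h")
```

The `bits` parameter can be set to 9 or 13 for the other short immediate widths listed in Table 5–5.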

5.3.2.2 Long Immediate Addressing with Dual Data Memory Access

Long immediate addressing can also supply the address for a second data memory
access during execution of the instruction. The prefetch counter (PFC) is
pushed onto the microcall stack (MCS), and the long immediate value is loaded
into the PFC. The program address/data bus is then used for the operand fetch
or write. At the completion of the instruction, the MCS is popped back to the PFC,
the program counter (PC) is incremented by two, and execution continues. The
PFC is used so that when the instruction is repeated, the address generated can
be autoincremented.

Figure 5–7 shows an example of long immediate addressing with two operands.
In Figure 5–7, the source address (OPERAND1) is fetched via PAB, and
the destination address (OPERAND2) uses the direct addressing mode. Bits
15 through 8 of machine code1 contain the opcode. Bit 7, with a value of 0,
defines the addressing mode as direct, and bits 6 through 0 contain the dma.


Figure 5–7. Long Immediate Addressing Mode — Two Operands


BLDD #02345h,012h

                 15            8  7  6           0
Machine Code1    1 0 1 0 1 0 0 0  0  0 0 1 0 0 1 0     (BLDD opcode, I = 0, dma = 012h)

DP               1 1 0 0 1 1 1 0 1

DAB              1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 0       (DP followed by dma)

Machine Code2    0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1       (02345h)

PC               0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1

Operand1         Data (PC)
Operand2         Data (DAB)

Note: DAB is the 16-bit internal data memory address bus.
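
The way the destination address in Figure 5–7 is assembled can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are invented for the example, and the only facts used are those in the text above (bits 15–8 hold the opcode, bit 7 selects direct addressing, bits 6–0 hold the dma, and the 9-bit DP supplies the upper bits of the 16-bit data address).

```python
def decode_direct_word(word):
    """Split a 16-bit instruction word into (opcode, I bit, dma)."""
    opcode = (word >> 8) & 0xFF
    i_bit = (word >> 7) & 0x1
    dma = word & 0x7F
    return opcode, i_bit, dma


def data_address(dp, dma):
    """Concatenate the 9-bit data page pointer with the 7-bit dma."""
    return ((dp & 0x1FF) << 7) | (dma & 0x7F)


if __name__ == "__main__":
    machine_code1 = 0b1010100000010010   # first word shown in Figure 5-7
    dp = 0b110011101                     # DP value shown in Figure 5-7
    opcode, i_bit, dma = decode_direct_word(machine_code1)
    dab = data_address(dp, dma)
    print(f"opcode={opcode:02X}h I={i_bit} dma={dma:02X}h DAB={dab:04X}h")
```

The resulting 16-bit value matches the DAB row of the figure, bit for bit.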


5.4 Dedicated-Register Addressing


The dedicated-register addressing mode operates like the long immediate
addressing mode, except that the address comes from one of two
special-purpose memory-mapped registers in the CPU: the block move
address register (BMAR) and the dynamic bit manipulation register (DBMR).
The advantage of this addressing mode is that the address of the block of
memory to be acted upon can be changed during execution of the program.
The syntax for dedicated-register addressing can be stated in one of two ways:

- Specify BMAR by its predefined symbol:

      BLDD BMAR,DAT100 ;DP = 0. BMAR contains the value 200h.

  The content of data memory location 200h is copied to data memory
  location 100 on the current data page.

- Exclude the immediate value from a parallel logic unit (PLU) instruction:

      OPL DAT10 ;DP = 6. DBMR contains the value FFF0h.
                ;Address 030Ah contains the value 01h.

  The content of data memory location 030Ah is ORed with the content of
  the DBMR. The resulting value FFF1h is stored back in memory location
  030Ah.
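
The two cases above can be traced with a small memory model. The Python sketch below is a simplified illustration, not a description of the hardware: memory is a dictionary, `DAT100` and `DAT10` are assumed to mean page offsets 100 and 10 (decimal), and the value 1234h placed at 200h is invented so that the copy is visible.

```python
def data_address(dp, dma):
    """Direct address: 9-bit data page pointer followed by the 7-bit dma."""
    return ((dp & 0x1FF) << 7) | (dma & 0x7F)


def bldd_from_bmar(memory, bmar, dp, dma):
    """Copy the word addressed by BMAR to the direct-addressed location."""
    memory[data_address(dp, dma)] = memory.get(bmar, 0)


def opl_with_dbmr(memory, dbmr, dp, dma):
    """OR the direct-addressed word with DBMR and store it back."""
    addr = data_address(dp, dma)
    memory[addr] = (memory.get(addr, 0) | dbmr) & 0xFFFF
    return addr


if __name__ == "__main__":
    memory = {0x0200: 0x1234, 0x030A: 0x0001}   # 1234h is a made-up source value

    bldd_from_bmar(memory, bmar=0x0200, dp=0, dma=100)
    print(f"location 100 now holds {memory[100]:04X}h")

    addr = opl_with_dbmr(memory, dbmr=0xFFF0, dp=6, dma=10)
    print(f"address {addr:04X}h now holds {memory[addr]:04X}h")
```

Because BMAR and DBMR are ordinary registers, changing `bmar` or `dbmr` between calls changes the operand without altering the instruction itself, which is the advantage the text describes.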

5.4.1 Using the Contents of the BMAR


The BLDD, BLDP, and BLPD instructions use the BMAR to point at the source
or destination space of a block move. The MADD and MADS instructions also
use the BMAR to address an operand in program memory for a
multiply-accumulate operation.

Figure 5–8 shows how the BMAR is used in the dedicated-register addressing
mode. Bits 15 through 8 of the machine code contain the opcode. Bit 7, with
a value of 0, defines the addressing mode as direct, and bits 6 through 0
contain the dma.


Figure 5–8. Dedicated-Register Addressing Using the BMAR


BLDD BMAR, 012h

                 15            8  7  6           0
Machine Code     1 0 1 0 1 1 0 0  0  0 0 1 0 0 1 0     (BLDD opcode, I = 0, dma = 012h)

DP               1 1 0 0 1 1 1 0 1

DAB              1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 0       (DP followed by dma)

BMAR → PFC

Operand1         Data (PFC)
Operand2         Data (DAB)

Note: DAB is the 16-bit internal data memory address bus.

5.4.2 Using the Contents of the DBMR


The APL, CPL, OPL, and XPL instructions use the PLU and the contents of the
DBMR when an immediate value is not specified as one of the operands.
Figure 5–9 illustrates how the DBMR is used as an AND mask in the APL
instruction. Bits 15 through 8 of the machine code contain the opcode. Bit 7,
with a value of 0, defines the addressing mode as direct, and bits 6 through
0 contain the dma.

Figure 5–9. Dedicated-Register Addressing Using the DBMR


APL 010h

                 15            8  7  6           0
Machine Code     0 1 0 1 1 0 1 0  0  0 0 1 0 0 0 0     (APL opcode, I = 0, dma = 010h)

DP               1 1 0 0 1 1 1 0 1

DAB              1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0       (DP followed by dma)

Operand1         Data (DAB)
Operand2         DBMR

Note: DAB is the 16-bit internal data memory address bus.


5.5 Memory-Mapped Register Addressing


With memory-mapped register addressing, you can modify the memory-
mapped registers without affecting the current data page pointer value. In
addition, you can modify any scratch pad RAM (DARAM B2) location or data
page 0. The memory-mapped register addressing mode operates like the
direct addressing mode, except that the 9 MSBs of the address are forced to
0 instead of being loaded with the contents of the DP. This allows you to
address the memory-mapped registers of data page 0 directly without the
overhead of changing the DP or auxiliary register.

The following instructions operate in the memory-mapped register addressing
mode. Using these instructions does not affect the contents of the DP:

- LAMM: Load accumulator with memory-mapped register
- LMMR: Load memory-mapped register
- SAMM: Store accumulator in memory-mapped register
- SMMR: Store memory-mapped register

Figure 5–10 illustrates how this is done by forcing the 9 MSBs of the data
memory address to 0, regardless of the current value of the DP when direct
addressing is used or of the current AR value when indirect addressing is used.

Example 5–11 uses memory-mapped register addressing in the indirect
addressing mode and Example 5–12 uses the direct addressing mode.

Figure 5–10. Memory-Mapped Register Addressing


Source of the 7 LSBs: IREG (direct addressing) or current AR (indirect addressing)

         15              7   6        0
DAB      0 0 0 0 0 0 0 0 0   dma (7 LSBs)       16-bit memory-mapped register address

The address always falls in page 0, the 128-word page that holds the
memory-mapped registers and DARAM B2.


Example 5–11. Memory-Mapped Register Addressing in the Indirect Addressing Mode


SAMM *+ ;STORE ACC TO PMST REGISTER

In Example 5–11, assume that ARP = 3 and AR3 = FF07h. The content of the
ACC is stored to the PMST (address 07h) pointed at by the 7 LSBs of AR3.

Example 5–12. Memory-Mapped Register Addressing in the Direct Addressing Mode


LAMM 07h ;ACC = PMST

In Example 5–12, assume that DP = 0184h and TEMP1 = 8060h. The content
of memory location 07h (PMST) is loaded into the ACC. Figure 5–11 illustrates
memory-mapped register addressing in the direct addressing mode.

Figure 5–11.Memory-Mapped Addressing in the Direct Addressing Mode

LAMM PMST

                 15            8  7  6           0
Machine Code     0 0 0 0 1 0 0 0  0  0 0 0 0 1 1 1     (LAMM opcode, dma = 07h)

Value            0 0 0 0 0 0 0 0 0                     (9 MSBs forced to 0)

DAB              0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1

Operand          Data (DAB)

Note: DAB is the 16-bit internal data memory address bus.
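
The difference between ordinary direct addressing and memory-mapped register addressing comes down to what fills the 9 MSBs of the address. The Python sketch below is an illustrative model with invented function names. It uses the values given in Examples 5–11 and 5–12 (AR3 = FF07h, DP = 0184h, dma = 07h).

```python
def direct_address(dp, dma):
    """Ordinary direct addressing: the 9 MSBs come from the DP."""
    return ((dp & 0x1FF) << 7) | (dma & 0x7F)


def mmr_address(source):
    """Memory-mapped register addressing: keep the 7 LSBs, force the 9 MSBs to 0.

    `source` is the instruction word (direct form) or the current AR
    (indirect form).
    """
    return source & 0x007F


if __name__ == "__main__":
    print(f"SAMM *+ with AR3 = FF07h -> {mmr_address(0xFF07):04X}h")
    print(f"LAMM 07h with DP = 0184h -> {mmr_address(0x0807):04X}h")
    print(f"direct addressing, same DP -> {direct_address(0x184, 0x07):04X}h")
```

Both memory-mapped forms land on address 07h (PMST) regardless of the DP or of the upper bits of the AR, whereas plain direct addressing with the same DP would reach a different page.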


5.6 Circular Addressing


Many algorithms such as convolution, correlation, and finite impulse response
(FIR) filters can use circular buffers in memory to implement a sliding window,
which contains the most recent data to be processed. The ’C5x supports two
concurrent circular buffers operating via the ARs. The following five
memory-mapped registers control the circular buffer operation:

- CBSR1: Circular buffer 1 start register
- CBSR2: Circular buffer 2 start register
- CBER1: Circular buffer 1 end register
- CBER2: Circular buffer 2 end register
- CBCR: Circular buffer control register

The 8-bit CBCR enables and disables the circular buffer operation and is
defined in subsection 4.4.1, Circular Buffer Control Register (CBCR), on
page 4-6.

To define a circular buffer, first load the start and end addresses into the
corresponding buffer registers. Next, load an AR with a value between the
start and end addresses of the buffer, and set the corresponding circular
buffer enable bit in the CBCR. Note that you must not enable the same AR for
both circular buffers; if you do, unexpected results occur. The algorithm for
circular buffer addressing below shows that the test of the AR value is
performed before any modifications:

    If (ARn = CBER) and (any AR modification),
    Then: ARn = CBSR.
    Else: ARn = ARn + step.

If ARn = CBER and no AR modification occurs, the current AR is not modified
and is still equal to CBER. When the current AR = CBER, any AR modification
(increment or decrement) will set the current AR = CBSR. Example 5–13
illustrates the operation of circular addressing.


Example 5–13. Circular Addressing

        mar   *,ar6
        ldp   #0
        splk  #200h,CBSR1  ; Circular buffer start register
        splk  #203h,CBER1  ; Circular buffer end register
        splk  #0Eh,CBCR    ; Enable AR6 pointing to buffer 1
        lar   ar6,#200h    ; Case 1
        lacc  *            ; AR6 = 200h
        lar   ar6,#203h    ; Case 2
        lacc  *            ; AR6 = 203h
        lar   ar6,#200h    ; Case 3
        lacc  *+           ; AR6 = 201h
        lar   ar6,#203h    ; Case 4
        lacc  *+           ; AR6 = 200h
        lar   ar6,#200h    ; Case 5
        lacc  *–           ; AR6 = 1FFh
        lar   ar6,#203h    ; Case 6
        lacc  *–           ; AR6 = 200h
        lar   ar6,#202h    ; Case 7
        adrk  2            ; AR6 = 204h
        lar   ar6,#203h    ; Case 8
        adrk  2            ; AR6 = 200h

In circular addressing, the step is the quantity that is being added to or
subtracted from the specified AR. Take care when using a step of greater than 1
to modify the AR pointing to an element of the circular buffer. If an update to
an AR generates an address outside the range of the circular buffer, the ARAU
does not detect this situation, and the buffer does not wrap around. AR
updates are performed as described in Section 5.2, Indirect Addressing.
Because of the pipeline, there is a two-cycle latency between configuring the
CBCR and performing AR modifications.

Circular buffers can be used in increment- or decrement-type updates. For
incrementing the value in the AR, the value in CBER must be greater than the
value in CBSR. For decrementing the value in the AR, the value in CBSR must
be greater than the value in CBER.
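
The update rule stated for circular addressing is short enough to model directly. The Python sketch below is an illustration of that rule only (the function name and the 16-bit wrap of the AR are assumptions). It takes the step as a signed quantity, with 0 meaning no AR modification, and reproduces the eight cases of Example 5–13.

```python
def circular_update(ar, step, cbsr, cber):
    """Return the AR value after one access with circular addressing enabled.

    The test against CBER happens before the modification:
    if AR equals CBER and any modification is requested, AR is reloaded
    with CBSR; otherwise the step is simply added.
    """
    if step == 0:          # no AR modification requested
        return ar
    if ar == cber:         # end of buffer reached: wrap to the start
        return cbsr
    return (ar + step) & 0xFFFF


if __name__ == "__main__":
    CBSR1, CBER1 = 0x200, 0x203
    cases = [(0x200, 0), (0x203, 0), (0x200, +1), (0x203, +1),
             (0x200, -1), (0x203, -1), (0x202, +2), (0x203, +2)]
    for number, (start, step) in enumerate(cases, start=1):
        result = circular_update(start, step, CBSR1, CBER1)
        print(f"Case {number}: AR6 {start:03X}h, step {step:+d} -> {result:03X}h")
```

Case 7 shows the hazard the text warns about: a step of 2 from 202h skips past CBER, so the AR leaves the buffer without wrapping.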

DSP Applications Using C and the TMS320C6x DSK. Rulph Chassaing
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-20754-3 (Hardback); 0-471-22112-0 (Electronic)

3
Architecture and Instruction Set
of the C6x Processor

• Architecture and instruction set of the TMS320C6x processor
• Addressing modes
• Assembler directives
• Linear assembler
• Programming examples using C, assembly, and linear assembly code

3.1 INTRODUCTION

Texas Instruments introduced the first-generation TMS32010 digital signal
processor in 1982, the TMS320C25 in 1986 [1], and the TMS320C50 in 1991. Several
versions of each of these processors (C1x, C2x, and C5x) are available with different
features, such as faster execution speed. These 16-bit processors are all fixed-point
processors and are code-compatible.
In a von Neumann architecture, program instructions and data are stored in a
single memory space. A processor with a von Neumann architecture can make a
read or a write to memory during each instruction cycle. Typical DSP applications
require several accesses to memory within one instruction cycle. The fixed-point
processors C1x, C2x, and C5x are based on a modified Harvard architecture with
separate memory spaces for data and instructions that allow concurrent accesses.
Quantization error or round-off noise from an ADC is a concern with a
fixed-point processor. An ADC uses only a best-estimate digital value to represent an
input. For example, consider an ADC with a word length of 8 bits and an input range
of ±1.5 V. The steps represented by the ADC are: input range/2^8 = 3/256 = 11.72 mV.
This produces errors which can be up to ±(11.72 mV)/2 = ±5.86 mV. Only a best
estimate can be used by the ADC to represent input values that are not multiples of
11.72 mV. With an 8-bit ADC, 2^8 or 256 different levels can represent the input signal.
An ADC with a larger word length such as a 16-bit ADC (currently very common)
can reduce the quantization error, yielding a higher resolution. The more bits that
an ADC has, the better it can represent an input signal.
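
The step size and worst-case error quoted above follow from two divisions, which the Python sketch below reproduces. The function names are chosen for this example, and the 16-bit figure is computed with the same ±1.5 V range purely for comparison; that pairing is an assumption, not a value from the text.

```python
def quantization_step(input_span_volts, bits):
    """Voltage represented by one ADC step: span / 2^bits."""
    return input_span_volts / (2 ** bits)


def max_quantization_error(input_span_volts, bits):
    """Worst-case error: half of one step, in either direction."""
    return quantization_step(input_span_volts, bits) / 2


if __name__ == "__main__":
    span = 3.0   # an input range of +/-1.5 V spans 3 V
    for bits in (8, 16):
        step_mv = quantization_step(span, bits) * 1000
        err_mv = max_quantization_error(span, bits) * 1000
        print(f"{bits}-bit ADC: step = {step_mv:.4f} mV, max error = +/-{err_mv:.4f} mV")
```

Each additional bit halves the step, which is why the 16-bit converter resolves the same input range far more finely than the 8-bit one.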
The TMS320C30 floating-point processor was introduced in the late 1980s. The
C31, C32, and the more recent C33 are all members of the C3x family of floating-
point processors [2,3]. The C4x floating-point processors, introduced subsequently,
are code-compatible with the C3x processors and are based on the modified
Harvard architecture [4].
The TMS320C6201 (C62x), announced in 1997, is the first member of the C6x
family of fixed-point digital signal processors. Unlike the previous fixed-point
processors, C1x, C2x, and C5x, the C62x is based on a very-long-instruction-word
(VLIW) architecture, still using separate memory spaces for instructions and data
as with the Harvard architecture. The VLIW architecture has simpler instructions,
but more are needed for a task than with a conventional DSP architecture.
The C62x is not code-compatible with the previous generation of fixed-point
processors. Subsequently, the TMS320C6701 (C67x) floating-point processor was
introduced as another member of the C6x family of processors. The instruction
set of the C62x fixed-point processor is a subset of the instruction set of the
C67x processor. Appendix A contains a list of instructions available on the C6x
processors. A recent addition to the family of the C6x processors is the fixed-point
C64x.
An application-specific integrated circuit (ASIC) has a DSP core with customized
circuitry for a specific application. A C6x processor can be used as a standard
general-purpose digital signal processor programmed for a specific application.
Examples of specific-purpose digital signal processors include the modem and the
echo canceler.
A fixed-point processor is better for devices that use batteries, such as cellular
phones, since it uses less power than does an equivalent floating-point processor.
The fixed-point processors C1x, C2x, and C5x are 16-bit processors with limited
dynamic range and precision. The C6x fixed-point processor is a 32-bit processor
with improved dynamic range and precision. In a fixed-point processor, it is
necessary to scale the data. Overflow, which occurs when an operation such as the
addition of two numbers produces a result with more bits than can fit within a
processor’s register, becomes a concern.
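
A small numeric sketch may make the overflow and scaling remarks concrete. The Python below emulates a 16-bit two's-complement register. The function names, the sample values, and the choice of a one-bit right shift as the scaling step are all assumptions made for illustration, not a description of any particular processor's arithmetic modes.

```python
def wrap16(value):
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def add16(a, b):
    """Addition as a 16-bit register would hold it (overflow wraps around)."""
    return wrap16(a + b)


def scaled_add16(a, b):
    """Scale both inputs down by 2 first, so the sum always fits in 16 bits."""
    return wrap16((a >> 1) + (b >> 1))


if __name__ == "__main__":
    a, b = 30000, 10000                       # both fit in 16 bits; their sum does not
    print("unscaled:", add16(a, b))           # wrapped, wrong sign
    print("scaled  :", scaled_add16(a, b))    # half of the true sum
```

The scaled result is only half the true sum, so the factor of 2 has to be tracked by the programmer; that bookkeeping is the cost of scaling on a fixed-point device.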
A floating-point processor is generally more expensive since it has more “real
estate” or is a larger chip because of additional circuitry necessary to handle integer
as well as floating-point arithmetic. Several factors, such as cost, power
consumption, and speed, come into play when choosing a specific digital signal processor.
The C6x processors are particularly useful for applications requiring intensive
computations. Family members of the C6x include both fixed-point (e.g., C62x, C64x)
and floating-point processors (e.g., C67x). Other digital signal processors are also
available, from companies such as Motorola and Analog Devices [5].
Other architectures include the superscalar, which requires special hardware to
determine which instructions are executed in parallel. The burden is then on the
processor, whereas in the VLIW architecture it falls more on the programmer. A
superscalar processor does not necessarily execute the same group of instructions,
and as a result, its execution is difficult to time. Thus, it is rarely used in DSP.

3.2 TMS320C6x ARCHITECTURE

The TMS320C6711 onboard the DSK is a floating-point processor based on the
VLIW architecture [6–9]. Internal memory includes a two-level cache architecture
with 4 kB of level 1 program cache (L1P), 4 kB of level 1 data cache (L1D), and
64 kB of RAM or level 2 cache for data/program allocation (L2). It has a glueless
(direct) interface to both synchronous memories (SDRAM and SBSRAM) and
asynchronous memories (SRAM and EPROM). Synchronous memory requires
clocking but provides a compromise between static SRAM and dynamic SDRAM,
with SRAM being faster but more expensive than DRAM.
On-chip peripherals include two multichannel buffered serial ports (McBSPs),
two timers, a 16-bit host port interface (HPI), and a 32-bit external memory
interface (EMIF). It requires 3.3 V for I/O and 1.8 V for the core (internal).
Internal buses include a 32-bit program address bus, a 256-bit program data bus to
accommodate eight 32-bit instructions, two 32-bit data address buses, two 64-bit data
buses, and two 64-bit store data buses. With a 32-bit address bus, the total memory
space is 2^32 = 4 GB, including four external memory spaces: CE0, CE1, CE2, and
CE3. Figure 3.1 shows a functional block diagram of the C6711 processor included
with CCS.
FIGURE 3.1. Functional block diagram of TMS320C6x (Courtesy of Texas Instruments).

Independent memory banks on the C6x allow for two memory accesses within
one instruction cycle. Two independent memory banks can be accessed using two
independent buses. Since internal memory is organized into memory banks, two
load or two store instructions can be performed in parallel. No conflict results if
the data accessed are in different memory banks. Separate buses for program, data,
and direct memory access (DMA) allow the C6x to perform concurrent program
fetches, data read and write, and DMA operations. With data and instructions
residing in separate memory spaces, concurrent memory accesses are possible. The
C6x has a byte-addressable memory space. Internal memory is organized as
separate program and data memory spaces, with two 32-bit internal ports (two 64-bit
ports with the C64x) to access internal memory.
The C6711 on the DSK includes 72 kB of internal memory, which starts at
0x00000000, and 16 MB of external SDRAM, mapped through CE0 starting at
0x80000000. The DSK also includes 128 kB of Flash memory onboard, starting at
0x90000000. A two-level internal memory block diagram is shown in Figure 3.2,
included with CCS [7]. Table 3.1 shows the memory map. A schematic diagram of
the DSK is included with CCS (C6711dsk_schematics.pdf).
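
The address ranges named above can be turned into a simple lookup. The Python sketch below is an illustrative aid, not part of CCS or any TI tool: it copies only the internal RAM and CE0–CE3 rows of Table 3.1, and the function name and the fallback string are inventions for this example.

```python
# (start, end, description) for a few rows of Table 3.1; other rows are omitted.
REGIONS = [
    (0x00000000, 0x0000FFFF, "Internal RAM (L2)"),
    (0x80000000, 0x8FFFFFFF, "External memory interface CE0"),
    (0x90000000, 0x9FFFFFFF, "External memory interface CE1"),
    (0xA0000000, 0xAFFFFFFF, "External memory interface CE2"),
    (0xB0000000, 0xBFFFFFFF, "External memory interface CE3"),
]


def region_of(address):
    """Return the description of the memory block that contains `address`."""
    for start, end, description in REGIONS:
        if start <= address <= end:
            return description
    return "not listed in this sketch"


if __name__ == "__main__":
    for address in (0x00000000, 0x80000000, 0x90000000, 0x01800000):
        print(f"0x{address:08X}: {region_of(address)}")
```

The lookup places the SDRAM base address 0x80000000 in CE0, as the text states, and the Flash base address 0x90000000 in the CE1 range of the table.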
FIGURE 3.2. Internal memory block diagram (Courtesy of Texas Instruments).

TABLE 3.1 Memory Map Summary

Address Range (Hex)     Size (Bytes)   Description of Memory Block
0000 0000–0000 FFFF     64K            Internal RAM (L2)
0001 0000–017F FFFF     24M–64K        Reserved
0180 0000–0183 FFFF     256K           Internal configuration bus EMIF registers
0184 0000–0187 FFFF     256K           Internal configuration bus L2 control registers
0188 0000–018B FFFF     256K           Internal configuration bus HPI register
018C 0000–018F FFFF     256K           Internal configuration bus McBSP 0 registers
0190 0000–0193 FFFF     256K           Internal configuration bus McBSP 1 registers
0194 0000–0197 FFFF     256K           Internal configuration bus timer 0 registers
0198 0000–019B FFFF     256K           Internal configuration bus timer 1 registers
019C 0000–019F FFFF     256K           Internal configuration bus interrupt selector registers
01A0 0000–01A3 FFFF     256K           Internal configuration bus EDMA RAM and registers
01A4 0000–01FF FFFF     6M–256K        Reserved
0200 0000–0200 0033     52             QDMA registers
0200 0034–2FFF FFFF     736M–52        Reserved
3000 0000–3FFF FFFF     256M           McBSP 0/1 data
4000 0000–7FFF FFFF     1G             Reserved
8000 0000–8FFF FFFF     256M           External memory interface CE0
9000 0000–9FFF FFFF     256M           External memory interface CE1
A000 0000–AFFF FFFF     256M           External memory interface CE2
B000 0000–BFFF FFFF     256M           External memory interface CE3
C000 0000–FFFF FFFF     1G             Reserved

Source: Courtesy of Texas Instruments [7].

With a clock of 150 MHz onboard the DSK, one can ideally achieve two multiplies
and accumulates per cycle, for a total of 300 million multiplies and accumulates
(MACs) per second. With six of the eight functional units in Figure 3.1 (not
the .D units described below) capable of handling floating-point operations, it is
possible to perform 900 million floating-point operations per second (MFLOPS).
Operating at 150 MHz, this translates to 1200 million instructions per second (MIPS)
with a 6.67-ns instruction cycle time.
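
The peak figures quoted above are products of the clock rate and a per-cycle count. The Python sketch below restates that arithmetic. The variable names are invented, and reading the 1200 MIPS figure as eight instructions per cycle (one per functional unit) is an interpretation of the text, not a statement taken from it.

```python
CLOCK_MHZ = 150            # DSK clock
MACS_PER_CYCLE = 2         # ideal multiplies-and-accumulates per cycle
FLOAT_UNITS = 6            # functional units that handle floating point
FUNCTIONAL_UNITS = 8       # assumed: one instruction per unit per cycle


def peak_rates(clock_mhz):
    """Return peak (MMACs, MFLOPS, MIPS, cycle time in ns) for a given clock."""
    mmacs = MACS_PER_CYCLE * clock_mhz
    mflops = FLOAT_UNITS * clock_mhz
    mips = FUNCTIONAL_UNITS * clock_mhz
    cycle_ns = 1000.0 / clock_mhz
    return mmacs, mflops, mips, cycle_ns


if __name__ == "__main__":
    mmacs, mflops, mips, cycle_ns = peak_rates(CLOCK_MHZ)
    print(f"{mmacs} MMACs, {mflops} MFLOPS, {mips} MIPS, {cycle_ns:.2f} ns per cycle")
```

These are ideal upper bounds; sustained rates depend on how many units the code keeps busy in each cycle.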

3.3 FUNCTIONAL UNITS

The CPU consists of eight independent functional units divided into two data paths
A and B, as shown in Figure 3.1. Each path has a unit for multiply operations (.M),
for logical and arithmetic operations (.L), for branch, bit manipulation, and
arithmetic operations (.S), and for loading/storing and arithmetic operations (.D).
The .S and .L units are for arithmetic, logical, and branch instructions. All data
transfers make use of the .D units.
The arithmetic operations, such as subtract or add (SUB or ADD), can be per-
formed by all the units except the .M units (one from each data path). The eight
functional units consist of four floating/fixed-point ALUs (two .L and two .S), two
fixed-point ALUs (.D units), and two floating/fixed-point multipliers (.M units).
Each functional unit can read directly from or write directly to the register file
within its own path. Each path includes a set of sixteen 32-bit registers, A0 through
A15 and B0 through B15. Units ending in 1 write to register file A, and units ending
in 2 write to register file B.
Two cross-paths (1x and 2x) allow functional units from one data path to access
a 32-bit operand from the register file on the opposite side. There can be a maximum
of two cross-path source reads per cycle. There are 32 general-purpose registers,
but some of them are reserved for specific addressing or are used for conditional
instructions.

3.4 FETCH AND EXECUTE PACKETS

The VelociTI architecture, introduced by TI, is derived from the VLIW architecture.
An execute packet (EP) consists of a group of instructions that can be
executed in parallel within the same cycle time. The number of EPs within a fetch
packet (FP) can vary from one (with eight parallel instructions) to eight (with no
parallel instructions). The VLIW architecture was modified to allow more than one
EP to be included within an FP.
The least significant bit of every 32-bit instruction is used to determine if the next
or subsequent instruction belongs in the same EP (if 1) or is part of the next EP
(if 0). Consider an FP with three EPs: EP1, with two parallel instructions, and EP2
and EP3, each with three parallel instructions, as follows:

Instruction A
|| Instruction B

Instruction C
|| Instruction D
|| Instruction E

Instruction F
|| Instruction G
|| Instruction H

EP1 contains the two parallel instructions A and B; EP2 contains the three parallel
instructions C, D, and E; and EP3 contains the three parallel instructions F, G,
and H. The FP would be as shown in Figure 3.3. Bit 0 (the LSB) of each 32-bit
instruction is the “p” bit, which signals whether the instruction is in parallel with
the subsequent instruction. For example, the “p” bit of instruction B is zero, denoting
that it is not within the same EP as the subsequent instruction C. Similarly,
instruction E is not within the same EP as instruction F.
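
The grouping rule for the "p" bit can be sketched in a few lines of Python. This is only an illustration of the rule described above, not TI's dispatch logic; the function name and the (name, p_bit) tuple representation are assumptions made for the example.

```python
def split_execute_packets(fetch_packet):
    """Group a fetch packet into execute packets using each p bit.

    fetch_packet is a list of (name, p_bit) tuples in program order.
    p_bit == 1: the NEXT instruction runs in parallel with this one.
    p_bit == 0: this instruction closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:
        # Defensive only: a trailing p = 1 with no following instruction.
        packets.append(current)
    return packets


# The fetch packet from the text: EP1 = A||B, EP2 = C||D||E, EP3 = F||G||H.
fp = [("A", 1), ("B", 0),
      ("C", 1), ("D", 1), ("E", 0),
      ("F", 1), ("G", 1), ("H", 0)]
print(split_execute_packets(fp))
```

With the p bits listed, the eight instructions fall into the three execute packets described in the text.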

FIGURE 3.3. One fetch packet with three execute packets, showing the “p” bit of each
instruction.

3.5 PIPELINING

Pipelining is a key feature in a digital signal processor to get parallel instructions
working properly, requiring careful timing. There are three stages of pipelining:
program fetch, decode, and execute.

1. The program fetch stage is composed of four phases:
(a) PG: program address generate (in the CPU) to fetch an address
(b) PS: program address send (to memory) to send the address
(c) PW: program address ready wait (memory read) to wait for data
(d) PR: program fetch packet receive (at the CPU) to read opcode from
memory
2. The decode stage is composed of two phases:
(a) DP: to dispatch all the instructions within an FP to the appropriate
functional units
(b) DC: instruction decode
3. The execute stage is composed of between six phases (with fixed point) and 10
phases (with floating point), due to delays (latencies) associated with the
following instructions:
(a) Multiply instruction, which consists of two phases due to one delay
(b) Load instruction, which consists of five phases due to four delays
(c) Branch instruction, which consists of six phases due to five delays

Table 3.2 shows the pipeline phases, and Table 3.3 shows the pipelining effects.
The first row in Table 3.3 represents cycles 1, 2, . . . , 12. Each subsequent row
represents an FP. The entries PG, PS, . . . in each row show the phases associated
with that FP. The program address generate (PG) phase of the first FP starts in
cycle 1, the PG of the second FP starts in cycle 2, and so on. Each FP takes four
phases for program fetch and two phases for decoding. However, the execute stage
can take from 1 to 10 phases (not all execute phases are shown in Table 3.3). We are
assuming that each FP contains one execute packet (EP).
For example, at cycle 7, while the instructions in the first FP are in the first
execute phase E1 (which may be the only one), the instructions in the second FP are
in the decoding phase, the instructions in the third FP are in the dispatching phase,
and so on. All seven FPs are proceeding through the various phases. Therefore,
at cycle 7, “the pipeline is full.”
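
The staggered pattern described above can be reproduced with a short Python sketch. It assumes an ideal pipeline with no stalls and one execute packet per fetch packet, and it models only phases up to E6; the names PHASES, phase_at, and pipeline_snapshot are made up for this illustration.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC",
          "E1", "E2", "E3", "E4", "E5", "E6"]


def phase_at(fp_index, cycle):
    """Phase occupied by fetch packet fp_index at a given clock cycle.

    Both arguments are 1-based. Fetch packet k enters PG in cycle k and
    advances one phase per cycle (no stalls assumed). Returns None if the
    packet has not started yet or has left the modeled phases.
    """
    step = cycle - fp_index
    if 0 <= step < len(PHASES):
        return PHASES[step]
    return None


def pipeline_snapshot(cycle, n_fps=7):
    """Map each fetch packet number to its phase in the given cycle."""
    return {k: phase_at(k, cycle) for k in range(1, n_fps + 1)}


print(pipeline_snapshot(7))
```

For cycle 7 the snapshot shows the first FP in E1, the second in DC, the third in DP, and so on down to the seventh in PG, which is the "pipeline full" condition.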

TABLE 3.2 Pipeline Phases

Program Fetch      Decode     Execute
PG  PS  PW  PR     DP  DC     E1–E6 (E1–E10 for double precision)

TABLE 3.3 Pipelining Effects

Clock Cycle
1   2   3   4   5   6   7   8   9   10  11  12

PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5  E6
    PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
        PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
            PG  PS  PW  PR  DP  DC  E1  E2  E3
                PG  PS  PW  PR  DP  DC  E1  E2
                    PG  PS  PW  PR  DP  DC  E1
                        PG  PS  PW  PR  DP  DC

Most instructions have one execute phase. Instructions such as multiply (MPY),
load (LDH/LDW), and branch (B) take two, five, and six phases, respectively. Addi-
tional execute phases are associated with floating-point and double-precision types
of instructions, which can take up to 10 phases. For example, the double-precision
multiply operation (MPYDP), available on the C67x, has nine delay slots, so that the
execution phase takes a total of 10 phases.
The functional unit latency, which represents the number of cycles that an instruction
ties up a functional unit, is 1 for all instructions except double-precision
instructions, available with the floating-point C67x. Functional unit latency is
different from a delay slot. For example, the instruction MPYDP has a functional unit
latency of four but nine delay slots. This implies that no other instruction can use
the associated multiply functional unit for four cycles. A store has no delay slot but
finishes its execution in the third execution phase of the pipeline.
If the outcome of a multiply instruction such as MPY is used by a subsequent
instruction, a NOP (no operation) must be inserted after the MPY instruction for the
pipelining to operate properly. Four or five NOPs are to be inserted in case an instruc-
tion uses the outcome of a load or a branch instruction, respectively.
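
The delay-slot figures quoted in this section can be collected into a small lookup, sketched here in Python. The table holds only the values stated in the text; the helper name nops_needed and the independent_cycles parameter (cycles already filled with unrelated useful instructions) are assumptions added for illustration.

```python
# Delay slots as stated in the text (C67x for MPYDP).
DELAY_SLOTS = {"MPY": 1, "LDH": 4, "LDW": 4, "B": 5, "MPYDP": 9}


def nops_needed(mnemonic, independent_cycles=0):
    """NOP cycles to insert before an instruction that depends on the result.

    Instructions not listed are treated as single-cycle (zero delay slots).
    independent_cycles is the number of cycles already occupied by
    unrelated instructions placed after the producing instruction.
    """
    slots = DELAY_SLOTS.get(mnemonic, 0)
    return max(0, slots - independent_cycles)


for op in ("MPY", "LDW", "B", "MPYDP", "ADD"):
    print(op, nops_needed(op))
```

The second argument reflects a common scheduling practice: useful independent instructions placed in the delay slots reduce the number of NOPs that would otherwise be required.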

3.6 REGISTERS

Two sets of register files, each set with 16 registers, are available: register file A (A0
through A15) and register file B (B0 through B15). Registers A1, A2, B0, B1, and
B2 are used as conditional registers. Registers A4 through A7 and B4 through B7
are used for circular addressing. Registers A0 through A9 and B0 through B9
(except B3) are temporary registers. Any of the registers A10 through A15 and
B10 through B15 used are saved and later restored before returning from a
subroutine.
A 40-bit data value can be contained across a register pair. The 32 least signifi-
cant bits (LSBs) are stored in the even register (e.g., A2) and the remaining 8 bits
are stored in the 8 LSBs of the next-upper (odd) register (A3). A similar scheme is
used to hold a 64-bit double-precision value within a pair of registers (even
and odd).
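
The register-pair layout for a 40-bit value can be illustrated with a short Python sketch. It treats the value as an unsigned bit pattern and ignores sign handling; the function names are hypothetical.

```python
def split_40bit(value):
    """Split a 40-bit bit pattern into (even_register, odd_register) contents."""
    if not 0 <= value < (1 << 40):
        raise ValueError("value does not fit in 40 bits")
    even = value & 0xFFFFFFFF      # 32 LSBs go to the even register (e.g., A2)
    odd = (value >> 32) & 0xFF     # remaining 8 bits go to the 8 LSBs of the odd register (A3)
    return even, odd


def join_40bit(even, odd):
    """Rebuild the 40-bit pattern from the register pair."""
    return ((odd & 0xFF) << 32) | (even & 0xFFFFFFFF)


even, odd = split_40bit(0x123456789A)
print(hex(even), hex(odd))
```

Joining the pair with join_40bit returns the original 40-bit pattern.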
These 32 registers are considered as general-purpose registers. Several special-
purpose registers are also available for control and interrupts: for example, the
address mode register (AMR) used for circular addressing and interrupt control
registers, as shown in Appendix B.

3.7 LINEAR AND CIRCULAR ADDRESSING MODES

Addressing modes determine how one accesses memory. They specify how data are
accessed, such as retrieving an operand indirectly from a memory location. Both
linear and circular modes of addressing are supported. The most commonly used
mode is the indirect addressing of memory.

3.7.1 Indirect Addressing

Indirect addressing can be used with or without displacement. Register R represents
one of the 32 registers A0 through A15 and B0 through B15 that can specify
or point to memory addresses. As such, these registers are pointers. Indirect
addressing mode uses a “*” in conjunction with one of the 32 registers. To illustrate,
consider R as an address register.

1. *R. Register R contains the address of a memory location where a data value
is stored.
2. *R++(d). Register R contains the memory address (location). After the
memory address is used, R is postincremented (modified), such that the new
address is the current address offset by the displacement value d. If d = 1 (by
default), the new address is R + 1, or R is incremented to the next-higher
address in memory. A double minus (--) instead of a double plus would
update or postdecrement the address to R - d.
3. *++R(d). The address is preincremented or offset by d, such that the current
address is R + d. A double minus would predecrement the memory address
so that the current address is R - d.
4. *+R(d). The address is offset by d, such that the address used is R + d (as in
the preceding case). However, in this case R itself is not updated or
modified.
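
The difference between these forms is easiest to see by tracking both the address used and the final pointer value. The Python sketch below models them on a dictionary-based memory. It is an illustration only: it ignores the scaling of the displacement by the data size, and the function name and mode strings are chosen for the example.

```python
def indirect_load(regs, mem, mode, r, d=1):
    """Load through pointer register r and update it per the addressing mode.

    mode is one of "*R", "*R++", "*R--", "*++R", "*--R", "*+R".
    Returns the value read from memory.
    """
    base = regs[r]
    if mode == "*R":
        addr, new = base, base            # plain indirect, pointer unchanged
    elif mode == "*R++":
        addr, new = base, base + d        # use R, then postincrement
    elif mode == "*R--":
        addr, new = base, base - d        # use R, then postdecrement
    elif mode == "*++R":
        addr, new = base + d, base + d    # preincrement, pointer updated
    elif mode == "*--R":
        addr, new = base - d, base - d    # predecrement, pointer updated
    elif mode == "*+R":
        addr, new = base + d, base        # offset only, pointer NOT updated
    else:
        raise ValueError(f"unknown mode {mode!r}")
    regs[r] = new
    return mem[addr]


mem = {100: 11, 101: 22, 102: 33}
regs = {"A4": 100}
print(indirect_load(regs, mem, "*R++", "A4"), regs["A4"])
print(indirect_load(regs, mem, "*++R", "A4"), regs["A4"])
regs["A4"] = 100
print(indirect_load(regs, mem, "*+R", "A4", d=2), regs["A4"])
```

Note that "*++R" and "*+R" read from the same address when given the same R and d; they differ only in whether R keeps the new address afterward.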

3.7.2 Circular Addressing

Circular addressing is used to create a circular buffer. This buffer is created in hard-
ware and is very useful in several DSP algorithms, such as in digital filtering or
correlation algorithms where data need to be updated. An example in Chapter 4
illustrates the implementation of a digital filter using a circular buffer to update the
“delay” samples.
The C6x has dedicated hardware to allow a circular type of addressing. This
addressing mode can be used in conjunction with a circular buffer to update samples
without the overhead of physically shifting the data. As a pointer
reaches the end or “bottom” location of a circular buffer that contains the last
element in the buffer, and is then incremented, the pointer automatically wraps
around to point to the beginning or “top” location of the buffer that contains the
first element.
Two independent circular buffers are available using BK0 and BK1 within the
address mode register (AMR), as shown in Appendix B. The eight registers A4
through A7 and B4 through B7, in conjunction with the two .D units, can be used
as pointers (all registers can be used for linear addressing). The following code
segment illustrates the use of a circular buffer using register B2 (only side B can be
used) to set the appropriate values within AMR:

MVK .S2 0x0004,B2 ;lower 16 bits to B2. Select A5 as pointer
MVKLH .S2 0x0005,B2 ;upper 16 bits to B2. Select BK0, set N = 5
MVC .S2 B2,AMR ;move 32 bits of B2 to AMR

The two move instructions MVK and MVKLH (using the .S unit) move 0x0004
into the 16 LSBs of register B2 and 0x0005 into the 16 MSBs of B2. The MVC
instruction (move between the control register file and the general-purpose register
file) is the only instruction that can access the AMR and the other control registers
(shown in Appendix B); it executes only on the B side, using the functional units
and registers of side B. A 32-bit value is created in B2, which is then transferred to
AMR with the MVC instruction [6].
The value 0x0004 = (0100)b into the 16 LSBs of AMR sets bit 2 (third bit)
to 1 and all other bits to zero. This sets the mode to 01 and selects register A5 as
the pointer to a circular buffer using block BK0.
Table 3.4 shows the modes associated with registers A4 through A7 and B4
through B7. The value 0x0005 = (0101)b into the 16 MSBs of AMR sets bits 16
and 18 to 1 (other bits to zero). This corresponds to the value of N used to select
the size of the buffer as 2^(N+1) = 64 bytes using BK0. For example, if a buffer size of
128 is desired using BK0, the upper 16 bits of AMR are set to (0110)b = 0x0006.
If assembly code is used for the circular buffer, as execution returns to a calling C
function, AMR needs to be reinitialized to the default linear mode. Hence the
pointer’s address must be saved.
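
The AMR value and the wrap-around behavior can be sketched in Python. The field positions below are assumptions consistent with the text (2-bit mode fields for A4 through A7 and B4 through B7 in the lower 16 bits, with the BK0 size field starting at bit 16 and BK1 assumed to start at bit 21); the wrap model assumes the buffer is aligned to its size. Check Appendix B or the TI reference guide before relying on the layout. All names are illustrative.

```python
# Assumed bit position of each pointer register's 2-bit mode field in AMR.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}
LINEAR, CIRC_BK0, CIRC_BK1 = 0b00, 0b01, 0b10
BLOCK_SHIFT = {CIRC_BK0: 16, CIRC_BK1: 21}   # BK1 position is an assumption


def amr_value(pointer, mode, n):
    """Build an AMR value selecting one pointer register and block size field N."""
    value = mode << MODE_SHIFT[pointer]
    if mode in BLOCK_SHIFT:
        value |= (n & 0x1F) << BLOCK_SHIFT[mode]
    return value


def buffer_size_bytes(n):
    """Circular buffer size selected by N, per the 2^(N+1) rule."""
    return 2 ** (n + 1)


def circular_increment(addr, step, n):
    """Advance addr by step bytes, wrapping inside a size-aligned block."""
    size = buffer_size_bytes(n)
    base = addr & ~(size - 1)                 # start of the aligned block
    return base | ((addr + step) & (size - 1))


print(hex(amr_value("A5", CIRC_BK0, 5)), buffer_size_bytes(5))
print(hex(circular_increment(0x1000 + 60, 4, 5)))
```

With A5, mode 01, and N = 5 the sketch yields 0x50004, the same 32-bit value that the MVK/MVKLH pair builds in B2, and a pointer stepped past the last word of the 64-byte block returns to the start of the block.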

TABLE 3.4 AMR Mode and Description

Mode    Description
0 0     For linear addressing (default on reset)
0 1     For circular addressing using BK0
1 0     For circular addressing using BK1
1 1     Reserved

3.8 TMS320C6x INSTRUCTION SET

3.8.1 Assembly Code Format

An assembly code line is represented by the following fields:

Label || [ ] Instruction Unit Operands ;comments

A label, if present, represents a specific address or memory location that contains
an instruction or data. The label must be in the first column. The parallel bars (||)
are there if the instruction is being executed in parallel with the previous
instruction. The subsequent field, in square brackets, is optional and makes the
associated instruction conditional. Five of the registers (A1, A2, B0, B1, and B2)
are available to use as conditional registers. For example, [A2] specifies that the
associated instruction executes if A2 is not zero. On the other hand, with [!A2], the
associated instruction executes if A2 is zero. All C6x instructions can be made
conditional on whether one of the registers A1, A2, B0, B1, and B2 is zero or
nonzero. The instruction field can be either an assembler directive or a mnemonic.
An assembler directive is a command for the assembler. For example,

.word value

reserves 32 bits in memory and fills them with the specified value. A mnemonic is an
actual instruction that executes at run time. The instruction (mnemonic or assembler
directive) cannot start in column 1. The Unit field, which can be one of the
eight CPU units, is optional. Comments starting in column 1 can begin with either
an asterisk or a semicolon, whereas comments starting in any other column must
begin with a semicolon.
Code for the floating-point processors C3x/C4x is not compatible with code for
the fixed-point processors C1x, C2x, and C5x/C54x. However, the code for the fixed-
point C62x is compatible with the code for the floating-point C67x. C62x code is
actually a subset of C67x code. Additional instructions to handle double-precision
and floating-point operations are available only on the C67x processor (some addi-
tional instructions are also available on the fixed-point C64x processor).

RISC AND CISC

General [i]

1. The dominant architecture in the PC market, the Intel IA-32, belongs to the
Complex Instruction Set Computer (CISC) design. The obvious reason for this
classification is the “complex” nature of its Instruction Set Architecture (ISA). The
motivation for designing such complex instruction sets is to provide an instruction set
that closely supports the operations and data structures used by Higher-Level
Languages (HLLs). However, the side effects of this design effort are far too serious
to ignore.

Addressing Modes in CISC

2. The decision of CISC processor designers to provide a variety of addressing
modes leads to variable-length instructions. For example, instruction length
increases if an operand is in memory as opposed to in a register.
a. This is because we have to specify the memory address as part of
instruction encoding, which takes many more bits.
b. This complicates instruction decoding and scheduling. The side effect
of providing a wide range of instruction types is that the number of
clocks required to execute instructions varies widely.
c. This again leads to problems in instruction scheduling and pipelining.

Evolution of RISC [ii]

3. For these and other reasons, in the early 1980s designers started looking at
simple ISAs. Because these ISAs tend to produce instruction sets with far fewer
instructions, they coined the term Reduced Instruction Set Computer (RISC). Even
though the main goal was not to reduce the number of instructions, but the
complexity, the term has stuck.
4. There is no precise definition of what constitutes a RISC design. However, we
can identify certain characteristics that are present in most RISC systems.
a. We identify these RISC design principles after looking at why the
designers took the route of CISC in the first place.
b. Because CISC and RISC have their advantages and disadvantages,
modern processors take features from both classes. For example, the
PowerPC, which follows the RISC philosophy, has quite a few complex
instructions.

Figure 1 Typical RISC Architecture based Machine - Instruction phase overlapping

Definition of RISC [iii]

5. RISC, or Reduced Instruction Set Computer, is a type of microprocessor
architecture that utilizes a small, highly-optimized set of instructions, rather than a
more specialized set of instructions often found in other types of architectures.
a. Evolution/History. The first RISC projects came from IBM, Stanford,
and UC-Berkeley in the late 70s and early 80s. The IBM 801, Stanford
MIPS, and Berkeley RISC 1 and 2 were all designed with a similar
philosophy which has become known as RISC. Certain design features
have been characteristic of most RISC processors:
(1) One Cycle Execution Time. RISC processors have a CPI
(clocks per instruction) of one cycle. This is due to the
optimization of each instruction on the CPU and a technique
called pipelining;
(2) Pipelining. A technique that allows for simultaneous execution
of parts, or stages, of instructions to more efficiently process
instructions;
(3) Large Number of Registers. The RISC design philosophy
generally incorporates a larger number of registers to prevent
large amounts of interaction with memory.

Figure 2 Advanced RISC Machine (ARM)



Non-RISC Design or Pre-RISC Design [iv]

6. In the early days of the computer industry, programming was done in
assembly language or machine code, which encouraged powerful and easy to use
instructions. CPU designers therefore tried to make instructions that would do as
much work as possible. With the advent of higher level languages, computer
architects also started to create dedicated instructions to directly implement certain
central mechanisms of such languages.

Figure 3 Typical CISC Architecture – Stack Design

7. Another general goal was to provide every possible addressing mode for
every instruction, known as orthogonality, to ease compiler implementation.
Arithmetic operations could therefore often have results as well as operands directly
in memory (in addition to register or immediate).
8. The attitude at the time was that hardware design was more mature than
compiler design so this was in itself also a reason to implement parts of the
functionality in hardware and/or microcode rather than in a memory constrained
compiler (or its generated code) alone. This design philosophy became retroactively
termed Complex Instruction Set Computer (CISC) after the RISC philosophy came
onto the scene.
9. An important force encouraging complexity was very limited main memories
(on the order of kilobytes). It was therefore advantageous for the density of
information held in computer programs to be high, leading to features such as highly
encoded, variable length instructions that do data loading as well as calculation.
These issues were of higher priority than the ease of decoding such instructions.
10. An equally important reason was that main memories were quite slow (a
common type was ferrite core memory); by using dense information packing, one
could reduce the frequency with which the CPU had to access this slow resource.
Modern computers face similar limiting factors: main memories are slow compared to
the CPU and the fast cache memories employed to overcome this are instead limited
in size. This may partly explain why highly encoded instruction sets have proven to
be as useful as RISC designs in modern computers.

Typical Characteristics of RISC Architecture [v]

11. Designers make choices based on the available technology. As the
technology, both hardware and software, evolves, design choices also evolve.
Furthermore, as we get more experience in designing processors, we can design
better systems. The RISC proposal was a response to the changing technology and
the accumulation of knowledge from the CISC designs. CISC processors were
designed to simplify compilers and to improve performance under constraints such
as small and slow memories. The important observations that motivated designers to
consider alternatives to CISC designs were
a. Simple Instructions. The designers of CISC architectures anticipated
extensive use of complex instructions because they close the semantic
gap. In reality, it turns out that compilers mostly ignore these
instructions. Several empirical studies have shown that this is the case.
One reason for this is that different high-level languages use different
semantics. For example, the semantics of the C for loop is not exactly
the same as that in other languages. Thus, compilers tend to
synthesize the code using simpler instructions.
b. Few Data Types. CISC ISA tends to support a variety of data
structures, from simple data types such as integers and characters to
complex data structures such as records and structures. Empirical data
suggest that complex data structures are used relatively infrequently.
Thus, it is beneficial to design a system that supports a few simple data
types efficiently and from which the missing complex data types can be
synthesized.
c. Simple Addressing Modes. CISC designs provide a large number of
addressing modes. The main motivations are
(1) To support complex data structures and
(2) To provide flexibility to access operands.
(a) Problems Caused. Although this allows flexibility, it also
introduces problems. First, it causes variable instruction
execution times, depending on the location of the
operands.
(b) Second, it leads to variable-length instructions. For
example, the IA-32 instruction length can range from 1 to
15 bytes. Variable instruction lengths lead to inefficient
instruction decoding and scheduling.
d. Identical General Purpose Registers. Any register can be
used in any context, which simplifies compiler design (although normally
there are separate floating point registers).
e. Harvard Architecture Based. RISC designs are also more likely to
feature a Harvard memory model, where the instruction stream and the
data stream are conceptually separated; this means that modifying the
memory where code is held might not have any effect on the
instructions executed by the processor (because the CPU has a
separate instruction and data cache), at least until a special
synchronization instruction is issued. On the upside, this allows both
caches to be accessed simultaneously, which can often improve
performance.

RISC VS CISC – An Example [vi]

12. The simplest way to examine the advantages and disadvantages of RISC
architecture is by contrasting it with its predecessor, CISC (Complex Instruction Set
Computers) architecture.
13. Multiplying Two Numbers in Memory. The main memory is divided into
locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. The execution
unit is responsible for carrying out all computations. However, the execution unit can
only operate on data that has been loaded into one of the six registers (A, B, C, D, E,
or F). Let's say we want to find the product of two numbers - one stored in location
2:3 and another stored in location 5:2 - and then store the product back in the
location 2:3.

Figure 4 Representation of Storage Scheme for a Generic Computer

14. The CISC Approach. The primary goal of CISC architecture is to complete a
task in as few lines of assembly as possible. This is achieved by building processor
hardware that is capable of understanding and executing a series of operations. For
this particular task, a CISC processor would come prepared with a specific
instruction (say "MUL").
a. When executed, this instruction loads the two values into separate
registers, multiplies the operands in the execution unit, and then stores
the product in the appropriate register.
b. Thus, the entire task of multiplying two numbers can be completed with
one instruction:

MUL 2:3, 5:2

c. MUL is what is known as a "complex instruction."
d. It operates directly on the computer's memory banks and does not
require the programmer to explicitly call any loading or storing
functions.
e. It closely resembles a command in a higher level language. For
instance, if we let "a" represent the value of 2:3 and "b" represent the
value of 5:2, then this command is identical to the C statement
"a = a * b."
15. Advantage. One of the primary advantages of this system is that the compiler
has to do very little work to translate a high-level language statement into assembly.
Because the length of the code is relatively short, very little RAM is required to store
instructions. The emphasis is put on building complex instructions directly into the
hardware.
16. The RISC Approach. RISC processors only use simple instructions that can
be executed within one clock cycle. Thus, the "MUL" command described above
could be divided into three separate commands:
a. "LOAD," which moves data from the memory bank to a register,
b. "PROD," which finds the product of two operands located within the
registers, and
c. "STORE," which moves data from a register to the memory banks.
d. In order to perform the exact series of steps described in the CISC
approach, a programmer would need to code four lines of assembly:

LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
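
The two approaches can be compared on a toy model of the generic computer from Figure 4. The Python sketch below is purely illustrative: the ToyMachine class, its method names, and the sample values 6 and 7 are assumptions, and the instruction counter stands in for code size rather than execution time.

```python
class ToyMachine:
    """Toy model: memory cells addressed as (row, column), registers A-F,
    and a counter of instructions issued."""

    def __init__(self, memory):
        self.mem = dict(memory)
        self.reg = {name: 0 for name in "ABCDEF"}
        self.issued = 0

    # RISC-style primitives: each performs one simple operation.
    def load(self, r, loc):
        self.reg[r] = self.mem[loc]
        self.issued += 1

    def prod(self, r1, r2):
        self.reg[r1] *= self.reg[r2]
        self.issued += 1

    def store(self, loc, r):
        self.mem[loc] = self.reg[r]
        self.issued += 1

    # CISC-style complex instruction: memory-to-memory multiply.
    def mul(self, dst, src):
        self.mem[dst] = self.mem[dst] * self.mem[src]
        self.issued += 1


memory = {(2, 3): 6, (5, 2): 7}   # hypothetical operand values

risc = ToyMachine(memory)
risc.load("A", (2, 3))
risc.load("B", (5, 2))
risc.prod("A", "B")
risc.store((2, 3), "A")

cisc = ToyMachine(memory)
cisc.mul((2, 3), (5, 2))

print(risc.mem[(2, 3)], risc.issued)
print(cisc.mem[(2, 3)], cisc.issued)
```

Both machines leave the same product in location 2:3; the RISC version issues four instructions to the CISC version's one, and it keeps the operands in registers A and B where later instructions can reuse them.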

17. Analysis. At first, this may seem like a much less efficient way of completing
the operation. Because there are more lines of code, more RAM is needed to store
the assembly level instructions. The compiler must also perform more work to
convert a high-level language statement into code of this form.
a. Advantage of RISC. However, the RISC strategy also brings some very
important advantages. Because each instruction requires only one
clock cycle to execute, the entire program will execute in approximately
the same amount of time as the multi-cycle "MUL" command. These
RISC "reduced instructions" require fewer transistors of hardware space
than the complex instructions, leaving more room for general purpose
registers. Because all of the instructions execute in a uniform amount
of time (i.e. one clock), pipelining is possible.
(1) Separating the "LOAD" and "STORE" instructions actually
reduces the amount of work that the computer must perform.
(2) After a CISC-style "MUL" command is executed, the processor
automatically erases the registers. If one of the operands needs
to be used for another computation, the processor must re-load
the data from the memory bank into a register. In RISC, the
operand will remain in the register until another value is loaded
in its place.
b. The following table contrasts the two architectures; the overall
advantage is then discussed on the basis of this comparison.

CISC                                                                  RISC
Emphasis on hardware                                                  Emphasis on software
Includes multi-clock complex instructions                             Single-clock, reduced instruction only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions     Register to register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per second                              Low cycles per second, large code sizes
Transistors used for storing complex instructions                     Spends more transistors on memory registers

Table 1 Comparison of CISC and RISC Architectures [vii]

18. The Performance Equation. The following equation is commonly used for
expressing a computer's performance ability:

time/program = (time/cycle) x (cycles/instruction) x (instructions/program)

a. CISC Approach. The CISC approach attempts to minimize the number
of instructions per program, sacrificing the number of cycles per
instruction.
b. RISC Approach. RISC does the opposite, reducing the cycles per
instruction at the cost of the number of instructions per program.
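
The trade-off can be made concrete with the usual form of the performance equation, in which the time per program is the product of instructions per program, cycles per instruction, and time per cycle. The Python sketch below uses hypothetical numbers (a 10 ns clock, one 4-cycle complex instruction versus four 1-cycle simple instructions) chosen only to mirror the multiplication example; they are not measurements.

```python
def time_per_program(instructions, cycles_per_instruction, ns_per_cycle):
    """Execution time in ns: instructions x cycles/instruction x time/cycle."""
    return instructions * cycles_per_instruction * ns_per_cycle


# Hypothetical figures for the multiply example.
cisc_ns = time_per_program(instructions=1, cycles_per_instruction=4, ns_per_cycle=10)
risc_ns = time_per_program(instructions=4, cycles_per_instruction=1, ns_per_cycle=10)
print(cisc_ns, risc_ns)
```

Under these assumed figures both approaches take the same time; which one wins in practice depends on how far each can push down its own factor of the product.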
19. The Overall RISC Advantage. Today, the Intel x86 is arguably the only chip
that retains CISC architecture. This is primarily due to advancements in other
areas of computer technology.
a. The price of RAM has decreased dramatically. In 1977, 1MB of DRAM
cost about $5,000.
b. By 1994, the same amount of memory cost only $6 (when adjusted for
inflation). Compiler technology has also become more sophisticated, so
that the RISC use of RAM and emphasis on software has become
ideal.

RISC Processors (Examples) [viii]

20. Digital Equipment Corporation (DEC) - Alpha [ix]. Alpha, originally known as
Alpha AXP, is a 64-bit reduced instruction set computer (RISC) instruction set
architecture (ISA) developed by Digital Equipment Corporation (DEC), designed to
replace the 32-bit VAX complex instruction set computer (CISC) ISA and its
implementations.
a. Alpha was implemented in microprocessors originally developed and
fabricated by DEC.
b. These microprocessors were most prominently used in a variety of
DEC workstations and servers, which eventually formed the basis for
almost their entire mid-to-upper-scale lineup.
c. Several third-party vendors also produced Alpha systems, including PC
form factor motherboards.

Figure 5 DEC Alpha Microprocessor developed in 1995

21. Advanced Micro Devices (AMD) 29000 [x]. The AMD 29000, often simply 29k,
was a popular family of 32-bit RISC microprocessors and microcontrollers developed
and fabricated by Advanced Micro Devices (AMD).
a. They were, for a time, the most popular RISC chips on the market,
widely used in laser printers from a variety of manufacturers.
b. In late 1995 AMD dropped development of the 29k because the design
team was transferred to support the PC side of the business and was
realigned towards the embedded 186 family of 80186 derivatives.
c. The majority of AMD's resources were then concentrated on their high-
performance, desktop x86 clones, using many of the ideas and
individual parts of the latest 29k to produce the AMD K5.

Figure 6 Advanced Micro Devices (AMD) Microprocessor 29000 Series

22. Advanced RISC Machine (ARM). The ARM is a 32-bit reduced instruction
set computer (RISC) instruction set architecture (ISA) developed by ARM Holdings.
It was known as the Advanced RISC Machine, and before that as the Acorn RISC
Machine.
a. The ARM architecture is the most widely used 32-bit ISA in terms of
numbers produced.
b. They were originally conceived as a processor for desktop personal
computers by Acorn Computers, a market now dominated by the x86
family used by IBM PC compatible computers.
c. The relative simplicity of ARM processors made them suitable for low
power applications.
d. This has made them dominant in the mobile and embedded electronics
market as relatively low cost and small microprocessors and
microcontrollers.

Figure 7 Advanced RISC Machine (ARM) Microprocessor developed by Conexant Computers

23. Atmel AVR [xi]. The AVR is a modified Harvard architecture 8-bit RISC
single-chip microcontroller (µC) which was developed by Atmel in 1996.
a. The AVR was one of the first microcontroller families to use on-chip
flash memory for program storage, as opposed to One-Time
Programmable ROM, EPROM, or EEPROM used by other
microcontrollers at the time.

Figure 8 Atmel AVR ATmega8 Microprocessor

24. Microprocessor without Interlocked Pipeline Stages (MIPS) [xii]. MIPS is a
reduced instruction set computer (RISC) instruction set architecture (ISA) developed
by MIPS Computer Systems.
a. The early MIPS architectures were 32-bit, and later versions were
64-bit.
b. Multiple revisions of the MIPS instruction set exist, including MIPS I,
MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64.
c. The current revisions are MIPS32 (for 32-bit implementations) and
MIPS64 (for 64-bit implementations).
d. MIPS32 and MIPS64 define a control register set as well as the
instruction set.

Figure 9 MIPS Microprocessor

25. Precision Architecture – Reduced Instruction Set Computer (PA-RISC) [xiii].
PA-RISC is an instruction set architecture (ISA) developed by Hewlett-Packard.
a. The design is also referred to as HP/PA for Hewlett Packard Precision
Architecture.
b. The architecture was introduced in 1986 with the HP 3000 Series 930
and HP 9000 Model 840 computers.
c. PA-RISC has been succeeded by the Itanium (originally IA-64) ISA
jointly developed by HP and Intel.
d. HP has stopped production of the HP 9000 series, but support is
planned until the year 2013.

Figure 10 HP 9000 PA-RISC Microprocessor

26. Performance Optimization with Enhanced RISC – Performance
Computing (POWER-PC) [xiv]. PowerPC is a RISC architecture created by the 1991
Apple–IBM–Motorola alliance, known as AIM.
a. It was originally intended for personal computers.
b. It is also used in high-performance processors.
c. PowerPC is largely based on IBM's earlier POWER architecture, and
retains a high level of compatibility with it.

Figure 11 PowerPC 600 Series developed by IBM Computer

27. SuperH [xv]. SuperH (SH) is a 32-bit reduced instruction set computer (RISC)
instruction set architecture (ISA) developed by Hitachi. It is implemented by
microcontrollers and microprocessors for embedded systems. Its major categories
are:
a. SH-1. Used in microcontrollers for deeply embedded applications (CD-ROM
drives, major appliances, etc.).
b. SH-2. Used in microcontrollers with higher performance requirements,
in automotive applications such as engine control units, in networking
applications, and in video game consoles such as the Sega Saturn.
The SH-2 has also found a home in many motor control applications.
c. SH-DSP. Initially developed for the mobile phone market, and used later
in many consumer applications requiring DSP performance, such as JPEG
compression.
d. SH-3. Used for mobile and handheld applications such as the Jornada;
strong in Windows CE applications and, for many years, in the car
navigation market.
e. SH-3 DSP. Used mainly in multimedia terminals and networking
applications, and also in printers and fax machines.
f. SH-4. Used wherever high performance is required, such as in car
multimedia terminals, video game consoles, or set-top boxes.
g. SH-5. Used in high-end multimedia applications.

Figure 12 SuperH Microprocessor


28. Scalable Processor Architecture (SPARC) [xvi]. SPARC is a RISC instruction
set architecture (ISA) developed by Sun Microsystems and introduced in 1986.
a. Implementations of the original 32-bit SPARC architecture were initially
designed and used for Sun's Sun-4 workstation and server systems,
replacing their earlier Sun-3 systems based on the Motorola 68000
family of processors.
b. Later SPARC processors were designed for 64-bit operation and were
used in servers produced by Sun Microsystems.

Figure 13 SPARC II Microprocessor developed by Sun

Conclusion

29. We have introduced important characteristics that differentiate RISC designs
from their CISC counterparts. CISC designs provide complex instructions and a large
number of addressing modes. The rationale for this complexity is the desire to close
the semantic gap that exists between high-level languages and machine languages.
In the early days, effective usage of processor and memory resources was
important, and complex instructions tend to minimize memory requirements.
Empirical data, however, suggested that compilers rarely use these complex
instructions; instead, they synthesize complex operations from simple instructions.
Such observations led designers to take a fresh look at processor design philosophy.
RISC principles, based on empirical studies of CISC processors, were
proposed as an alternative to CISC. Most current processor designs are based
on these RISC principles.
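
The trade-off summarized above can be illustrated with a small toy model. The sketch below is not taken from any real instruction set: the opcodes ADDM, LOAD, ADD and STORE, the register names and the memory labels are all hypothetical, chosen only to show how one CISC-style memory-to-memory instruction can be synthesized from a sequence of simple load/store-style instructions.

```python
# Toy model only: the opcode names (ADDM, LOAD, ADD, STORE) are hypothetical
# and do not correspond to any real instruction set.

def run(program, memory):
    """Execute a list of instruction tuples; return (memory, instruction_count)."""
    mem = dict(memory)   # work on a copy so the caller's memory image is unchanged
    regs = {}
    for op, *args in program:
        if op == "ADDM":            # CISC-style: mem[dst] = mem[a] + mem[b]
            dst, a, b = args
            mem[dst] = mem[a] + mem[b]
        elif op == "LOAD":          # RISC-style: register <- memory
            reg, addr = args
            regs[reg] = mem[addr]
        elif op == "ADD":           # RISC-style: register-to-register add
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]
        elif op == "STORE":         # RISC-style: memory <- register
            reg, addr = args
            mem[addr] = regs[reg]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return mem, len(program)

memory = {"x": 3, "y": 4, "z": 0}

# One complex instruction that reads two memory operands and writes a third.
cisc_program = [("ADDM", "z", "x", "y")]

# The same operation synthesized from simple instructions.
risc_program = [
    ("LOAD", "r1", "x"),
    ("LOAD", "r2", "y"),
    ("ADD", "r3", "r1", "r2"),
    ("STORE", "r3", "z"),
]

cisc_mem, cisc_count = run(cisc_program, memory)
risc_mem, risc_count = run(risc_program, memory)
```

Both programs leave the same value in memory location z. The CISC-style version needs one instruction, which reflects the memory-saving argument, while the RISC-style version needs four simple ones, each doing a single small step. Instruction count alone says nothing about execution time, so the model should be read only as an illustration of the encoding trade-off.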

References

[i] "Microprocessors From the Programmer's Perspective" by Andrew Schulman, 1990.

[ii] http://cse.stanford.edu/class/sophomore-college/projects-00/risc/whatis/index.html

[iii] Stanford sophomore students defined RISC as “a type of microprocessor
architecture that utilizes a small, highly-optimized set of instructions, rather than a
more specialized set of instructions often found in other types of architectures”.

[iv] "Guide to RISC Processors for Programmers and Engineers", Chapter 3: "RISC
Principles" by Sivarama P. Dandamudi, 2005, ISBN 978-0-387-21017-9. "the main
goal was not to reduce the number of instructions, but the complexity"

[v] http://www.cpushack.net/CPU/cpuAppendA.html

[vi] "CISC, RISC, and DSP Microprocessors" by Douglas L. Jones, 2000.

[vii] http://cse.stanford.edu/class/sophomore-college/projects-00/risc/risccisc/

[viii] http://en.wikipedia.org/wiki/RISC

[ix] http://en.wikipedia.org/wiki/DEC_Alpha

[x] http://en.wikipedia.org/wiki/AMD_29k

[xi] http://en.wikipedia.org/wiki/Atmel_AVR

[xii] http://en.wikipedia.org/wiki/MIPS_architecture

[xiii] http://en.wikipedia.org/wiki/PA-RISC

[xiv] http://en.wikipedia.org/wiki/Power_Architecture

[xv] http://en.wikipedia.org/wiki/SuperH

[xvi] http://en.wikipedia.org/wiki/SPARC
