

DSP Implementation Platform:

TMS320C6x Architecture and Software Tools

Implementing some or most components of a signal processing system on a DSP processor is often computationally more efficient. The choice of DSP processor for a signal processing system is application dependent. Many factors influence this choice, including cost, performance, power consumption, ease of use, time-to-market, and integration/interfacing capabilities.

8.1 TMS320C6x DSP

The family of TMS320C6x processors, manufactured by Texas Instruments, is built to deliver speed. These processors are designed for applications that demand a high rate of millions of instructions per second (MIPS), such as digital video. The family includes many processor versions that differ in instruction cycle time, speed, power consumption, memory, peripherals, packaging, and cost. For example, the fixed-point C6416-600 version operates at 600 MHz (1.67 ns cycle time), delivering a peak performance of 4800 MIPS. The floating-point C6713-225 version operates at 225 MHz (4.4 ns cycle time), delivering a peak performance of 1350 million floating-point operations per second (MFLOPS).
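The cycle time and peak MIPS figures quoted for the C6416-600 follow from simple arithmetic. The sketch below assumes that peak MIPS equals the clock rate in MHz multiplied by the number of instructions issued per cycle, taken here as eight because 4800 / 600 = 8. The function names are illustrative and do not belong to any Texas Instruments tool.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Instruction cycle time in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / clock_mhz


def peak_mips(clock_mhz: float, instructions_per_cycle: int = 8) -> float:
    """Peak MIPS, assuming every issue slot is filled on every cycle."""
    return clock_mhz * instructions_per_cycle


if __name__ == "__main__":
    # C6416-600 figures quoted in the text: 600 MHz clock.
    print(f"{cycle_time_ns(600):.2f} ns")  # 1.67 ns
    print(f"{peak_mips(600):.0f} MIPS")    # 4800 MIPS
```

A peak figure of this kind is an upper bound. Sustained throughput depends on how many issue slots the code actually fills on each cycle.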

Figure 8-1 shows a block diagram of the generic C6x architecture. The C6x central processing unit (CPU) consists of eight functional units divided into two sides, A and B. Each side has an .M unit (used for multiplication operations), an .L unit (used for logical and arithmetic operations), an .S unit (used for branch, bit manipulation, and arithmetic operations), and a .D unit (used for loading, storing, and arithmetic operations). Some instructions, such as ADD, can be executed by more than one unit. Sixteen 32-bit registers are associated with each side. All interaction with the CPU must take place through these registers.
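The remark that an instruction such as ADD can run on more than one unit can be pictured as a small lookup table. The following is a simplified model, not the TI instruction set definition. The mapping covers only the roles named above (multiplication on .M, branching on .S, loads and stores on .D, addition on the three units that do arithmetic). Names such as UNITS_FOR and candidate_units are hypothetical, and the suffix convention (1 for side A, 2 for side B) is an assumption.

```python
# Unit types present on each side of the CPU, as described in the text.
UNIT_TYPES = (".L", ".S", ".M", ".D")
# Assumed naming: side A units carry suffix 1, side B units suffix 2.
SIDES = ("1", "2")

# All eight functional units: .L1, .S1, .M1, .D1, .L2, .S2, .M2, .D2.
ALL_UNITS = [unit + side for side in SIDES for unit in UNIT_TYPES]

# Simplified, illustrative mapping from an operation to the unit types
# that can execute it. Only a handful of mnemonics are modeled.
UNITS_FOR = {
    "ADD": {".L", ".S", ".D"},  # arithmetic is available on three unit types
    "MPY": {".M"},              # multiplication only on the .M unit
    "B":   {".S"},              # branches only on the .S unit
    "LDW": {".D"},              # loads only on the .D unit
    "STW": {".D"},              # stores only on the .D unit
}


def candidate_units(op: str) -> list:
    """Return every functional unit, on both sides, that could run `op`."""
    return sorted(unit + side for unit in UNITS_FOR[op] for side in SIDES)


if __name__ == "__main__":
    print(candidate_units("ADD"))  # six candidates across both sides
    print(candidate_units("MPY"))  # ['.M1', '.M2']
```

The wider the choice of units for an operation, the more freedom a compiler or programmer has to place it alongside other instructions in the same cycle.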


[Block diagram: program RAM, data RAM, internal buses, CPU with register files A0-A15 and B0-B15 and data units .D1 and .D2, control registers, and peripherals (EMIF, DMA, serial port, host port, boot loader, power down).]

Figure 8-1: Generic C6x architecture.

As shown in Figure 8-2, the internal buses consist of a 32-bit program address bus, a 256-bit program data bus accommodating eight 32-bit instructions, two 32-bit data address buses (DA1 and DA2), two 32-bit (64-bit for C64 version) load data buses (LD1 and LD2), and two 32-bit (64-bit for the floating-point version) store data buses (ST1 and ST2). In addition, there are a 32-bit direct memory access (DMA) data bus and a 32-bit DMA address bus. The off-chip, or external, memory is accessed through a 20-bit address bus and a 32-bit data bus.
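The bus widths above translate directly into two numbers: how many instructions arrive per program fetch, and how many distinct values an address bus can carry. The sketch below is plain arithmetic on the widths quoted in the text. The constant and function names are illustrative, and whether each external address selects a byte or a word is left open.

```python
PROGRAM_DATA_BUS_BITS = 256   # width of the program data bus (from the text)
INSTRUCTION_BITS = 32         # width of one instruction (from the text)
EXTERNAL_ADDRESS_BITS = 20    # width of the external address bus (from the text)


def instructions_per_fetch(bus_bits: int = PROGRAM_DATA_BUS_BITS,
                           instruction_bits: int = INSTRUCTION_BITS) -> int:
    """Number of whole instructions delivered by one program fetch."""
    return bus_bits // instruction_bits


def address_count(address_bits: int) -> int:
    """Number of distinct values an address bus of this width can carry."""
    return 1 << address_bits


if __name__ == "__main__":
    print(instructions_per_fetch())              # 8
    print(address_count(EXTERNAL_ADDRESS_BITS))  # 1048576
```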

The peripherals on a typical C6x processor include the External Memory Interface (EMIF), DMA, Boot Loader, Multi-channel Buffered Serial Port (McBSP), Host Port Interface (HPI), Timer, and Power Down unit. The EMIF provides the necessary timing for accessing external memory. DMA moves data from one place in memory to another without interfering with CPU operation. The Boot Loader boots the code from off-chip memory or the HPI to the internal memory. The McBSP provides a high-speed multi-channel serial communication link. The HPI allows a host to access the internal memory. The Timer provides two 32-bit counters. The Power Down unit saves power during periods when the CPU is inactive.

8.1.1 Pipelined CPU

In general, executing an instruction involves three basic steps: fetching, decoding, and execution. If these steps are done serially, the resources on the processor, such as multiple buses or functional units, are not fully utilized. In order to increase throughput, DSP CPUs are designed to be pipelined.


[Bus diagram: program address and program data buses; data address (DA1, DA2), load data (LD1, LD2), and store data (ST1, ST2) buses for sides A and B, connected to .D1 and .D2; DMA read and write address and data buses.]

Figure 8-2: C6x internal buses.

This means that the foregoing steps are carried out simultaneously on successive instructions. Figure 8-3 illustrates the difference in processing time for three instructions executed on a serial (non-pipelined) CPU and on a pipelined CPU. As the figure shows, a pipelined CPU requires fewer clock cycles to complete the same number of instructions.
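The saving can be estimated with a simple cycle count. The sketch below assumes an idealized three-stage pipeline (fetch, decode, execute) in which every stage takes exactly one clock cycle and no stalls occur. A real processor pipeline is more elaborate, and the function names here are illustrative only.

```python
STAGES = ("F", "D", "E")  # fetch, decode, execute


def serial_cycles(n_instructions: int) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return n_instructions * len(STAGES)


def pipelined_cycles(n_instructions: int) -> int:
    """Cycles when a new instruction enters the pipeline on every cycle."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs len(STAGES) cycles; each later one adds one.
    return len(STAGES) + n_instructions - 1


def pipeline_chart(n_instructions: int) -> list:
    """One row per instruction; column c shows the stage occupied in cycle c."""
    total = pipelined_cycles(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = ["."] * total
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage  # instruction i starts in cycle i
        rows.append(" ".join(row))
    return rows


if __name__ == "__main__":
    print(serial_cycles(3), pipelined_cycles(3))  # 9 5
    print("\n".join(pipeline_chart(3)))
    # F D E . .
    # . F D E .
    # . . F D E
```

Under these assumptions, three instructions take nine cycles serially and five cycles when pipelined. For a long instruction stream the pipelined count approaches one cycle per instruction.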

The C6x architecture is based on the Very Long Instruction Word (VLIW) architecture. In such architectures, several instructions are captured and processed simultaneously. For more details on the TMS320C6000 architecture, the interested reader is referred to [1].
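As a rough illustration of how a group of fetched instructions can be divided into sets that run together, the sketch below uses the parallel bit (p-bit) convention of the C6000 instruction encoding: bit 0 of each 32-bit instruction indicates whether the next instruction executes in the same cycle. The instruction words in the example are made-up values chosen only for their low bit, not real opcodes, and the function name is hypothetical.

```python
def split_execute_packets(fetch_packet: list) -> list:
    """Group instruction words into sets that execute in the same cycle.

    A word whose bit 0 (the p-bit) is 1 shares a cycle with the word that
    follows it; a word whose p-bit is 0 closes the current group.
    """
    packets = []
    current = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:        # p-bit clear: the group ends here
            packets.append(current)
            current = []
    if current:                  # chain ran off the end of the input
        packets.append(current)
    return packets


if __name__ == "__main__":
    # Eight made-up words with p-bits 1, 1, 0, 0, 1, 0, 0, 0.
    words = [0x1, 0x3, 0x2, 0x4, 0x5, 0x6, 0x8, 0xA]
    print([len(p) for p in split_execute_packets(words)])  # [3, 1, 2, 1, 1]
```

With all p-bits clear the eight instructions run one per cycle. With the first seven set, all eight run in a single cycle, which corresponds to the peak rate.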

8.1.2 C64x DSP

The C64x is a more recently released DSP core in the C6x family, offering higher MIPS performance at higher clock rates. This core can operate at clock rates in the range of 300–1000 MHz, giving a processing power of 2400–8000 MIPS. The C64x speedups are achieved through many enhancements, some of which are mentioned here.

