EDN’S ANNUAL DSP DIRECTORY HIGHLIGHTS THE ARCHITECTURES AVAILABLE FOR YOUR HOTTEST DESIGNS. HERE’S HELP IN SORTING THROUGH THE MYRIAD DSP DEVICES. YOU CAN ALSO ACCESS OUR FREQUENTLY UPDATED, FEATURE-TUNED DATABASE USING OUR SEARCH ENGINE TO FIND THE RIGHT DEVICE FOR YOUR DESIGN NEEDS.
By Markus Levy, Technical Editor
I’m beginning to sound like a broken record. (Remember those vinyl platters?) Every year I begin the introduction to EDN’s DSP Directory by remarking on the tremendous growth in DSP technology, and it’s no different this year. You can judge this growth from the number of new DSP companies and the number of new DSPs. And you’ll find descriptions of all the above right here in this directory. What’s the cause of all this excitement? Cellular phones, broadband communications, medical-imaging equipment, modems, audio equipment, motor control, and tons more. You’ve probably seen the commercials from Texas Instruments about DSP technology, another testimony that DSPs are penetrating our lives. In addition, plenty of RISC processors keep popping up with DSP instruction sets. (The appearance of RISC processors with DSP instruction sets, however, raises the question “Are they really still RISCs?”) These processors include those from ARC, ARM, Improv Systems, Lexra, MIPS, Sandcraft, STMicro, Tensilica, and more (www.arccores.com, www.arm.com, www.improvsys.com, www.lexra.com, www.mips.com, www.sandcraft.com, www.st.com, www.tensilica.com). Although this directory does not cover these processors, you’ll find them in the upcoming Microprocessor Directory. There, you’ll also find the traditional RISC-DSP combos, such as Hitachi’s SH-DSP, Infineon’s TriCore, and Hyperstone’s E1 (www.hitachi.com/semiconductor, www.hyperstone.de).

I may also be repeating myself when I talk about the new benchmarks from the EDN Embedded Microprocessor Benchmark Consortium (EEMBC). The consortium has been actively working on these industry-standard benchmarks for three years, and now the benchmarks are ready to go. Starting on April 11, you can go to the EEMBC Web site at www.eembc.org and get free DSP and other processor benchmark-certified scores. These benchmarks include cascaded biquad filters, the Viterbi add-compare-select function, autocorrelation, bit allocation, and FFTs.
If you don’t find benchmark scores for your favorite DSP on EEMBC’s Web site, urge the corresponding vendor to provide its EEMBC scores. Let them tie some quantitative performance information to those multiply-accumulate units and pipeline stages. Next year, we’ll include those scores in this directory for direct comparisons.

The Motorola and Lucent Technologies joint venture has yielded the fruits of the two companies’ labors in the StarCore scalable SC140 DSP core and the MSC8101. However, we’re still anxiously awaiting a DSP architecture from the Intel/Analog Devices partnership. An introduction should happen soon. We’re also waiting to see production quantities of TI’s new 750-MHz C64x: the fastest RISC (er, DSP) on the planet.

To help you sort through the myriad DSP devices, access our frequently updated database using our feature-tuned search engine to find the right device for your design needs (www.ednmag.com/ednmag/reg/micro.asp).
60 edn | March 30, 2000
16 BITS
Analog Devices ADSP-21xx
Analog Devices TigerSharc DSP
BOPS ManArray
DSP Group cores
Equator Technologies MAP-CA
Infineon Carmel DSP 10XX and 20XX cores
LSI Logic ZSP DSPs
Lucent/Motorola StarCore SC100
Lucent Technologies DSP16xx
Lucent Technologies DSP16000
Massana FILU-200 DSP coprocessor core
Motorola DSP56800
NEC SPRX DSP
Philips REAL DSP
Texas Instruments TMS320C2000
Texas Instruments TMS320C5000
Texas Instruments TMS320C6000
3DSP SP-3 and SP-5 DSP cores
Zilog Z893x1/Z893x3

24 BITS
Motorola DSP563xx

32 BITS
Analog Devices SHARC DSP
Texas Instruments TMS320C3x
Texas Instruments TMS320C4x

64 BITS
Module Research Center’s NeuroMatrix NM6403 DSP
FOR MORE INFORMATION...
Analog Devices, www.analog.com, Circle No. 376
Billions of Operations Per Second, www.bops.com, Circle No. 377
DSP Group, www.dspg.com, Circle No. 378
Equator Technologies, www.equator.com, Circle No. 379
Infineon Technologies, www.infineon.com, Circle No. 380
LSI Logic, www.lsil.com, Circle No. 381
Lucent Technologies, www.lucent.com, Circle No. 382
Massana, www.massana.com, Circle No. 383
Module Research Center, www.module.vympel.msk.ru/, Circle No. 384
Motorola Inc, www.motorola-dsp.com, Circle No. 385
NEC Electronics Inc, www.nec.com, Circle No. 386
Philips Semiconductor, www.semiconductors.philips.com, Circle No. 387
Texas Instruments Inc, www.micro.ti.com, Circle No. 388
3DSP Corp, www.3dsp.com, Circle No. 389
Zilog, www.zilog.com, Circle No. 390

SUPER CIRCLE NUMBER: For more information on the products available from all of the vendors listed in this box, circle one number on the reader service card. Circle No. 391
Photo courtesy Texas Instruments
Analog Devices ADSP-21xx
Architecture features a 40-bit accumulator with 8 guard bits. Device offers single-cycle execution. DSP performs conditional execution of most instructions.

The ADSP-21xx family’s CPU handles general processing needs and executes all instructions in a single cycle. All of Analog Devices’ 16-bit DSPs are code-compatible, and many are also pin-compatible. All DSPs feature an algebraic programming syntax. The processor can execute multiple operations per cycle. The multiply-accumulate (MAC) unit, ALU, and barrel shifter are separate but cannot execute in parallel. Secondary registers shadow each execution unit’s registers, allowing fast context switching for interrupt processing. If you need extended precision, you can address the MAC unit’s 40-bit accumulator (includes 8 guard bits) as two 16-bit and one 8-bit register and individually copy the contents to another register.

The barrel shifter moves 16-bit inputs left or right into a 32-bit register. The shifter also includes hardware support to perform logical and arithmetic shifts in addition to exponent detection and normalization for block floating point and increasing the precision of a 16-bit DSP. Algorithms such as FFTs in which bits grow from stage to stage use block floating point. A programmer may use the shifter to convert between fixed- and floating-point numbers. The ADSP-219x DSPs expand on the architecture by providing two 40-bit accumulators and a 40-bit shifter result.

ADSP-21xx family members have X and Y data-address generators (DAGs) and separate program and data buses. Two DAGs provide addresses for simultaneous dual-operand fetches (from data and program memory). Each DAG maintains and updates four address pointers. You may associate a length value with each pointer to implement automatic modulo addressing for circular buffers. While executing from the on-chip memory, the buses feed the X- and Y-data values for each MAC cycle.
Thus, you can use program memory as data memory to hold constants for single-cycle MAC processing. The program bus is free for MAC use when the CPU executes from on-chip program memory. The dual-ported program memory allows two memory accesses in one cycle. The ADSP-219x DSP uses an instruction cache to achieve three-bus performance. For access to external memory, the ADSP-21xx has a programmable wait-state generator for zero to 15 wait states.

Analog Devices’ designers opted for a 16-bit-wide data word and a 24-bit-wide instruction word. The wider instruction word lets the device use more complex instructions and offers more flexibility than does a 16-bit operation code. For external-memory design, the different memory widths mean that if you let three 8-bit-wide memory chips share program and data, you sacrifice every third byte of the data-memory area. Analog Devices integrates as much as 2 Mbits of SRAM around its DSP core to help increase data-transfer efficiency. Many ADSP-21xxs also integrate DMA ports that connect to external hosts or external memory. These bidirectional, byte-wide ports can directly access as much as 4 Mbytes of external memory for off-chip storage of program overlays or data tables.

Addressing modes—The ADSP-21xx includes immediate, register-direct, memory-direct, and register-indirect addressing modes. The ADSP-219x adds register, indirect-postmodify, immediate-modify, and direct- and indirect-offset addressing modes. The program sequencer features internal loop counters and loop stacks, enabling looped code to execute with zero overhead. Each address generator supports as many as four circular buffers, each with three registers. The registers define the end, length, and access address. One address generator provides bit-reversed addressing. The ADSP-219x supports as many as 16 circular buffers by using a DAG shadow register and a set of base registers for additional circular-buffering flexibility.
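The modulo addressing that the DAGs perform for circular buffers can be sketched in a few lines of Python. This is a behavioral model only, not ADSP-21xx code: the names `CircularPointer`, `post_modify`, and `fir_step` are hypothetical, and the `base` field is an assumed simplification of how the hardware locates a buffer.

```python
class CircularPointer:
    """Toy model of an address pointer with an associated length value."""

    def __init__(self, base, length, index=None):
        self.base = base          # assumed start address of the buffer
        self.length = length      # buffer length used for the modulo wrap
        self.index = base if index is None else index

    def post_modify(self, step):
        """Return the current address, then advance it with wraparound."""
        addr = self.index
        self.index = self.base + (self.index - self.base + step) % self.length
        return addr


def fir_step(delay_line, coeffs, ptr, sample):
    """Compute one FIR output using a circular delay line.

    ptr must point at the oldest sample; that slot is overwritten with
    the new sample, and the pointer ends up on the next-oldest slot.
    """
    delay_line[ptr.post_modify(1)] = sample
    acc = 0
    # After the write, the pointer walks from the oldest sample to the
    # newest, so the coefficients are applied in reverse order.
    for c in reversed(coeffs):
        acc += c * delay_line[ptr.post_modify(1)]
    return acc


if __name__ == "__main__":
    coeffs = [1, 2, 3]
    delay = [0] * len(coeffs)
    ptr = CircularPointer(base=0, length=len(coeffs))
    print([fir_step(delay, coeffs, ptr, x) for x in (1, 0, 0, 0)])
```

Feeding a unit impulse through `fir_step` returns the coefficients in order, which is a quick way to confirm that the wraparound keeps the delay line aligned without any explicit bounds checks in the loop.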
Special instructions—The ADSP-21xx can conditionally execute most instructions. A do-until command establishes a sequence of instructions that can be arbitrary in length and nested four deep for repeat operations. The ADSP-219x allows as many as eight nesting levels. In addition to the standard arithmetic and logic instructions, the ALU supports division primitives. Because the ADSP-21xx is a nonpipelined machine, it incurs no penalties for jumps and calls.

Support—Analog Devices’ software- and hardware-development tools include the company’s VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices’ emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and limited but full-featured VisualDSP.
Analog Devices TigerSharc DSP

VLIW architecture executes as many as four instructions per cycle. DSP has SIMD capabilities. The first TigerSharc integrates 6 Mbits of RAM.

Designed for the telecommunications infrastructure, the TigerSharc devices use a very-long-instruction-word (VLIW) load/store architecture to execute as many as four instructions per cycle with its interlocking, eight-stage pipeline. Instruction-level parallelism is determined before runtime to support deterministic execution for real-time applications. However, with an interlocked pipeline, the processor automatically inserts stall cycles whenever the result of an operation is unavailable. The single-instruction, multiple-data (SIMD) features of this architecture allow you to perform arithmetic operations on multiple 32-bit floating-point values or multiple 8-, 16-, or 32-bit fixed-point values. The first device in this family, the ADSP-TS001, features integrated peripherals, such as 6 Mbits of SRAM, external bus, glueless multiprocessing, and link and JTAG ports.

The architecture comprises two computation blocks; each block contains a multiplier, an ALU, and a 64-bit shifter. Using both computation units, you can perform two 32×32-bit multiply-accumulates (MACs) per cycle, and, using both computation units with 16-bit data, you can perform eight 16×16-bit MACs per cycle. Each computation unit has its own register file of 32 32-bit registers. The register file is orthogonal, making it easier for programming in C. You can combine two 32-bit registers for a single 64-bit register. You can also form 128-bit destinations for the multipliers by combining four consecutive 32-bit registers. TigerSharc has no hardware modes. This simplifies high-level language programming. However, the complex architecture of TigerSharc will challenge programmers and code-generation tools to keep the pipeline and its computation units completely busy.

Another challenge will be to keep the processor fed. The architecture has two data-address generators that work with two 128-bit data buses to transfer as many as 256 bits per cycle between the computation units and memory. Users can configure three banks of on-chip memory, so you can tune program and data partitioning for an application’s needs. This approach results in 12 Gbytes/sec of on-chip bandwidth at 250 MHz. An SIMD-memory-transfer mechanism lets a single load or store instruction specify that two memory transfers must occur between two memory blocks to or from two computation units. Applications should avoid going off-chip for memory accesses, but Analog Devices has plugged 14 DMA channels into TigerSharc to facilitate this process.

Addressing modes—TigerSharc offers immediate, bit-reversed, circular-modulo, and register-direct and -indirect addressing. The address generators support data addressing, pointer updates, circular buffering, and bit reversal. You can also use the address generators for general-purpose integer computations.

Special instructions—The instruction set directly supports transformations between data types of higher and lower numerical precision, for example, going from fixed- to floating-point or 16- to 32-bit in one cycle. TigerSharc provides optional saturation for all cases, and the instruction set supports arithmetic attributes, such as signed, unsigned, integer, and fractional.

Support—Analog Devices’ software- and hardware-development tools include the company’s VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices’ emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and limited but full-featured VisualDSP.
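The TigerSharc throughput figures are easy to sanity-check with a little arithmetic. The sketch below uses a hypothetical helper, `bandwidth_gbytes_per_sec`; the reading that the 12-Gbyte/sec figure corresponds to three 128-bit transfers per cycle (for example, one per on-chip memory bank) is an assumption, not something the text states.

```python
def bandwidth_gbytes_per_sec(bus_bits, transfers_per_cycle, clock_mhz):
    """Peak bandwidth in Gbytes/sec for a number of parallel bus transfers."""
    bytes_per_cycle = bus_bits * transfers_per_cycle // 8
    # bytes/cycle * Mcycles/sec = Mbytes/sec; divide by 1000 for Gbytes/sec
    return bytes_per_cycle * clock_mhz / 1000


def macs_per_second(macs_per_cycle, clock_mhz):
    """Peak multiply-accumulate rate in MACs per second."""
    return macs_per_cycle * clock_mhz * 1_000_000


if __name__ == "__main__":
    # Two 128-bit data buses (256 bits per cycle) at 250 MHz.
    print(bandwidth_gbytes_per_sec(128, 2, 250))
    # Assumed: three 128-bit transfers per cycle at 250 MHz.
    print(bandwidth_gbytes_per_sec(128, 3, 250))
    # Eight 16x16-bit MACs per cycle at 250 MHz.
    print(macs_per_second(8, 250))
```

Under these assumptions, the two data buses alone account for 8 Gbytes/sec, and a third 128-bit path is needed to reach the quoted 12 Gbytes/sec.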
BOPS ManArray

Device provides instruction-level parallelism with indirect VLIW. DSP supports 8-, 16-, and 32-bit fixed and floating point.

The Billions of Operations Per Second (BOPS) Inc ManArray is a family of reusable and scalable DSP cores. A developer can configure each of the family members into soft-macro DSP cores that support 16- and 32-bit, fixed-point formats; 32-bit, single-precision, floating-point formats; or both. The ManArray achieves a high level of parallelism by combining an indirect-very-long-instruction-word (iVLIW) architecture with single-instruction-multiple-data (SIMD) instructions and inherent multiprocessing capability. The SIMD instructions support 8-, 16-, and 32-bit packed data and 32-bit, single-precision, floating-point formats. A single-cycle, zero-latency, interprocessor-communications fabric and direct DMA access to all processing elements enhance the ManArray’s parallelism.

The ManArray architecture comprises a sequence processor (SP), a processing element (PE), and a cluster switch (CS). The SP handles program control and combines with a PE to form the smallest increment of the ManArray architecture: a single-SP, single-PE unit. The various product configurations combine one SP and multiple PEs. The BOPS2010 core, a 1×1-element array, comprises an SP/PE combination. The BOPS2020 uses the 1×1-element array and adds a PE element through the CS to form a 1×2-element array. Likewise, the BOPS2040 is a 2×2-element array comprising one SP/PE combination and three PEs. The SP uses a 32-bit instruction set that supports both 1×1-element and N×M-element arrays and allows you to use one tool set for all combinations of the cores.

Each PE contains a multiported, 32×32-bit register file, an iVLIW-memory (VIM) unit, local data memory, and three bus interfaces. The register file logically performs as 32 32-bit registers or 16 64-bit registers supporting packed-data operations on an instruction-by-instruction basis. The five execution units comprise a multiply-accumulate unit (MAU), an ALU, a data-select unit (DSU), a 64-bit load unit, and a 64-bit store unit. The DSU supports data-manipulation instructions, such as shift, rotate, floating-point conversions, and SP-to-PE and PE-to-PE communications through the CS. The load and store units provide independent datapaths between the SP data memory and the PEs and between each PE and its local data memory. The bus interfaces include a 32-bit instruction bus, a 32-bit data bus, and a single-cycle PE-interconnect bus for moving data between the SP and PEs. Each SP/PE unit also includes the instruction and data-address-generation units.

The term “iVLIW” refers to the ManArray’s ability to indirectly access an encapsulated instruction sequence into a horizontal VLIW format that can simultaneously execute operations. You create iVLIWs with an iVLIW load VLIW (LV) instruction, which identifies as many as five programmer-defined 32-bit instructions that comprise the VLIW, as well as the VIM address in which to store the instructions. After executing the LV instruction, you issue a sequence of simple instructions that form the iVLIW. Once the SP and PEs store the iVLIW in VIM, your program can dispatch, or “broadcast,” an execute-iVLIW (XV) instruction to the SP and all PEs. The XV instruction contains the offset of the VIM address for the VLIW to execute in each PE. This nontraditional use of VLIWs effectively creates instructions for applications using 32-bit instruction paths, reducing the amount of program memory. The VIMs allow BOPS to use a single 32-bit instruction bus in the array of PEs. Using the iVLIW architecture, BOPS requires no large VLIW buses around the chip, as is common with VLIW machines, and this approach promotes scaling in both the number of PEs and the width of the iVLIWs.

The topology of the BOPS architecture allows the devices to interconnect and organize a set of PEs into standard ring, mesh, torus, hypercube, and other organizations. The organization depends on algorithmic data-flow requirements. The importance of the topology type is that the performance of any parallel algorithm depends on the efficiency of data movement on the processor and the cost of the interconnection mechanism. The iVLIW architecture allows you to overlap the communications operations with the computation operations, thereby providing zero-latency data transfers between PEs. The architecture accomplishes this task by placing the communications instructions in the DSU and using software pipelining to transfer a result that an arithmetic-execution unit calculates in the previous machine cycle to any of the directly connected ManArray PEs.

Addressing modes—The processors support array-parallel memory-addressing modes, including direct, register indirect, base plus displacement, and modulo indexed.

Special instructions—The MAU and ALU support floating-point and packed-data operations with saturation, and the DSU provides a complement of bit-manipulation, shift, rotate, and PE-to-PE-communications operations. ManArray fixed-point processors support a number of special DSP instructions, including packed-data multiply-accumulate and multiply complex data.

Support—The BOPS software-development kit comprises an integrated development environment, a Gnu-C compiler with assembler/preprocessor/linker, a cycle-accurate instruction-set simulator, a visual system simulator, and a basic DSP library. The instruction-set simulator provides views of all core resources, including the disassembly of iVLIWs in VIM and pipeline stages. BOPS also has a compiler for Matlab. You can find some demonstrations of this architecture at http://bopsnet.com/training/demos.shtml.
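The load-then-broadcast flow of the iVLIW scheme can be pictured with a small behavioral model. The sketch below is only an illustration under stated assumptions: the class `ProcessingElement`, the methods `lv` and `xv`, the helper `broadcast_xv`, and the tuple format for a slot instruction are all hypothetical and do not reflect the real ManArray instruction encoding.

```python
import operator


class ProcessingElement:
    """Toy PE holding a register file and a VLIW memory (VIM)."""

    MAX_SLOTS = 5  # as many as five 32-bit instructions per iVLIW

    def __init__(self, regs):
        self.regs = dict(regs)
        self.vim = {}

    def lv(self, vim_addr, instructions):
        """Model of LV: store a group of simple instructions at a VIM address."""
        if len(instructions) > self.MAX_SLOTS:
            raise ValueError("an iVLIW holds at most five instructions")
        self.vim[vim_addr] = list(instructions)

    def xv(self, vim_addr):
        """Model of XV: execute every slot of the stored iVLIW together."""
        before = dict(self.regs)  # all slots read the pre-execution state
        for dest, op, src_a, src_b in self.vim[vim_addr]:
            self.regs[dest] = op(before[src_a], before[src_b])


def broadcast_xv(pes, vim_addr):
    """A single short XV is sent to every PE, which looks up its own VIM."""
    for pe in pes:
        pe.xv(vim_addr)


if __name__ == "__main__":
    pes = [ProcessingElement({"r0": 2, "r1": 3}),
           ProcessingElement({"r0": 10, "r1": 4})]
    for pe in pes:
        pe.lv(0, [("r2", operator.add, "r0", "r1"),
                  ("r3", operator.mul, "r0", "r1")])
    broadcast_xv(pes, 0)
    print([pe.regs for pe in pes])
```

The point of the model is that only the short XV travels over the instruction bus at execution time; the wide instruction lives in each PE's local VIM, which is why no wide VLIW bus is needed.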
DSP Group cores

A variety of synthesizable cores is available. TeakLite is a cost-effective version. Teak is a dual-MAC core with parallelism capability. PalmDSPCore includes seven arithmetic units.

DSP Group developed PineDSPCore, OakDSPCore, TeakDSPCore, and PalmDSPCore and has 34 licenses. PineDSPCore includes a data-arithmetic-address-generator unit (DAAU), two data-memory blocks for X and Y memory, a multiplier, a 36-bit ALU, and two accumulators. OakDSPCore expands PineDSPCore capabilities by adding a bit-manipulation unit (BMU), four nesting levels of block-repeat, downloading capabilities from data-memory to program-memory space, and an automatic-boot procedure. OakDSPCore also has an indexed addressing mode and a software stack to improve its C-programming friendliness. TeakDSPCore is a family of two synthesizable cores: TeakLite and Teak; both are binary-compatible with OakDSPCore, allowing them to easily replace OakDSPCore. Teak expands on OakDSPCore by adding a second multiply-accumulate (MAC) unit, a three-input ALU, a parallel-instruction-word architecture, support for faster task switching, and an expanded instruction set.

DSP Group’s PalmDSPCore is a family of fully synthesizable cores in 16-, 20-, and 24-bit versions. PalmDSPCore is compatible with previous generations of the DSP Group cores. PalmDSPCore has seven computation units that operate in parallel. The core includes two multipliers, a bit-field-operation unit, an exponent unit, an insertion/extraction unit, and three adder-subtracter units (ASUs), one of which is a three-input unit. PalmDSPCore also contains more register resources, including an accumulator-register file, and additional addressing modes.

PineDSPCore has two data buses and one program bus. At each cycle, the three buses move X- and Y-memory data to the MAC unit from X and Y data RAMs or ROMs while the program-control unit (PCU) fetches a new instruction from program-space ROM or RAM. The X-data bus also serves as the main CPU data bus by linking the two data RAMs, the computation unit, the BMU, the DAAU, the PCU, the status registers, and the user-definable registers. Teak and PalmDSPCore use a switchbox unit rather than a main-CPU-data-bus method. Teak and PalmDSPCore incorporate two buses for moving data from X memory, two for moving data from Y memory, and one for moving data from program memory.

The multipliers for all the cores have two 16-bit input registers and take two 16-bit, signed or unsigned numbers and deliver a 32-bit 2’s complement product in one cycle. Depending on the version, the PalmDSPCore’s input registers multiplies by 16, 20, or 24 bits and yields a 32-, 40-, or 48-bit product, respectively. The shifter between the multiplier and the ALU sign-extends the product to accumulator size through the 4 guard bits (8 in Teak and PalmDSPCore). The ALU performs arithmetic/logical operations on the data operands. It also performs functions, such as step division and rounding. The accumulator optionally saturates out-of-range values as the CPU transfers them to registers or memory. Teak and PalmDSPCore include saturation mode to support bit-exact standards. The four accumulators and a set of shadow registers enable rapid context switches.

Other units include a barrel shifter and an exponent unit. The BMU has a 36-bit barrel shifter (40 bits for Teak and 40, 48, and 56 bits for PalmDSPCore) and two additional 36-bit accumulators (40 bits for Teak and 40, 48, and 56 bits for PalmDSPCore); these accumulators can also evaluate 36-bit exponents (40 bits for Teak and 40, 48, or 56 bits for PalmDSPCore). The bit-field-operation unit reads from memory, modifies, and writes back to memory, all in one cycle, bypassing the accumulator. Bypassing the accumulator not only frees a critical hardware resource, but also avoids the use of the accumulator’s high-power-consumption circuitry. PalmDSPCore’s BMU supports parity and its insert/extract unit supports packing and unpacking of bit fields into the accumulator. Cellular phones use bit-field operations to pack the bits before putting the data onto the channel.

The DAAU generates X- and Y-memory addresses for each MAC cycle and modifies the pointers after operations. It has 16-bit pointer registers for addressing: a stack pointer, a base register, and registers for modulo and step postmodification. The stack pointer references the top of the software stack for interrupts, subroutine-processing calls, and temporary variable saving. The DAAU contains an alternative bank of registers for interrupts and subroutines. It also includes two zero-overhead-loop mechanisms: a single-word instruction loop and a block repeat. You can define four (16 in PalmDSPCore) additional on-chip registers that are not part of the DSP core. These registers can be handy for application-specific hardware, such as a Viterbi accelerator.

OakDSPCore and PineDSPCore have a 64k-word X- and Y-data space and 64k-word program space. The X space and the program memory can be in internal memory, external memory, or both. Teak and PalmDSPCore split the 64k-word data space into X, Y (ROM/RAM), and Z space for peripherals. Teak uses linear program memory as large as 256k words and paging as large as 4M words. PalmDSPCore offers as much as 1M word of linear memory and 16M words of paged memory. All cores support DMA operations. All cores can operate from 1.8 to 2.7V and have power-management features to cut power dissipation. Internal control can automatically shut off unused functional units and memory.

Addressing modes—All cores support circular buffering, including modulo addressing, and direct, indirect, register, relative, and short and long immediate-addressing modes. OakDSPCore, Teak, and PalmDSPCores also support index and long direct-addressing modes. Teak and Palm can have a quadruple-indirect-addressing mode to simultaneously feed the four inputs of the two multipliers and bit-reversal addressing mode.

Special instructions—The cores support conditional subroutine call/return from a subroutine and interruptible repeat and block-repeat instructions. All cores have a 16-bit loop counter for repeating instructions or instruction blocks as many as 65,536 times. OakDSPCore allows you to nest a repeat instruction in a loop block with as many as four levels of block nesting. Teak and PalmDSPCores allow you to use an infinite number of repeats and block repeats with a special instruction. (PineDSPCore has one repeat level and one block-repeat. OakDSPCore has four levels of nesting.) They also support division step; normalization; exponent evaluation; bit-field test, set, reset, and change (except for the PineDSPCore); compare; square; accumulate/subtract previous product; conditionally modify accumulator; minimum/maximum calculation with pointer latching and modification; double-precision calculations; many conditional instructions; move data/program memory; context switching; delayed return; and automatic boot. Teak and Palm include special instructions to support coprocessors and built-in accelerators for Viterbi acceleration, bit-field operations, and other functions, and special instructions for least mean square, vector quantization, FFTs, and other algorithms. Palm adds support for delayed branches.

Support—DSP Group provides software-development tools along with evaluation/development boards and on-chip-emulation capabilities through the JTAG interface. The software tools include an assembler/linker, a C/C++ compiler, a preprocessor, a loader, a debugger, and the Assyst simulator. The debugger works in simulation or emulation mode and includes an application profiler. It contains support that allows you to extend the simulator and add logic to customize the debugger’s hardware interface and to perform multicore debugging. All tools run under Windows and Solaris. Check out DSP Group’s third-party developers at www.dspg.com/prodtech/core/3rdparty.htm.
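A short numerical sketch shows why guard bits and saturation on transfer matter. It models a 36-bit accumulator (32-bit product plus 4 guard bits) in plain Python; the helper names `wrap_signed`, `saturate`, and `mac` are hypothetical, and the model says nothing about how the cores implement these steps in hardware.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement field of `bits` bits."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def saturate(value, bits):
    """Clamp an integer to the signed range of a `bits`-bit field."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def mac(pairs, acc_bits=36):
    """Accumulate 16x16-bit products in an accumulator of `acc_bits` bits."""
    acc = 0
    for x, y in pairs:
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


if __name__ == "__main__":
    pairs = [(32767, 32767)] * 4      # four maximum-magnitude products
    wide = mac(pairs, acc_bits=36)    # guard bits absorb the growth
    print(wide, saturate(wide, 32))   # clamp only when storing the result
    print(mac(pairs, acc_bits=32))    # no guard bits: the sum wraps
```

With the guard bits, the intermediate sum stays exact, and only the final transfer is clamped to the largest 32-bit value. Without them, the same sum wraps around to a small negative number.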
dspdirectory 16 bits
Equator Technologies MAP-CA
The architecture incorporates both VLIW and SIMD. The company claims the DSP yields good code with a C compiler only. Equator partnered with Hitachi to provide a media processor for consumer applications.
The media-accelerated processor (MAP-CA) targets multimedia products, such as set-top boxes, digital TVs, videoconferencing systems, and medical-imaging products. Equator’s MAP-CA is more than a DSP: It combines both microprocessor and DSP functions in a very-long-instruction-word (VLIW) framework. The VLIW core executes four operations in parallel and supports partitioned single-instruction-multiple-data (SIMD) operations for 8-, 16-, 32-, and 64-bit data types. The processor integrates key media-access I/O interfaces on the processor along with one 32-bit, 33/66-MHz PCI bus. MAP-CA devices come with various coprocessors that offload the main CPUs, as well as integrated memory, a PCI interface, and a synchronous-DRAM (SDRAM) controller.
The MAP-CA comprises two clusters, each with an Integer-ALU (I-ALU) and an Integer-Graphic-ALU (IG-ALU). Each cluster has its own register file containing 64 32-bit registers that a program can use in pairs as 64-bit registers, 16 1-bit predicate registers, and four special 128-bit registers. Each I-ALU contains a load/store unit, a 32-bit integer ALU, and a branch unit. The I-ALUs perform integer, logical, flow-control, and memory-reference operations. The IG-ALU units contain a 32/64-bit integer ALU and a general-purpose signal and imaging-operation unit. The IG-ALU units perform integer, floating-point, and multimedia/graphics operations.
Native data types include 1-bit logical values; 8-, 16-, 32-, and 64-bit integers; and 32-bit addresses. The processor uses the 1-bit logical values to support predicated execution, which allows partial speculation and eliminates branches. The media-intrinsic operations include partitioned operations over these data types, including selection, comparison, selecting maximums and minimums, addition, multiply-add, complex multiplication, inner-product, and sum-of-absolute difference.
Each MAP-CA instruction contains four operations and structures nearly all operations in three-operand register-to-register format, and most operation codes are predicated. The compiler encodes each operation in a 34-bit format: 32 bits plus a 2-bit field in a block header. Therefore, each instruction consumes 136 bits. In a typical VLIW instruction, some of the fields can contain no-operation instructions, leading to wasted space in the processor’s cache or memory. MAP-CA solves this problem by storing the instruction in a compressed format using the block header for compression information. The architecture uses register-scoreboard interlocks for load operations, and the compiler statically schedules all other operations, including multicycle SIMD operations. The processor has an eight-phase pipeline with block instruction issue, and the data-cache-refill mechanism can continue execution through 16 pending misses.
The MAP-CA supports several on-chip memories and access to SDRAM and other memories via the 32-bit, 33- or 66-MHz PCI bus. On-chip memories include a 32-kbyte data cache, a 32-kbyte instruction cache, a 4-kbyte data memory and 4-kbyte program memory for VLx, a 6-kbyte line-buffer memory for VF, and an 8-kbyte buffer memory for the DataStreamer. The VLIW portion has a 32-kbyte instruction cache and a 32-kbyte data cache. The CPU physically addresses both caches, and separate translation look-aside buffers provide virtual addressing for both. The architecture supports dynamic address translation and virtual-memory protection. In addition to supporting the I-ALU ports, the data cache supports a port to the data-transfer switch within the DataStreamer.
MAP-CA coprocessors include a variable-length encoder/decoder (VLx), a video filter (VF), and Equator’s patented DataStreamer. The 16-bit VLx RISC coprocessor has 32 16-bit registers and special hardware for bit-stream processing, hardware-accelerated MPEG-2 table look-up, and general-purpose variable-length decoding. The 2-D VF takes a 4:2:0 or 4:2:2 YUV stream as input and scales it up or down as required. The VF pumps out scaled 4:4:4 YUV data to the SDRAM or the DRC through the video bus. DataStreamer is a programmable, 64-channel, multithreaded DMA engine that provides buffered data transfer between MAP-CA memory subsystems or between memories and I/O devices. It initiates transfers to and from memory. These transfers are done under software control without consuming cycles from other on-chip processing units.
MAP-CA chips support several audio/video interfaces, including ITU-R BT.656 input and output, MPEG-2 Transport Channel Interface, an I2C selectable interface, and IEC958 and I2S digital-audio interfaces.
Special instructions—The MAP-CA performs shift/extract/merge operations and 64-bit SIMD operations with 8-, 16-, and 32-bit partitions. It also handles 128-bit partitioned SIMD operations, including inner-product with a new partition shift-in for efficient FIR operation and sum-of-absolute difference with new partition shift-in for efficient block-matching operations.
Support—Equator offers its iMMediaTools software developer’s kit, which includes a trace-scheduling C-language compiler, the FIRtree media-intrinsic C-language extensions, an assembler, a linker, a source-level debugger, an assembly-level debugger, a virtual-machine, cycle-accurate simulator, a profiling-CPU simulator, and assorted libraries. MAP-CA supports the Microsoft Windows NT and Linux host-development environments.
www.ednmag.com 68 edn | March 30, 2000
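The partitioned sum-of-absolute-difference operation mentioned above can be pictured with a scalar model. The following is only a sketch: the function name `sad8x8` is hypothetical, and it treats a 64-bit word as eight unsigned 8-bit partitions, which is an assumption about lane layout rather than a description of Equator’s instruction encoding.

```c
#include <stdint.h>

/* Scalar model of a partitioned sum of absolute differences.
 * Each 64-bit argument is treated as eight unsigned 8-bit lanes
 * (an assumed layout); the result is the sum of |a_i - b_i|. */
uint32_t sad8x8(uint64_t a, uint64_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 8; ++lane) {
        int x = (int)((a >> (8 * lane)) & 0xFF);  /* extract lane of a */
        int y = (int)((b >> (8 * lane)) & 0xFF);  /* extract lane of b */
        sum += (uint32_t)(x > y ? x - y : y - x); /* absolute difference */
    }
    return sum;
}
```

A block-matching loop would call such a routine once per row of pixels and keep the candidate block with the smallest total; a SIMD unit performs the eight lane operations in one step instead of a loop.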
Infineon Carmel DSP 10XX and 20XX cores
Architecture features a customizable long-instruction word. Devices support conditional execution with predication mechanism. DSP has 40-bit internal architecture.
The Carmel DSP core is a family of licensable, fully synthesizable, 16-bit, fixed-point DSP cores. The first two members of the family are the 10XX and 20XX cores. The Carmel DSP configurable-long-instruction-word (CLIW) architecture delivers very-long-instruction-word (VLIW) performance without sacrificing power-dissipation and code-compactness requirements. Both cores operate from 1.2 to 2.7V and have low-power features.
The 10XX has six computation units that operate in parallel. These units include two 16×16-bit multiply-accumulate (MAC) units, two 40-bit ALUs, a 40-bit barrel shifter, and a 40-bit exponent unit supported by six 40-bit accumulators. The 20XX can have as many as 10 computation units that operate in parallel. These units can include four 16×16-bit MACs, two to four 40-bit ALUs, a 40-bit barrel shifter, a 40-bit exponent unit, and other accelerators for applications such as a 32-bit MAC for audio processing or quad 8-bit MACs for video processing. The Carmel DSP 20XX is binary-compatible with the 10XX and allows you to further modify and extend the core’s instruction set.
Carmel DSP’s modified Harvard architecture has separate program- and data-memory space. Carmel DSP features four 16-bit data buses and a 48-bit bus for reading four operands and two 24-bit instructions in a single cycle. Both cores can generate four independent 16-bit memory addresses.
The MAC units operate on all combinations of signed/unsigned, 16-bit operands. Also, the MAC units can perform addition and subtraction operations, which allow the Carmel to complete four additions or subtractions per clock cycle. The ALUs operate on 40-bit data and provide arithmetic and logic operations; back-trace support, which can accelerate Viterbi algorithms; and saturation and limit support. The barrel shifter supports both arithmetic and logical shifts by 0 to 40 bits left or right and rotate right through carry on 16-, 32-, or 40-bit operands. The exponent unit handles exponent detection and block floating point.
In addition to accumulators, a program can fetch operands directly from memory and route them to either of the execution units. The program can also write the results directly to the same memory location without requiring a temporary register (that is, read-modify-write). This non-load-store design avoids the overhead typical of load-store architectures. A programmer may split each accumulator into two 16-bit half-accumulators, which the program can use as a source or a destination.
The hardware-looping mechanism enables zero-overhead loops, nested to as many as four levels. The Carmel DSP family supports conditional execution with the Carmel DSP-predication mechanism to avoid branch penalty and fast context switching with a register-bank-exchange instruction and a conditional execution-load instruction. The register-bank-exchange instruction allows you to specify which registers to shadow.
Special instructions—The Carmel DSP uses a 24-bit instruction that you can extend to 48 bits for wider operand selection, larger immediate-operand fields, and short and long addressing modes. Carmel DSP’s CLIW architecture extends the traditional DSP instructions into VLIW capability through an additional 96 bits of the CLIW memory. The application-specific CLIW instruction specifies as many as six parallel operations that can use the two ALUs and the two MAC units and perform two data moves on the 10XX. You can customize instructions through CLIW and provide a high degree of parallelism with the ability to simultaneously generate four addresses and perform four arithmetic operations and two data transfers. However, the compiler provides no support for this function, and you must hand-code CLIW instructions into your program. All arithmetic/logic instructions support double precision, fractional and integer arithmetic, saturation, limiting, block floating point, and nearest and convergent rounding modes. In addition, special instructions are available for square, division, minimum/maximum, bit manipulations, and logical and arithmetic shifts. The back-trace instructions accelerate the Viterbi-decoder implementation.
Addressing modes—The Carmel DSP family supports direct, indirect, index-by-register, index-by-immediate, and direct-operand references. Both cores support linear, modulo, bit-reversal, and special-modulo modification modes. The special-modulo mode enables you to stack memory buffers without the alignment usually associated with data buffers.
Support—Infineon provides an integrated program-development environment with uniform interfaces running under Windows. The software tools include an assembler/linker, a C compiler, and Tasking’s (www.tasking.com) simulator and debugger. The simulator is instruction-, cycle-, bit-, and pipeline-accurate. An RTOS facilitates task-level debugging. Algorithm libraries are available as both C and assembly-language routines for common DSP applications and functions. Infineon provides an evaluation/development board that supports JTAG-based emulation using Carmel’s on-chip debugging-support capability. These tools allow you to run programs in real time and within the application’s hardware system. You can also check out Infineon’s partner section at www.infineon.com/dsp.
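To make the modulo modification mode concrete, here is a minimal C sketch of the pointer update that such hardware performs for free on every access. The function name `modulo_update` and its parameters are illustrative assumptions, not part of Infineon’s tool set, and the sketch assumes the step magnitude is smaller than the buffer length.

```c
#include <stddef.h>

/* Post-modify an index inside a circular buffer of `len` entries.
 * `step` may be negative; its magnitude is assumed to be smaller
 * than `len`, so a single wrap correction is enough. */
size_t modulo_update(size_t idx, ptrdiff_t step, size_t len)
{
    ptrdiff_t next = (ptrdiff_t)idx + step;
    if (next >= (ptrdiff_t)len)
        next -= (ptrdiff_t)len;   /* wrapped past the end */
    else if (next < 0)
        next += (ptrdiff_t)len;   /* wrapped past the start */
    return (size_t)next;
}
```

In a software FIR filter, this update would run once per tap to walk the delay line; a DSP address generator does the same comparison and correction in parallel with the MAC, so the loop body carries no extra instructions for it.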
LSI Logic ZSP DSPs
DSPs have a superscalar architecture. DSPs use static-branch prediction. Special instructions support Viterbi.
The ZSP architecture, which LSI Logic acquired in 1999, is a superscalar architecture comprising synthesizable cores and standard products. Its most prominent features are the dual multiply-accumulate (MAC) units and dual ALUs, which operate in parallel. This “C-friendly” architecture implements an “almost-orthogonal” instruction set, flexible addressing modes, and a larger register file than that of traditional DSPs.
The core of the ZSP comprises a five-stage pipeline: fetch and decode, instruction group, read, execute operations, and write results into a register file. The ZSPs can issue as many as four instructions per cycle. The pipeline-control unit performs instruction grouping and resolves data and resource dependencies, which free programmers from the complexities of detailed pipeline management. It then dispatches the instructions to the individual functional units. Pipeline control also performs result bypassing and interrupt processing. Result bypassing moves results from functional units directly back into the pipeline without going through the write-back stage. This feature allows the CPU to execute a load and a load-dependent instruction in the same cycle. One processing unit handles instruction scheduling, making it easier for ZSP to develop custom instructions without affecting the datapath.
The ZSP’s functional units share a register file of 24 16-bit registers. The core can simultaneously access only 16 of these registers. An instruction can swap eight of these registers. The register file serves as a source and destination for multiply-accumulate (MAC) operands. It performs two 16-bit MACs or one 32-bit MAC per cycle. Results flow into a 40-bit accumulator. It also supports eight 32-bit or 16 16-bit barrel shifters. This support is possible because the DSP performs a single-cycle shift as large as 16 bits on each of the core’s 16 registers. You can also concatenate two 16-bit registers and perform single-cycle, 32-bit shifts on that register combination.
A 32-word instruction cache and a 48-word data cache are standard features. Separate 64-bit instruction and data buses feed these caches from memory. An instruction prefetcher operates in parallel with the instruction issue unit, grabs code to be executed from memory, and loads it into the cache. A data prefetcher operates in parallel with the execution unit, grabs data from memory, and loads it into cache. If the processor is processing a loop that is bigger than the 32-word cache, the ZSP’s prefetcher fills the cache before the instruction execution.
The ZSP uses static-branch prediction, so you should try to design your code so that program flow takes all backward predictions. A mispredicted branch incurs a three- to five-cycle latency; this latency is only three cycles when the target instructions are still in the cache. LSI Logic’s DSP has hardware support for four nested looping constructs. Zero-overhead looping requires an “again” instruction to indicate the end of the loop and to perform the counter decrement.
The ZSP provides multitasking support and uses a six-level interrupt structure with programmable priority levels; a high-priority event can interrupt all instructions. The DSP core contains an integrated non-cycle-stealing, eight-channel DMA unit. The DMA unit has its own buses and operates independently of the processor. You can configure each channel to handle time-division-multiplexed data. Two 16-bit timers are also standard features of the DSP core.
Special instructions—The ZSP has add-compare-select and parallel-add and -subtract instructions for FFT and Viterbi, single-cycle bit-manipulation instructions, software-stack support, and specialized load instructions.
Addressing modes—The ZSP performs bit-reversed addressing in hardware. The ZSP also has software support for bit-reversed addressing using a bit-reversing structure: an instruction flips the bits. The DSP has hardware support for immediate or register-content, indirect, indexed, and register-to-register addressing modes. The ZSP has hardware support for two circular buffers.
Support—LSI offers a Gnu-based compiler, a linker, and an assembler. The compiler supports new C fixed-point data types and employs a variety of C-intrinsic functions. Development platforms include an integrated debugging environment, RS-232C and JTAG-based host communication and code downloading, flash, two voiceband codecs, and analog I/O interfaces.
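Bit-reversed addressing, as used for FFT data reordering, amounts to reversing the low bits of an index. The sketch below shows the operation in portable C; the name `bit_reverse` is hypothetical and the loop stands in for what a bit-flipping instruction or address generator would do in one step.

```c
#include <stdint.h>

/* Reverse the low `bits` bits of `index` (1 <= bits <= 16).
 * For an N-point FFT, bits = log2(N). */
uint16_t bit_reverse(uint16_t index, unsigned bits)
{
    uint16_t out = 0;
    for (unsigned i = 0; i < bits; ++i) {
        out = (uint16_t)((out << 1) | (index & 1u)); /* shift in the LSB */
        index >>= 1;
    }
    return out;
}
```

For an 8-point FFT (three index bits), input position 1 maps to output position 4 and position 3 maps to 6, which is the familiar decimation ordering.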
Lucent/Motorola StarCore SC100
DSP developed in conjunction with compiler for better C programming. Variable-length instructions yield code efficiency and parallelism. DSP architecture is scalable.
The StarCore architecture, which Lucent and Motorola collaborated to develop, represents the foundation of many family members of DSP cores. The first implementation of this architecture is the StarCore SC140 core, which Motorola has used in its MSC8101 DSP product. This device operates at 1.5V and 300 MHz and includes the company’s 32-bit RISC extracted from the popular PowerQUICC II.
A key aspect of the StarCore architecture is its variable-length-execution-set (VLES), or explicitly parallel instruction computing (EPIC), model that promotes scalable resources (such as multiple ALUs), a scalable instruction-set architecture, and increased orthogonality. Traditional very-long-instruction-word (VLIW) architectures use fixed-length instructions in a fixed-length execution set, and the instructions and execution sets have memory-alignment restrictions. This architecture sometimes requires the compiler to insert nonoperating instructions to fill unused instruction slots. On the other hand, VLES allows variable-length instructions with no alignment restrictions; specifically, the SC140 uses 2-, 4-, or 6-byte instructions. The StarCore architecture can also group multiple instructions into execution sets, which the architecture executes simultaneously. The SC140’s compiler statically builds execution sets comprising one to six instructions. Relying on the compiler to encode the parallelism minimizes the silicon resources for instruction decoding and dispatching. To further increase flexibility and scalability, StarCore uses the prefix “construct” to add features to instructions. You may add in one or two prefixes per instruction to accommodate additional registers and conditional execution.
Another key StarCore feature, the associated compiler, can detect parallelism and group independent instructions. The compiler performs a variety of optimizations, including software pipelining, global scheduling, function inlining, if conversion to exploit predicated execution, and sophisticated loop analysis and transformations to exploit zero-overhead nest loops. The SC140 compiler also implements DSP-memory-specific optimizations, such as modulo addressing, postincrement addressing, and memory-access vectorization.
The SC100 incorporates a five-stage pipeline surrounded by four parallel data ALUs (DALUs). Each DALU contains a multiply-accumulate (MAC) unit and a bit-field unit (BFU). The MAC contains a multiplier and an adder that can perform a 16×16-bit multiply and add the 40-bit accumulator to the result. The BFU contains a 40-bit, parallel, bidirectional shifter with a 40-bit input and a 40-bit output, a mask-generation unit, and a logic unit. The BFU can perform multibit or single-bit left/right shift, bit-field insertion or extraction, count of leading bits, and several logical operations. The CPU can perform as many as four parallel arithmetic operations, logic operations, or both in each cycle. A program can use the destination of every arithmetic operation as a source operand for the operation immediately following with no time penalty. The SC140 contains 16 40-bit registers and 27 32-bit address registers.
The SC140 also contains two address-generation units (AGUs), which perform effective address calculations using the integer arithmetic necessary to address data operands in memory. They also contain the registers for generating the addresses. They operate in parallel with other chip resources to minimize address-generation overhead. The AGUs implement linear, modulo, multiple-wrap-around-modulo, and reverse-carry arithmetic. A separate bit-manipulation unit sets, clears, inverts, or tests a selected group of bits in a register or memory.
A program-control unit includes a program sequencer, which fetches and dispatches instructions, performs loop and branch control, and processes exceptions. The program sequencer detects the parallel execution set. Two 64-bit data buses and a 128-bit-wide program-data bus allow the core to fetch as many as two prefixes and six instructions per cycle. The SC140 can handle as many as four hardware-nested do loops.
Traditional DSPs do not efficiently handle runtime stacks; in many cases, the DSPs provide no architectural support, or addressing modes using the stack are less efficient than those using absolute addressing. The SC140 compiler implements “compiled stacks” to provide the functions of stack accesses at the cost of absolute addressing, thus avoiding extra runtime overhead. Two registers in each AGU support software-stack operation: the normal-mode and the exception-mode stack pointers. These registers can perform predecrement and postincrement updates. The existence of two stack pointers in the SC140 eases support of multitasking systems and optimizes stack usage for these systems.
Addressing modes—The SC140 supports register-direct, address-register-indirect, and program-counter-relative addressing modes, as well as special address modes that use an immediate value to determine the data or the address of interest.
Special instructions—The SC140 multipliers support all combinations of signed and unsigned operands and both fractional and integer formats. The MAC units support add, subtract, negate, absolute value, clear, comparison, maximum/minimum operations, arithmetic-shift operations, transfers between registers, and rounding. The MAC units also support division iteration. The SC140 supports a single-instruction-multiple-data version of maximum/minimum, additions, and subtractions (MAX2, ADD2, SUB2) by treating values in registers as packed pairs of 16-bit data operands. Using these instructions, the SC140 can perform eight 16-bit additions or maximum/minimum operations per cycle. The SC140 includes MAX2VIT, a special version of the maximum/minimum operation that works with the Viterbi shift-left instruction, a specialized move instruction that supports efficient implementation of Viterbi decoding algorithms.
Support—Tools include an assembler, a linker, a simulator, an optimizer, and an ANSI C- and C++-compliant C/C++ compiler. The compiler intrinsically supports International Telecommunications Union/European Telecommunications Standards Institute primitives. Green Hills Software (www.ghs.com) will also provide C/C++ support with its Multi development environment.
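The idea behind packed-pair operations such as ADD2 and MAX2 can be sketched in plain C. This is a simplified model, not the SC140 instruction semantics: it assumes two 16-bit lanes in a 32-bit word, wraparound addition without saturation, and signed lanes for the maximum; the names `add2`, `max2`, and `sext16` are illustrative.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Add two packed pairs of 16-bit lanes; a carry out of the low
 * lane is discarded instead of spilling into the high lane. */
uint32_t add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}

/* Lane-wise signed maximum of two packed pairs of 16-bit lanes. */
uint32_t max2(uint32_t a, uint32_t b)
{
    int32_t ahi = sext16(a >> 16), bhi = sext16(b >> 16);
    int32_t alo = sext16(a), blo = sext16(b);
    uint32_t hi = (uint32_t)(ahi > bhi ? ahi : bhi) & 0xFFFFu;
    uint32_t lo = (uint32_t)(alo > blo ? alo : blo) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

With four ALUs each handling one packed pair, a core reaches the eight 16-bit operations per cycle quoted above; the software model needs several instructions per lane to get the same result.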
Lucent Technologies DSP16xx
Device has a 16×16-bit multiplier. Features include a 36-bit ALU/shifter. All devices have internal ROM. Devices operate at 2.7 to 4.75V.
Lucent Technologies sells its DSP16xx-based products to the modem, wireless, and digital-telephony markets. The DSP16xx family with its classic Harvard architecture uses three internal buses to move instructions, coefficients, and data in parallel for high-throughput processing. The DSP defines two 64k-word address spaces: one for program and coefficients and one for data. The DSP16xx uses fixed-point, 2’s complement arithmetic throughout.
The main execution unit of the DSP16xx is the data-arithmetic unit, which has a 16×16-bit multiplier and a 36-bit ALU/shifter with 4 guard bits and two accumulators. The multiply-accumulate (MAC) unit has a three-stage pipeline for fetching, multiplying, and accumulating. The multiplier and adder operate in parallel. The MAC can shift the multiply result before running it through the ALU/shifter and into one of the accumulators. The dual accumulators let you halve the number of memory accesses and thus are useful for tasks such as autocorrelation lags. The bit-manipulation unit has a 36-bit barrel shifter, two 36-bit accumulators, and four general-purpose 16-bit registers.
The instruction-stream pipeline has fetch, decode, and execute stages and runs in parallel with the MAC. The shallow pipeline minimizes the impact of branches to two cycles. The DSP16xx has an exposed pipeline, and the multiplier has registered inputs and outputs, letting a programmer see data at any point. The programmer controls the fetching of data into the ALU and controls the multiply and add. This method minimizes the number of registers to hold temporary data and, therefore, minimizes power consumption and die size.
The DSP’s X memory contains both program and coefficients and would typically become a bottleneck for MACs. However, the program uses special instructions to load an inner-loop code block into a 15-instruction cache for fast inner-loop processing. Alternatively, if a DSP had program, X, and Y memories, an algorithm with few instructions would be an inefficient use of memory. The other advantage of the instruction cache is in power savings. Both X and Y buses connect to the same dual-port RAM. If references occur simultaneously to both ports of a bank, the chip incurs a one-instruction-cycle penalty and first performs data access. Memory writes always take two cycles.
The DSP has X- and Y-memory address generators, each with its own internal adders and registers to hold address values and offsets. The XAAU has a 12-bit adder, a 12-bit static-offset register, and four 16-bit pointer registers. The YAAU has eight static registers and an adder. The X side has half the number of registers as the Y side because signal processing typically requires fewer coefficient pointers; the Y side points to memory that requires more pointers. (The DSP stores coefficients on the X side.) Programmers can access XAAU and YAAU registers.
Special instructions—Instructions for the DSP16xx include single/block-instruction hardware looping, conditional subroutine call, bit-field extraction, bit shifting, exponent detection, compare, and replacement. The DSP16xx has no rotation instructions.
Addressing modes—The DSP16xx has register- and memory-direct, register-indirect, compound addressing, and immediate addressing modes. A special address cycle, a compound addressing mode, allows both a read and a write to memory. Also, it has no bit-reversed addressing.
Support—Lucent Technologies supplies a hardware-development system with an in-circuit-emulator pod. Evaluation and demo boards are also available. The company sells software-development tools, including an assembler/linker, a simulator, a debugger, and an application library. Lucent offers a Linkable Functional Simulator, a DSP simulator model that plugs into system-level simulation tools from EDA vendors. This model allows you to develop your application at the system level using building blocks and to determine whether your design has the bandwidth to perform the task. The company also provides a cycle-accurate model of Lucent’s DSP.
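The remark that dual accumulators halve memory accesses for autocorrelation can be illustrated with a short C sketch. The routine below computes two adjacent lags in a single pass, so each sample is loaded once and feeds both running sums. The function name `autocorr_pair` and the use of 64-bit sums are assumptions made for the illustration; they do not model the DSP16xx’s 36-bit accumulators or its instruction set.

```c
#include <stdint.h>
#include <stddef.h>

/* Compute autocorrelation lags k and k+1 of x[0..n-1] in one pass.
 * Each x[i] is read once per iteration and multiplied into two
 * separate accumulators, one per lag. */
void autocorr_pair(const int16_t *x, size_t n, size_t k,
                   int64_t *r_k, int64_t *r_k1)
{
    int64_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i + k < n; ++i) {
        int32_t xi = x[i];                      /* shared operand */
        acc0 += (int64_t)xi * x[i + k];         /* lag k */
        if (i + k + 1 < n)
            acc1 += (int64_t)xi * x[i + k + 1]; /* lag k + 1 */
    }
    *r_k = acc0;
    *r_k1 = acc1;
}
```

With a single accumulator, the same two lags would need two full passes over the data, re-reading every x[i]; that is the saving the text refers to.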
Lucent Technologies DSP16000
Device has dual MAC units. Device supports 16×32- and 32×32-bit multiplies. ALU supports 16-, 32-, and 40-bit operands. The separate X and Y memories have 32-bit datapaths.
The DSP16000 has dual multiply-accumulate (MAC) units and supports 16×32- and 32×32-bit multiplies. Although this core is source-code backward-compatible with Lucent’s DSP16xx core, significant differences exist between the two cores.
The DSP16000’s code and data transfers rely on a modified Harvard architecture with separate 20-bit-address and 32-bit-data buses for the instruction/coefficient (X) and data (Y) memory spaces. The core supports as much as 1M words with a 20-bit address bus. The on-chip X and Y memories each have a 32-bit datapath to the X and Y registers, an essential ingredient of keeping the dual MAC units fed.
The dual MAC units on the DSP16000 are part of the DSP’s three-stage, parallel pipeline that begins with dual 32-bit registers. 32-bit datapaths come from the X and Y memories. These registers serve as four 16-bit register inputs to the two signed 16×16-bit multipliers. Although the DSP can load the 32-bit X and Y registers in parallel with execution of one or two multiply operations, you must arrange the 16-bit operands as pairs in memory. In other words, in a 32-bit fetch, the data word at Address 0 goes into one multiplier, the data word at Address 1 goes into the other, and so on. The DSP16000 core performs two 32-bit fetches, two multiplies, and two accumulates in one cycle. The DSP16000’s dual MAC architecture enables efficient, mixed-precision, 16×32-bit and double-precision, 32×32-bit multiplies.
You can use the data-arithmetic unit (DAU) to direct the multipliers’ outputs to a 40-bit ALU with add-compare-select (ACS) capability, a 40-bit bit-manipulation unit (BMU), or a 40-bit, three-input adder/subtracter. The three-input adder/subtracter allows a 40-bit addition or subtraction in parallel with another operation that uses the ALU. The ALU supports 16-, 32-, and 40-bit operands and can perform specialized compare instructions to accelerate minimum and maximum operations. The BMU performs operations such as insert, extract, and rotate bits. The DSP16000 lacks a barrel shifter, so the BMU must perform shifts, but the shifts take more than one cycle. The DSP16000 can perform overflow saturation on the multiplier output and on the outputs of the three arithmetic units. Overflow saturation can also affect an accumulator value as program control transfers it to memory.
The register file contains eight nonorthogonal, 40-bit accumulators, which minimizes a programmer’s need to swap values between memory and registers, minimizes code size, and allows more efficient compiler implementations. The DSP can simultaneously send the result of an arithmetic operation into an accumulator and into the multiplier’s input registers. This feedback loop avoids an explicit move instruction when you use the result as an input to a subsequent multiply.
An on-chip cache holds as many as 31 32-bit instructions. You must use this cache with a do-loop instruction. When you use a do instruction, the cache-control circuitry loads the subsequent instructions into the cache as the pipeline executes them. Once the circuitry loads the loop into cache, the core can execute the loop as many as 65,535 times without overhead. A redo instruction re-executes code that has already been loaded into the cache using the do instruction. The cache frees the instruction bus for X memory fetches, allowing the DSP16000 to perform as a three-bus architecture. The DSP16000 does not perform nested looping; that is, you can perform cache-based, zero-overhead looping on only the innermost loop. Conditional execution of many instructions avoids branch penalties because branches take three cycles.
Special instructions—The DSP16000 supports a mixture of 16- and 32-bit instructions. Other special instructions include rounding, negation, absolute value, and fixed arithmetic. The DSP16000’s trace-back encoder accelerates Viterbi decoding and performs mode-controlled side effects for Viterbi compare instructions. The side effects allow the DAU to store, without overhead, state information necessary for trace-back decoding. In addition, you can use the compare instructions for determining the least common paths for Viterbi processing. Additionally, these compare instructions can store trace-back bits as a side effect to support Viterbi processing.
Addressing modes—The DSP16000 supports register- and memory-direct, register-indirect, immediate, and register+displacement addressing modes. Addressing modes are oriented toward pointer arithmetic. The DSP16000 supports two concurrent circular buffers. Because the device offers no bit-reversed addressing, software must perform this task.
Support—The DSP16000’s software tool set includes an ANSI C compiler, an assembler, a linker, a debugger, and a simulator. Hardware tools include an in-circuit emulator and a development board. The C compiler (based on Gnu C) performs numerous local and global optimizations to produce optimized code, emits information to enable C source debugging, and allows mixed C and assembly code. The assembler supports the ANSI C preprocessor to allow file inclusion, macro substitution, conditional assembly, and various numeric-constant formats. The assembler also allows expressions to include multiple user-defined labels and supports the Tcl preprocessor directives to allow the assembler to share macro operations with the debugger. The debugger supports integrated debugging of one processor or multiple homogeneous or heterogeneous processors. It supports data and instruction breakpoints, mixed assembly and C debugging, extensive code profiling, software simulation with near real-time visibility, stand-alone or networked hardware emulation through the JTAG with the TargetView Communication System, hardware trace, and on-chip cycle count. Extensive on-chip debugging hardware lets you monitor many DSP16000s in real time. Another feature of the debugger is an “architectural view,” which provides a block-diagram view of the DSP’s multiple processing units. As you step through the instructions of your application code in the debugger, the architectural-view utility graphically displays the data flow through the DSP. This feature enables you to view underused parts of the architecture and make code changes to increase code efficiency. Third-party tools, such as Synopsys COSSAP, Cadence SPW, and Mathworks Matlab, support the DSP16000 simulator. The software tools cost $1500, and the hardware tools cost $5000 to $7000.
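The add-compare-select step with a trace-back side effect, which the compare instructions above are said to accelerate, can be sketched in a few lines of C. This is a generic Viterbi building block under assumed conventions (smaller metric wins, decision bits shift into a word from the right); the function name `acs` and its parameters are hypothetical and do not describe Lucent’s instruction set.

```c
#include <stdint.h>

/* One add-compare-select step for a single trellis state.
 * Adds a branch metric to each of two predecessor path metrics,
 * keeps the smaller sum, and shifts the decision bit (0 = first
 * path, 1 = second path) into *trace for later trace-back. */
uint32_t acs(uint32_t metric0, uint32_t branch0,
             uint32_t metric1, uint32_t branch1,
             uint32_t *trace)
{
    uint32_t p0 = metric0 + branch0;       /* add */
    uint32_t p1 = metric1 + branch1;
    uint32_t bit = (p1 < p0) ? 1u : 0u;    /* compare */
    *trace = (*trace << 1) | bit;          /* record the decision */
    return bit ? p1 : p0;                  /* select */
}
```

In software, the decision bit costs an extra shift and OR per state; storing it as a side effect of the compare is what removes that overhead in hardware.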
dspdirectory 16 bits
www.ednmag.com | EDN | March 30, 2000

Massana FILU-200 DSP coprocessor core

Rather than create an entire DSP with all the bells and whistles of a stand-alone processor, Massana developed a 16-bit, fully synthesizable DSP coprocessor core that requires a host processor. This approach allows the company to deliver real DSP capability and use only 30,000 gates. Massana claims that the FILU-200 is compatible with all RISC processors. The host views the FILU-200 as a memory-mapped peripheral, and the host-FILU interface provides the host with access to the three FILU RAMs and to the FILU control word. The interface uses a bus-request/grant or master/slave protocol to share the RAM. The host can access the RAM even when the FILU-200 is busy, but it stalls the FILU-200. When the FILU-200 is busy, it sets a busy bit in RAM. The device pipelines host accesses to the RAMs to separate the critical paths on the host I/O and RAM I/Os.

The FILU-200 is a dual-multiply-accumulate (MAC)-instruction architecture with dual barrel shifters, a 20-bit internal datapath, and 44-bit accumulators. The FILU-200 can perform as many as 19 operations in parallel every cycle; however, each of these operations must be from different functional units within the core. The core contains a program-control unit (PCU), an address-generation unit (AGU), and a computational unit (CU).

The FILU-200 comprises a fetch-decode-execute instruction pipeline that is maintained regardless of whether the CPU retrieves the function from program ROM, data RAM, or the 96-bit-wide program RAM. When executing from ROM, the CPU retrieves a very long instruction word (VLIW) from ROM every cycle. This retrieval provides the DSP with access to the parallelism of all the functional units. When the CPU executes from program or data RAM, the compiler encodes the VLIW to fit into the width of the memory, but this encoding reduces parallelism. When the CPU executes from data RAM, data accesses conflict with program accesses, as you would expect from a von Neumann architecture. But a programmer typically stores the key DSP functions in program ROM. If the executing macroinstruction contains a jump, the CPU must execute the three macroinstructions in the pipeline before the jump takes effect. The FILU-200 has a six-deep hardware stack, allowing your program to contain as many as six nested subroutine calls. A loop-control unit provides zero, increment, and decrement operations on all of the address and counter registers.

The AGU can perform as many as three parallel addressing instructions in each cycle. The AGU provides dedicated hardware to support FFT-specific addressing needs. The AGU includes the data-addressing unit that provides linear, radix-4 data, and bit-reversed increment and decrement operations. The AGU also includes the twiddle-addressing unit that provides linear, twiddle-factor linear, and radix-4 increment operations. The AGU also indexes the sine/cosine ROM, which contains twiddle factors for the FFT operation. A sine/cosine unit for FFT and rotor applications looks up coarse and fine sine and cosine values in the FILU-200's sine/cosine ROM.

The CU implements two 22×16-bit signed or unsigned MAC instructions, as well as two barrel shifters that provide arithmetic shift (with saturation and convergent or nonconvergent rounding) and logical shifts. The architecture has ten 22-bit registers, which have 2 guard bits that allow your program to add two 20-bit numbers without worrying about overflow. The CU also includes the block-exponent unit (BEU), which detects the exponent of the largest number in an array of numbers. The DSP also supports zero-overhead block floating-point hardware. Before the CPU shifts the data stored in the accumulators through the barrel shifters, the BEU compares the exponent with the MX register and updates the value accordingly.

The data RAM is 40 bits wide, allowing you to store two 20-bit values to maintain full precision. It comprises two consecutive 20-bit signed words that the CPU loads into Xn and Yn registers. Similarly, for writing data into memory, the CPU simultaneously stores the Xn and Yn registers in RAM.

Addressing modes: The FILU-200 supports direct addressing but only between address, counter, and data registers. It supports data-register-indirect and address-register-indirect modes, but it is postincrement or postdecrement only. Every address register supports a linear-addressing mode with postincrement by N. A radix-4 butterfly-addressing mode provides data and twiddle access; the input data must be in order, and the output data is in bit-reversed order. The FILU-200 supports no immediate addressing.

Support: The FILU development system includes a 20-MHz FPGA implementation of the FILU-200, a prototyping area, an RS-232 interface, an LCD interface, and a Motorola MCore as host. The platform also includes a 16-bit stereo-audio codec. Each DSP core comes with a runtime library of common DSP routines, including FFT, FIR, IIR, matrix and vector functions, and data-move, each of which the host calls as a C-language routine. An application-programming interface (API) allows the host programmer to access FILU RAM, to start execution of FILU-200 functions, and to check the status of FILU-200 execution. Massana's instruction-set simulator provides bit-true, cycle-accurate simulation of the FILU-200 in C. During host-code development, the C calls within the API initiate functions that perform operations that simulate the behavior of the FILU-200 hardware.

Massana provides a FILU-DMT, which is a FILU-200 core with preprogrammed G.Lite DSP functions and a synchronous serial interface for the analog front end (AFE). The FILU-DMT is available on a PCI card, which enables a user to develop a "soft" G.Lite implementation on real-time hardware. The card includes the FILU-DMT implemented in a 20-MHz FPGA; a 33-MHz, 32-bit PCI interface; a G.Lite AFE using TI's TLFD500 codec; line drivers; and a data-access arrangement. The AFE provides the ADC, DAC, interpolation, decimation, and front-end analog filtering in a single chip.

Architecture features: The DSP subsystem uses host-processor resources. All registers include guard bits. The DSP integrates a sine/cosine function to improve FFT performance.
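The block-exponent idea described for the BEU can be illustrated in software. The following is a minimal sketch, assuming 20-bit signed data words as in the FILU-200 data RAM; the function names are hypothetical, and the code does not model the MX register or any actual FILU-200 instruction. It finds how many bit positions the largest-magnitude value in a block can shift left without overflow, then scales the whole block by that amount.

```python
def redundant_sign_bits(value, width=20):
    """Return how far a signed `width`-bit value can shift left without overflow."""
    # For negative values, ~value has the same number of significant bits.
    magnitude = value if value >= 0 else ~value
    return (width - 1) - magnitude.bit_length()


def block_headroom(block, width=20):
    """Headroom of a block is set by its largest-magnitude member (hypothetical helper)."""
    return min(redundant_sign_bits(v, width) for v in block)


def normalize_block(block, width=20):
    """Scale every value by the shared shift; return the scaled block and the shift."""
    shift = block_headroom(block, width)
    return [v << shift for v in block], shift


scaled, shift = normalize_block([1000, -3000, 12])
print(shift, scaled)
```

The shared shift plays the role of a block exponent: one exponent for the whole array rather than one per value, which is what makes block floating point cheap on fixed-point hardware.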
Motorola DSP56800

The DSP family's parallel-instruction set controls three concurrent execution units: the data ALU, the address-generation unit (AGU), and the program controller. The general-purpose C-style instruction set with its flexible addressing modes and bit-manipulation instructions enables you to write control code without worrying about DSP complexities.

The data ALU provides single-cycle multiplies and multiply-accumulate (MAC) instructions with 36-bit accumulation (4 guard bits). The ALU contains X0, Y0, and Y1 input registers; two accumulators; a MAC unit; a 16-bit barrel shifter; and automatic saturation logic, and it can perform single-cycle multiply and MAC with optional rounding, addition, subtraction, and squaring, as well as a set of logical and arithmetic operations. You can write ALU results back to either of the accumulators, which can also serve as input registers. Furthermore, if you don't expect the ALU result to be 36 bits, then the result can go directly back to one of the three input registers without corrupting an accumulator value.

The AGU can provide two data-memory addresses with address updates in one cycle, allowing two data-memory accesses while fetching an instruction. It contains five 16-bit pointer registers (one functioning as a stack pointer), an offset register, a modifier register for circular-buffer support, and two address ALUs (one supporting modulo arithmetic) to fetch two data items from memory every instruction cycle. The stack pointer has several addressing modes for uses such as parameter passing and local variables, improving compiler performance and supporting structured programming techniques.

Addressing modes: The 56800 supports register-direct, short and long memory-direct, seven memory-indirect, and immediate addressing modes. It also supports short-branch offset and modulo arithmetic for circular buffers. Additionally, Motorola added an addressing mode that requires no address calculation and allows direct access to the first 64 locations in X memory; this approach makes the access faster than a long immediate access.

Special instructions: The 56800 performs hardware-do and -repeat looping on one instruction or a block of instructions. The 56800 supports an interruptible hardware do loop on an any-sized block of instructions. In a set of nested loops, a programmer generally uses hardware looping for the innermost loop. Then, you can perform the outer loops using software looping and the 56800's data ALU register, AGU register, or a memory location to store the loop counter. To improve the performance of software looping, the 56800 supports a decrement instruction that operates directly on X memory and uses a conditional branch operation. Using a conditional transfer instruction with a compare instruction implements searching and sorting algorithms. If the specified condition is true, then the DSP performs a transfer from one register to another (for example, to store the array index of the maximum value in an array). The 56800 can perform bit-manipulation operations on any register or memory location. Single and dual parallel-move instructions perform memory accesses in parallel with ALU operation.

Support: The 56800 uses Motorola's OnCE port for on-chip emulation through a standard JTAG interface. Metrowerks' CodeWarrior, which Motorola now owns, offers an integrated development environment, a C compiler, an assembler, a linker, a code simulator, and a graphical source and assembly-level debugger for PCs. This package includes a 30-day evaluation license for CodeWarrior in the evaluation module (DSP56824EVM) and development system (DSP-56824ADS).

Architecture features: operation at 2.7V and 70 MHz; a mix of DSP with control functions; an interruptible hardware do loop.
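The role of the 4 guard bits in a 36-bit accumulator can be shown with a small model. This is a sketch under stated assumptions, not the 56800's actual datapath: it treats operands as plain 16-bit signed integers (ignoring the fractional product shift a real fixed-point MAC applies), and the function names are hypothetical. The accumulator wraps at 36 bits, and the result saturates only when narrowed to 32 bits.

```python
ACC_BITS = 36  # 32-bit result plus 4 guard bits, as described in the text
OUT_BITS = 32


def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clamp an integer to the signed range of `bits` bits."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def mac_block(xs, ys):
    """Accumulate products in a 36-bit accumulator; saturate on the 32-bit store."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap_signed(acc + x * y, ACC_BITS)
    return acc, saturate(acc, OUT_BITS)


acc, stored = mac_block([32767] * 4, [32767] * 4)
print(acc, stored)
```

In this example the running sum exceeds the 32-bit range after the second product, yet the guard bits keep the intermediate value exact; only the final store clips.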
NEC SPRX DSP

NEC's µPD7701x, or SPRX, family of DSP cores also includes ASIC macros that integrate SPRX cores plus memories, serial interfaces, and host-CPU interfaces. The cores feature a pipelined multiply-accumulate (MAC) unit, a barrel shifter, and eight general-purpose, 40-bit register/accumulators. The 32-bit instruction word helps to increase code efficiency by allowing a variety of operation parallelism. Memory-read and -write accesses can take a single cycle, although instruction pipelining may require an extra cycle for some instructions.

The 7701x comprises a data unit, a program unit, the general register set, a MAC-execution unit, and a peripheral set. The data unit contains X- and Y-memory units, each of which has an address generator. The program unit contains the instruction-address unit with loop control, interrupt-control logic, and instruction-decode/control logic. Each main unit connects to an external data-memory interface: 14-bit addresses and 16-bit data for the data unit. A transfer bus links the two main units. Through this bus, code can load and modify the pointer and modification registers.

The MAC unit has three 40-bit parallel subunits, which can execute concurrently: a multiplier, a 40-bit ALU, and a 40-bit barrel shifter. Unlike many other DSP implementations, the MAC subunits have no dedicated input and output registers. Instead, the general register set provides the data to drive the MAC subunits. Additionally, the MAC is tightly integrated with a set of eight general-purpose, orthogonal registers. In effect, the general register set, which is basically a multiport register file, serves as the interchange that links the data to the execution side of the processor. The core uses the X, Y, and transfer buses to load data into the general register set.

Two two-word RAM data-memory banks supply the X- and Y-data components for each MAC cycle. Each bank has its own address generator with a set of four address-pointer registers. Each unit also has a modulo register for circular buffering and an index-register link to the main data bus. A special bit-reverse circuit handles bit-reversed addressing for each bank. Autoincrement and autodecrement can be by one, by the value stored in special registers, or by an immediate value.

The µPD7701x has dual external-memory ports (one for 16-bit data and one for 32-bit programs) with two distinct 16k-word address spaces for data. A programmable wait-state generator allows you to divide each of the three external memory spaces into four regions and control the wait states of each region. You must directly load internal RAM under program control. The 7701x also supports an 8-bit-wide host interface and two serial channels. You can configure each of these for interrupt servicing or polling and to transfer 8- or 16-bit data. You can configure each of these serial channels as most- or least-significant-bit-first data format.

You can independently enable interrupts, and the DSP services them on a fixed-priority basis. The 7701x devotes 64 words of internal instruction RAM to interrupt vectors. Each interrupt vector comprises four instruction words, so you can code a short interrupt handler within the vector itself. The DSP hardware supports automatic interruptible looping with a four-level loop stack that lets code nest so that it can loop under hardware control. You can nest zero-overhead loops of as many as 255 instructions as many as four levels deep, or deeper if you use software to save and restore the stack. You can also nest single-instruction repeats within any of these loops. The program stack supports both call-return and interrupt nesting, which together can total as many as 15 calls deep.

Addressing modes: The 7701x supports memory-direct, register-indirect, and immediate addressing. Hardware supports modulo and bit-reversed addressing for each data memory.

Special instructions: The 7701x supports conditional operations to minimize jumps, register-indirect jump, register-indirect subroutine call, zero-overhead single- and block-instruction hardware loop, repeat, parallel register load/store, clip result, floating-point normalization, 1-bit shift-multiply-add, and double precision. The 7701x lacks bit-manipulation instructions.

Support: NEC supplies the WB77016 Workbench assembler/linker/loader package. NEC also supplies the SM77016 simulator, which simulates I/O through timing files that you write in a programming language. This simulator allows you to control I/O details, such as inserting data into or extracting data from a running simulation. A PC-based plug-in development board offers in-circuit emulation using the 7701x's on-chip emulation features. NEC also offers a C compiler for the µPD7701x and a starter kit and has ported the SPOX real-time OS to 7701x DSPs. A variety of middleware libraries is also available from NEC.

Architecture features: Cores are for NEC's 0.35- and 0.25-µm ASIC technology. The unit has 16-bit data words and 32-bit instruction words. The device features three internal buses for X- and Y-data and transfer buses. The MAC unit provides 8 guard bits with no automatic saturation. Circular buffers are of arbitrary size and arbitrary increment amounts.
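Bit-reversed addressing, which the entry above says the hardware handles for each data-memory bank, exists because a radix-2 FFT naturally reads or writes its data in bit-reversed index order. The sketch below shows the index permutation in software; the function names are hypothetical, and it assumes a power-of-two block length. It is an illustration of the addressing pattern, not of NEC's circuit.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Index sequence a bit-reversed address generator would produce for n points.

    Assumes n is a power of two.
    """
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this permutation in the address generator means the reordering costs no extra instructions or memory copies.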
Philips REAL DSP

Philips invented the REAL (reconfigurable embedded DSP architecture low-power/low-cost) DSP architecture for use in internal company products, such as its Genie phone and a digital telephone-answering machine, but the architecture is gaining popularity among external customers, as well. REAL features a dual Harvard architecture with two 16-bit data buses connected to its data-computation unit (DCU). The REAL DSP has two independent address-computation units (ACUs), each of which has eight address pointers that perform automatic context switching during interrupts.

Each ALU in the DCU handles 32-bit data and 8 overflow bits and stores results in four 40-bit accumulators. Alternatively, you may split each ALU into two independent 16-bit ALUs. This dual multiplier/accumulator architecture can simultaneously calculate two independent output values, although algorithm-specific memory bottlenecks may require the designer to reuse the same coefficient data for both calculations.

The REAL DSP's VHDL-synthesis model allows designers to add application-specific execution units (AXUs) at specified points in the datapath or the ACUs. Users can define AXUs or select them from library modules, such as a 40-bit barrel shifter, a normalization unit, and a division-support unit. An AXU uses the standard DSP resources, and hooks in the DSP instruction decoder enable control of an AXU. Philips reserves a few bits in the application-specific-instruction (ASI) bit patterns to control AXU hardware. The assembler handles mapping of the AXU commands.

Special instructions: The REAL DSP's instruction set uses 16-bit operation codes, but you can extend the instruction set with 96-bit ASIs, which contribute to a high level of parallelism. An on-chip RAM or ROM look-up table contains these very-long-instruction-word-like ASIs. A 16-bit instruction entering the core contains an index of this table, which activates the corresponding ASI operations. You may use as many as 256 ASIs. The assembler, linker, and instruction-set simulator account for the ASIs that you specify when writing your application program. You have to specify the keyword "asi" followed by all the operations that the core executes in parallel. The assembler/linker then checks for duplicate ASI instructions, translates the instructions to an ASI look-up table, and, if necessary, downloads them to the DSP. If the silicon implementation of the DSP's look-up table is in RAM, your application can download sets of ASIs to the chip while the application is running and dynamically customize the DSP core.

Support: Philips Semiconductor supports DSP-C, a proposed extension of ISO/ANSI C to better handle DSP-specific capabilities, such as fixed-point data types and dual Harvard architectures. The company is developing a C compiler for the REAL architecture using the Associated Compiler Experts' (www.ace.nl) CoSy compiler-development platform.

Architecture features: The DSP has a dual Harvard architecture. The data-computation unit comprises two 16×16-bit signed multipliers and two 40-bit ALUs. The device is customizable using application-specific units and instructions.
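The ASI mechanism described above (a short instruction carries an index into a table of wide, parallel instructions, and the tools remove duplicates) can be sketched as a small data structure. Everything here is an illustrative assumption: the class name, the operation strings, and the choice to treat the operations of one ASI as order-independent are hypothetical and do not reflect Philips' assembler syntax or table format. Only the 256-entry limit comes from the text.

```python
MAX_ASIS = 256  # limit stated in the text


class AsiTable:
    """Hypothetical model of a look-up table of application-specific instructions."""

    def __init__(self):
        self._index_by_ops = {}
        self.entries = []

    def add(self, operations):
        """Return the table index for a set of parallel operations, reusing duplicates."""
        key = tuple(sorted(operations))  # assumption: operation order does not matter
        if key in self._index_by_ops:
            return self._index_by_ops[key]
        if len(self.entries) >= MAX_ASIS:
            raise ValueError("ASI table is full")
        self._index_by_ops[key] = len(self.entries)
        self.entries.append(key)
        return self._index_by_ops[key]

    def decode(self, index):
        """Expand an index carried by a short instruction into its parallel operations."""
        return self.entries[index]


table = AsiTable()
first = table.add(["mpy x0,y0", "add a0,p0"])   # made-up operation strings
again = table.add(["add a0,p0", "mpy x0,y0"])   # same operations, different order
print(first, again, table.decode(first))
```

The design point this illustrates is code density: the program stream stays 16 bits wide, and the wide encoding is paid for once per distinct ASI rather than once per use.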
Texas Instruments TMS320C2000

TI based the TMS320C2000 DSPs on the 320C2xLP core that the company offers as part of its custom DSP capability. The C2000 family comprises the C20x and the C24x product families. The C20x targets low-end telecommunications, and the C24x targets digital-motor control. The two families differ in their memory and peripheral mixes.

The C2xLP core keeps the same four-stage pipeline of the C5x, which allows it to operate as fast as 40 MHz. It also has a JTAG-emulation block like the C5x, instead of the older, in-circuit emulation of the C2x. Similar to the C5x, the C2xLP can access 64,000 16-bit parallel I/O ports. The peripherals on C2000 devices, such as serial ports and software wait-state generators, are mapped in either the on-chip data or the I/O spaces. Your program must use other I/O addresses to access off-chip peripherals mapped in the I/O space. You can use slower external memories using the on-chip software wait-state generator or the chip's Ready pin. Most of the devices on the C2000 platform can generate zero to seven wait states.

The C2xLP has a central arithmetic-logic unit (CALU), which feeds the 32-bit accumulator. The accumulator also acts as one of the inputs to the CALU. The other input to the CALU comes from either the 16×16-bit multiplier through a scaling shifter or the input data-scaling shifter. The basic multiply-accumulate (MAC) cycle involves multiplying a data-memory value by a program-memory value and adding the result to the accumulator. When the C2000 repeats the MAC, the program counter automatically increments, freeing the program bus to fetch the second operand. This feature allows the MAC unit to achieve single-cycle execution. The product-scaling shifter allows as many as 128 product accumulates without overflowing the accumulator. For implementing fractional arithmetic or justifying a fractional product, the C2xLP processes the product-register output through a product shifter to eliminate the extra bit in a multiplication. Software can rotate the contents of the accumulator through the carry bit to perform bit manipulation and testing.

The devices in the C24x generation have an onboard event manager to support motor-control applications. The event manager features three up/down timers and nine comparators, which you can couple with waveform-generation logic to create as many as 12 PWM outputs. The event manager supports symmetrical (centered) and asymmetrical (noncentered) PWM-generation capabilities. It also supports a space-vector PWM state machine, which implements a scheme for switching power transistors to yield longer transistor life and lower power consumption. A deadband-generation unit also helps protect power transistors. In addition, the event manager integrates four capture inputs, two of which can serve as direct inputs for optical-encoder quadrature pulses.

The C24x family also integrates on-chip, 10-bit A/D converters. These successive-approximation converters convert an analog signal in as little as 500 nsec and have eight or 16 multiplexed input channels. Some of the newer C24x devices also offer autosequencing capabilities, which allow as many as 16 conversions in a session, and an independent sample/hold prescaler, which gives you greater flexibility by supporting a range of input impedances. Several devices in the C24x family include flash-memory arrays ranging from 8k to 32k words. Some C24x devices also allow you to program the flash at the sector level.

Addressing modes: The C2xLP supports immediate and paged-memory-direct addressing, in which 7 bits in an instruction concatenate with a 9-bit data-page pointer to access data RAM. It also supports register-indirect addressing using the 16 bits in one of eight auxiliary registers to access memory. It can automatically postincrement or decrement auxiliary registers. The C2xLP offers no circular buffering.

Special instructions: A MAC-with-data-move instruction (MACD) adds a data move for on-chip RAM blocks to the MAC unit. As the CPU uses the input data values, the CPU shifts the values to the next memory location. MACD is an alternative to using a circular buffer and is useful for convolution and transversal filtering. The C2000 also offers single-instruction repeat, multiconditional branches and calls, multiply and accumulate previous product, multiply and subtract previous product, accumulate previous product and move data, rotate accumulator left/right, store long immediate to data-memory locations, and block move.

Support: TI offers Code Composer 4.10, an integrated, unified development environment that supports editing, building, debugging, profiling, and project management. TI's $1995 Code Composer for the PC includes an ANSI C compiler, an assembler, a linker, an instruction-set simulator, and real-time analysis and data visualization. TI offers an emulator that supports JTAG scan-based emulation for nonintrusive product test. The company also supplies a C compiler, a linker, a simulator, a source-level C assembler/debugger, a profiler, and an application library. Evaluation modules, emulators, prototype cards, and application algorithms are also available from third parties. TI also offers analog devices, such as data converters and a power-management supply, which you can combine with the C2000 DSPs. See www.ti.com/sc/4123 for more details.

Architecture features: Family members operate as low as 3.3V. The Harvard architecture supports two separate bus structures. Dual-access RAM allows writes and reads to and from the RAM in the same cycle.
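The MACD idea (multiply, accumulate, and move each sample to the next memory location as it is used) can be modeled as an FIR filter whose delay line shifts as a side effect of the accumulation loop. The following is a minimal sketch with a hypothetical function name; it uses plain Python integers and does not model the C2xLP's product register, shifters, or repeat mechanism.

```python
def fir_step_macd(delay_line, coeffs, new_sample):
    """Compute one FIR output and age the delay line in the same pass.

    delay_line[0] holds the newest sample; higher indices hold older ones.
    Walking from oldest to newest lets each sample move up one slot right
    after it has been multiplied, so no value is overwritten before use.
    """
    delay_line[0] = new_sample
    acc = 0
    for i in range(len(coeffs) - 1, -1, -1):
        acc += delay_line[i] * coeffs[i]          # multiply-accumulate
        if i + 1 < len(delay_line):
            delay_line[i + 1] = delay_line[i]     # data move to the next location
    return acc


line = [0, 0, 0]
taps = [1, 2, 3]
print([fir_step_macd(line, taps, x) for x in (1, 0, 0, 0)])  # [1, 2, 3, 0]
```

Feeding an impulse returns the coefficients in order, which is a quick sanity check that the shift and the accumulation stay in step. The trade-off against a circular buffer is that every sample is physically moved on every output, which is why pairing the move with the MAC in one instruction matters.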
Texas Instruments TMS320C5000

This 16-bit, fixed-point DSP-product family includes the older generation C5x, the mainstream C54x, and the new C55x. The C55x DSPs are source-code-compatible with the TMS320C54x DSPs, and the C5x DSPs are source-code-compatible with the C2x DSPs. Although the C5x is in full production, the company is shifting new designs toward the C54x and C55x. For more information on the C5x, please refer to www.ednmag.com/ednmag/reg/1999/041599/08cs_16.htm#tms320c5x or the handy online comparison sheet. The C54x focuses on low power consumption, but the C55x takes power to a new level: TI claims that a 300-MHz C55x delivers approximately five times higher performance than and dissipates one-sixth the core power of a 120-MHz C54x.

The C55x and C54x DSPs use a modified Harvard architecture. The C55x has 12 independent buses compared with eight for the C54x. Both architectures include one program bus and an associated program-address bus. The C55x bus is 32 bits wide, and the C54x bus is 16 bits wide. The C55x has three data-read buses and two data-write buses, whereas the C54x has two data-read buses and one data-write bus. Each data bus also has its own address bus. The corresponding address buses are 24 bits wide on the C55x and 16 bits wide on the C54x.

With the ability to execute variable-length instructions, the C55x takes a significant deviation from the C54x. Whereas the C54x instructions are fixed at 16 bits, the C55x instructions range from 8 to 48 bits. The C55x's instruction-buffer unit (IU) buffers 64 bytes of code in a queue and includes the decoding logic that identifies the instruction boundaries of the variable-length instructions. The local repeat instruction uses the instruction-buffer queue to repeat or loop a block of code. The instruction-buffer queue can also perform speculative fetching of instructions while testing a condition for conditional program-flow-control instructions. The instruction decoder decodes instructions in sequential order rather than performing dynamic scheduling. This approach results in predictable execution time.

The data-computation unit (DCU) in the C55x contains dual multiply-accumulate (MAC) units that perform two 17×17-bit MAC operations in a single cycle. Each MAC unit comprises a multiplier and a dedicated adder with 32- or 40-bit saturation logic. The three data-read buses can carry two data streams and a common coefficient stream to the two MAC units. This parallelism allows the DCU to read two 16-bit operands and a 16-bit coefficient during each CPU cycle. The DCU also contains a 40-bit ALU, four 40-bit accumulator registers, a barrel shifter, and dedicated Viterbi algorithm hardware (commonly used in error-control-coding schemes and also available with the C54x). You can use the ALU to operate on 32-bit data or split it to perform dual 16-bit operations. In addition to accepting inputs from the 40-bit accumulator registers of the DCU, this ALU accepts immediate values from the IU and communicates bidirectionally with memory, the address-data-flow-unit (ADFU) registers, and the program-flow-unit (PFU) registers. The barrel shifter can supply a shifted value to the DCU's ALU as an input for further calculation.

The ADFU on the C55x contains dedicated hardware for managing the five data buses. The ADFU also provides an additional general-purpose, 16-bit ALU with shifting capability for simple arithmetic operations. This ALU accepts immediate values from the IU and communicates bidirectionally with memory, the ADFU registers, the DCU registers, or the PFU registers. Within the ADFU, the ALU can manipulate four general-purpose, 16-bit registers or any of the address-generation registers. Either the ALU or one of the three addressing-register ALUs (ARAUs) can modify the nine addressing registers used for indirect addressing. The three ARAUs provide independent address generators for each of the C55x's three data-read buses.

The C54x is a single 17×17-bit MAC machine with a dedicated 40-bit adder, two 40-bit accumulators, and a separate 40-bit ALU. The 40-bit adder at the output of the multiplier allows unpipelined MAC operations as well as dual addition and multiplication in parallel. Similar to the C55x, the C54x's ALU features a dual 16-bit configuration that enables dual single-cycle operations. Both architectures support a single barrel shifter that can shift 40-bit accumulator values as much as 31 bits to the left or right. Single-cycle normalization and exponential encoding support floating-point arithmetic. The C54x can generate one or two data-memory addresses per cycle using two auxiliary register-arithmetic units. The four internal buses and dual address generators enable multiple-operand operations. Eight individually addressable auxiliary registers and a software stack aid a C compiler's efficiency. The instruction sets support the parallelism of the architectures with many two- and three-operand instructions and some 32-bit operands.

Architecture features: The DSP is a 16-bit, fixed-point device. The C55x has 12 buses versus eight for the C54x. The C55x has dual MAC units; the C54x has one MAC unit. The C55x has variable instruction size and no alignment restrictions. The devices operate at 0.9V at 300 MHz.
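The dual-MAC arrangement described above, two data streams sharing one coefficient stream, maps naturally onto computing two adjacent FIR outputs at once. The following is a minimal sketch with hypothetical function names; it uses unbounded Python integers and does not model the 17×17-bit multipliers, the 40-bit accumulators, or saturation.

```python
def dual_mac(stream_a, stream_b, coeffs):
    """Accumulate two sums of products that share a single coefficient stream."""
    acc_a = 0
    acc_b = 0
    for a, b, c in zip(stream_a, stream_b, coeffs):
        acc_a += a * c   # first MAC unit
        acc_b += b * c   # second MAC unit, same coefficient
    return acc_a, acc_b


def fir_pair(samples, coeffs, n):
    """Return FIR outputs y[n] and y[n+1]; assumes len(coeffs) - 1 <= n < len(samples) - 1."""
    taps = len(coeffs)
    stream_a = [samples[n - k] for k in range(taps)]
    stream_b = [samples[n + 1 - k] for k in range(taps)]
    return dual_mac(stream_a, stream_b, coeffs)


print(fir_pair([1, 2, 3, 4, 5], [1, 10, 100], 2))  # (123, 234)
```

Each loop iteration corresponds to one cycle's worth of bus traffic: two data operands and one coefficient, which is why three data-read buses are enough to keep two MAC units busy.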
The C55x has a PFU that tracks a program's execution point and generates the 24-bit addresses for instruction fetches for as much as 16 Mbytes of program memory. This unit includes hardware for looping and for speculative branching. It also includes hardware to support conditional repeats, conditional execution, and pipeline protection. The PFU provides three levels of hardware loops by nesting a block-repeat operation within another block-repeat operation and including a single repeat in either or both of the repeated blocks. A separate program counter is dedicated to fast returns from subroutines or interrupt service routines. The PFU also includes the logic for managing the instruction pipeline and the four CPU status registers. The PFU handles pipeline-control hazards and provides protection against write-after-read and read-after-write data hazards. When such data hazards occur in a C55x instruction stream, the pipeline-protection logic inserts cycles to maintain the intended order of operations and correct execution of the program.

All devices support on-chip dual-access RAM (DARAM) that you can configure as data or program memory. An integrated software wait-state generator in both DSPs allows you to use slower external memories. The C55x has expanded options for synchronous-burst RAM, synchronous DRAM, and asynchronous SRAM and DRAM.

A PLL allows you to throttle the clock, but the C55x core can also actively and automatically manage power consumption of on-chip peripherals and memory arrays. When a program is not accessing individual on-chip memory arrays, they switch into a low-power mode. The processor provides similar control for on-chip peripherals. Peripherals can enter low-power states when they are inactive and respond to processor requests and exit their low-power states without latency. The C55x also implements user-controllable, low-power Idle domains. These domains are sections of the device that you can selectively enable or disable under software control. When you disable a domain, it enters the Idle state, maintaining register or memory contents. When you enable the domain, it returns to normal operation. On initial C55x devices, the Idle domains are the CPU, the DMA, the peripherals, the external memory interface, the instruction queue, and the clock-generation circuitry.

Addressing modes: The C54x supports single data-memory-operand addressing that also supports 32-bit operands. It provides immediate, memory-mapped, circular, and bit-reversed addressing; register-indirect addressing; and the direct-addressing, or displacement, mode. It also supports dual-data-memory-operand addressing, which parallel instructions use. The C54x supports two circular buffers of arbitrary lengths and locations. In addition to the C54x modes, the C55x supports absolute addressing. The C55x's ADFU includes dedicated registers to support circular addressing for instructions that use indirect addressing. Your program can simultaneously use as many as five independent circular buffer locations with as many as three independent buffer lengths. These circular buffers have no address-alignment constraints.

Special instructions: The C54x performs dedicated-function instructions, such as FIR filters, single and block repeat, eight parallel instructions (for example, parallel store and multiply accumulate), multiply and accumulate and subtract (10 multiply instructions), and eight dual-operand memory moves. The C55x also has special instructions that take advantage of the additional functional units as well as increase parallelism capabilities. User-defined parallelism allows you to combine certain instructions to perform two operations. You can also combine a built-in parallel instruction with a user-defined parallel instruction.

Support: The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm). Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C5000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details.
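Circular addressing with no address-alignment constraint, as described for the C55x buffers, amounts to wrapping a pointer within an arbitrary base and length. The sketch below shows the address arithmetic only; the function name and parameters are hypothetical, and it does not model the C55x's dedicated circular-addressing registers.

```python
def circular_next(address, step, base, length):
    """Advance `address` by `step` inside the buffer [base, base + length).

    The base may be any address and the length any positive size, so the
    buffer needs no power-of-two alignment. Negative steps wrap as well.
    """
    return base + (address - base + step) % length


base, length = 0x1003, 5      # deliberately unaligned base and odd length
address = base
walk = []
for _ in range(6):
    walk.append(hex(address))
    address = circular_next(address, 2, base, length)
print(walk)
```

Some older DSPs require circular buffers to start on a power-of-two boundary so the wrap can be done by masking address bits; removing that constraint, as the text says the C55x does, trades a slightly more complex address unit for simpler memory allocation.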

Texas Instruments TMS320C6000

VLIW architecture has RISC-like characteristics. The C compiler is closely tied to the architecture. Eight functional units deliver parallelism. TI expects the first C64x device to hit 750 MHz.

TI's TMS320C6000 is a general-purpose DSP based on a very-long-instruction-word (VLIW) architecture. This architecture includes the fixed-point C62x, the floating-point C67x, and the new C64x families. TI created the C67x by adding floating-point capability to six of the C62x's eight functional units, so the C67x instruction set is a superset of the C62x instruction set. (Unless otherwise indicated, all C62x details that follow also apply to the C67x.) The C64x is object-code-compatible with the C62x but with significant architectural enhancements and an initial operating frequency of 750 MHz.

This architecture comprises dual datapaths and dual matching sets of four functional units. The eight functional units on the C64x and C62x comprise two multiply (M) units and six 32-bit arithmetic units with a 40-bit ALU and a 40-bit barrel shifter. The C6000 lacks a dedicated multiply-accumulate (MAC) unit. Instead, it performs MAC operations by using separate multiply and add instructions. Although this operation requires two instruction cycles, the pipelined effect yields apparent single-cycle execution. The C64x M units perform two 16×16-bit multiplies every clock cycle, compared with one multiply on the C62x. In addition, each M unit on the C64x can perform four 8×8-bit multiplies every clock cycle. Each C64x multiplier can return a result as large as 64 bits, so an extra write port is available from the multipliers to the register file.

The C64x also has beefed-up capability in other functional units. For example, the logical (L) units can perform byte shifts and quad 8-bit subtractions with absolute value. This absolute-difference instruction benefits motion-estimation algorithms. TI also added bidirectional variable shifts to the M and S units. The C64x D unit can perform 32-bit logical instructions in addition to the S and L units. The L and D units can load 5-bit constants in addition to the S unit's ability to load 16-bit constants. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms, such as binary morphology, image-metric calculations, and encryption algorithms.

In the C64x, each functional-unit set has its own bank of 32 32-bit registers; the C62x has 16 32-bit registers per bank. A program can use the general-purpose registers for data, data-address pointers, or condition codes. You can store values larger than 32 bits in register pairs. In all C6xxx devices, you can use registers A4 through A7 and B4 through B7 for circular addressing. The C62x register files support packed 16-bit data through 40-bit fixed-point and 64-bit floating-point data. The C64x register file supports all the C62x data types, packed 8-bit types, and 64-bit fixed-point data types. Packed data types store four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair.

Any member of a functional-unit set can access the other functional-unit set's register bank, but the functional-unit set performs this procedure through one data bus. All units except the two data units have a data cross-path to the other set of units. In the C62x, only one functional unit per datapath per execute packet could access an operand from the opposite register file. The C64x data-cross-path accesses allow multiple units per side to simultaneously read the same cross-path source. Thus, one, multiple, or all the functional units on a side in a VLIW-execute packet may use the cross-path operand for that side.

The C6000 families support no separate X- and Y-memory spaces. Instead, they provide a single data memory with two 64- or 32-bit paths, depending on the family, for loading data from memory to the register banks. Two other 32-bit paths (64 bits for C64x) store register values to memory. A 32-bit address bus supports these datapaths. The C62x requires alignment on 32- or 64-bit boundaries. The C64x can also access words and double words at any byte boundary using nonaligned loads and stores.

A 32-bit address bus addresses the program memory, but the single datapath is 256 bits wide. This width allows the C62x to fetch, but not necessarily execute, eight 32-bit instructions per cycle. TI calls this approach a fetch packet. The CPU can execute one to eight instructions per cycle. The programming tools link instructions in a fetch packet by the least significant bit of an instruction. If the bit is set, the C6000 executes the instruction in parallel with the subsequent instruction. Multiple execute packets allow fully parallel, fully serial, or parallel/serial combinations. Thus, eight serial instructions require the same code size as eight parallel instructions. The C62x architecture does not allow execute packets to cross fetch-packet boundaries, resulting in compiler-generated nonoperation (NOP) instructions to pad fetch packets. The C64x architecture resolves this "code-bloat" issue with instruction packing in the instruction-dispatch unit. This approach removes execute-packet-boundary restrictions and eliminates all filler NOP instructions.

The compiler and assembly optimizer play big roles in establishing the sequence of instructions for the C6000 to execute, but data dependencies, instruction latencies, and resource conflicts limit optimal performance.

Two devices from these families, the C6211 and C6711, are the industry's first DSPs with L1 and L2 on-chip cache memory. The C6211 incorporates a two-level cache structure with 4-kbyte Level 1 program and data caches. The internal Level 2 cache memory is a unified 64-kbyte data and instruction RAM. The C6211 also includes a 16-channel DMA controller that tracks 16 independent transfers and allows you to link each channel to a subsequent transfer. The C6202, C6203, and C6204 have a 32-bit expansion bus that replaces the 16-bit host-port interface and complements the external memory interface (EMIF). The EMIF and the expansion bus are independent of each other, allowing the CPU to perform concurrent accesses to both ports. The second bus for I/O devices reduces the loading on the EMIF and increases data throughput.

Special instructions—All C6000 processors conditionally execute all instructions, a method of reducing branching and, therefore, keeping the pipeline flowing. The C64x's branch-and-decrement (BDEC) and branch-on-positive (BPOS) instructions combine a branch instruction with the decrement and test positive of a destination register, respectively. On the C64x, a program can use any register as a loop counter, which can free the standard-condition registers for other uses. Another instruction helps reduce the number of instructions needed to set up the return address for a function call.

TI also added instructions that operate directly on packed 8- and 16-bit data, including saturation support. All functional units can perform dual 16-bit addition/subtraction, compare, shift, minimum/maximum, and absolute-value operations. The M units, and four of the six remaining functional units, support quad 8-bit addition/subtraction, compare, average, minimum/maximum, and bit-expansion operations. On the C64x, the MPYU4 instruction performs four 8×8-bit unsigned multiplies. The ADD4 instruction performs four 8-bit additions. The C64x provides data packing and unpacking operations to allow sustained high performance for the quad 8-bit and dual 16-bit hardware extensions. Unpack instructions prepare 8-bit data for parallel 16-bit operations. Pack instructions return parallel results to output precision.

The dual 16-bit arithmetic combines with six of the eight functional units and a bit-reverse (BITR) instruction to improve FFT cycle counts by a factor of two. The quad-absolute-difference instruction bolsters motion-estimation performance by a factor of 7.6 on a per-clock-cycle basis for an 8×8-bit minimum-absolute-difference (MAD) computation. Special average instructions improve the performance of motion compensation by a factor of seven on a per-clock-cycle basis versus the C62x. The Galois field-multiply instruction (GMPY4) provides a performance boost over the C62x for Reed-Solomon decoding using the Chien search.

Addressing modes—The C6000 performs linear and circular addressing. However, unlike most other DSPs that have dedicated address-generation units, the C6000 calculates addresses using one or more of its functional units.

Support—The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm). The Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C6000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. The assembly optimizer simplifies assembly-language programming and automatically schedules and parallelizes instructions from serial, inline assembly code. The assembler reads straight-line code without regard to registers or functional units and does the resource assignment. The assembly optimizer performs dependency checking and parallelism among instructions. Therefore, the code executes as programmed on independent functional units and eliminates the need for core features, such as out-of-order execution or dependency-checking hardware. Deterministic operation allows the debugger to lock-step through the code. The debugger performs code profiling to determine the amount of time the processor spends in various portions of the code. TI offers hardware-emulation boards and starter kits. Third-party tools and application algorithms are also available. Free tools are available for a 30-day trial on the Web at www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm. See www.ti.com/sc/4123 for more details.
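
The way a C6000 fetch packet breaks into execute packets can be illustrated with a short sketch. The Python below is not TI tooling. It assumes only what the entry states: the least significant bit of each 32-bit instruction word marks whether that instruction runs in parallel with the one that follows. The function name `split_execute_packets` and the sample instruction words are hypothetical.

```python
def split_execute_packets(fetch_packet):
    """Group the instruction words of one fetch packet into execute packets.

    Assumption (taken from the text): bit 0 of each word is the parallel bit.
    When it is set, the instruction executes in parallel with the next one.
    """
    packets = []
    current = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:
            # Parallel bit clear: this instruction closes the execute packet.
            packets.append(current)
            current = []
    if current:
        # Trailing chain whose last bit is still set; kept as its own group
        # here (an illustrative choice, not a statement about the hardware).
        packets.append(current)
    return packets


# Hypothetical 8-word fetch packet; only bit 0 of each word matters here.
words = [0x01, 0x03, 0x02, 0x04, 0x05, 0x07, 0x09, 0x00]
groups = split_execute_packets(words)
sizes = [len(group) for group in groups]
print(sizes)
```

Eight words with every bit clear give eight one-instruction packets (fully serial), and seven set bits followed by a clear one give a single eight-instruction packet (fully parallel). Either way the code occupies the same eight words, which is the point the entry makes about code size.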

3DSP SP-3 and SP-5 DSP cores

Architecture is dual-issue superscalar. DSP operates on 8-, 16-, and 32-bit data. Packed operations support SIMD. Addressing modes support matrix operations.

3DSP, a venture-capital-funded company that started up in 1997, focuses on providing fixed-point, synthesizable, static DSP cores in VHDL or Verilog. You can also purchase the DSPs as hard cores. The product offerings include the scalar SP-3 and the binary-code-compatible, superscalar SP-5. The company designed its DSPs modularly to allow reconfigurability. You can add application-specific instructions to optimize performance.

A key feature of both cores is the memory-to-register-file architecture, in which the operands of the instruction come from either the data memory or the register file. The result of the instruction goes only to the register file. This approach eliminates unnecessary memory fetches and, therefore, reduces power consumption.

The SP-3 includes a multiply-accumulate (MAC) unit, a saturate/add unit, a shifter unit, and a logic unit. The SP-5 includes an extra MAC and an extra saturate/add unit. The saturate/add units perform 48-, 32-, 24-, or 16-bit saturation, as well as a 32-bit, two 16-bit, or four 8-bit adds. The shifter unit can handle both signed and unsigned shifts, a barrel shift, bit extraction, byte packing, and bit insertion. The logic unit performs various logic functions as well as leading-one detection and compare/select.

The baseline SP-3 and SP-5 cores have two and four 48-bit accumulators, respectively. These accumulators combine with the MAC units to handle 24×16-bit, two 16×16-bit, or four 16×8-bit MAC instructions. In both DSPs, you can link two of the accumulators to provide 84-bit accuracy for multiprecision MAC operations. Configuration options for the SP-3 and SP-5 include two and four 56-bit accumulators, respectively, and 32×24- and 32×16-bit multipliers. Both cores have single-instruction-multiple-data (SIMD) capability. One instruction can manipulate four 8-bit operations, two 16-bit operations, or one 32-bit operation.

Both cores use one program memory and two data memories. A system developer can program one-fourth of the program-memory space as set-associative cache; you can also partition the memory into multiple banks to improve DMA performance. The SP-3 and SP-5, respectively, fetch one and two 32-bit instructions from the instruction cache every clock cycle. The SP-5 can execute two instructions in a single cycle. The SP-3 and SP-5, respectively, can perform a maximum of two and four load/store operations every clock cycle. The CPU fetches no instructions if the pipeline is full or if a cache miss occurs.

The cores incorporate a five-stage pipeline that handles all data dependencies and hazards in hardware and issues a nonoperation instruction when the data is unavailable in time. The decoding unit of the CPU checks all data dependencies between instructions in the pipeline. If an operand is unavailable for execution, the decoding logic stalls the instruction until it is available. If the operand is unavailable in the register file but available in the pipeline, the decoding logic forwards the operand accordingly.

The DSP's process-control instructions support zero-overhead looping, static-branch prediction, and dynamic-branch resolution to minimize branch penalties. If the CPU doesn't resolve the branch condition, it predicts and issues the branch result. The CPU resolves the speculative branch as soon as it determines the branch condition. If the branch prediction is correct, the DSP incurs no penalty. If the CPU mispredicts the branch, the penalty ranges from one-half to two cycles. The DSPs support prioritized interrupts and 32 levels of nesting.

Special instructions—Huffman-encoding-related instructions target image-compression applications. The DSPs conditionally execute all instructions.

Addressing modes—The SP-3 and SP-5 have four page registers to support register-indirect addressing. The page registers provide a convenient way to index array-data structures. Eight circular buffers support postincrement or postdecrement indirect addressing. The address-generation unit has eight circular buffers and four page registers. The SP-3 and SP-5 can perform two or four address calculations per cycle, respectively, and both can generate two new addresses for the next cycle. Using the circular buffers, the SP-3 and SP-5 CPUs can read as many as two and four data operands, respectively, from memory in a cycle. The circular buffer can also access multiple variable blocks of memories with no additional setup requirements. The circular buffer supports transpose mode, which allows the programmer to perform 2-D matrix operations. You can use the bit-reverse mode in the circular buffer for FFT operations.

Support—The vendor's development-tool set includes an optimized C compiler, a cycle-accurate simulator, a software library, and a graphical-user-interface-based hardware debugger. It has a JTAG-compatible debugging port. An RTOS kernel supports multitasking applications and handles multiple priorities for tasks, semaphore queues for synchronizing events between tasks, state preservation, and system functions to handle interrupts and DMA requests. The company also provides software libraries, including support for digital-still-camera applications, standard voice codecs, FFTs, and a 2-D discrete-cosine transform.

The company also offers a synthesizable on-chip system bus controller with 10 prioritized channels, 600-Mbyte/sec throughput, and a cycle-by-cycle multiplexed system bus. This controller includes support for 2-D data transfers, chained transfers, and infinite transfers. It also includes multiprocessor support and a variable-depth buffer to maximize both on-chip systems.
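
The packed SIMD behavior described for the SP-3 and SP-5, in which one instruction manipulates four 8-bit operations, can be sketched in software. The Python below is an illustration only, not 3DSP's instruction semantics: the function name `packed_add_u8` is hypothetical, and lane-wise wraparound (rather than saturation) is an assumption of the sketch.

```python
def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes packed into 32-bit words.

    Each lane wraps modulo 256, and a carry never crosses into the next lane.
    Wraparound instead of saturation is an assumption of this sketch.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result


x = 0x10FF2030
y = 0x01020304
packed = packed_add_u8(x, y)
plain = (x + y) & 0xFFFFFFFF
print(hex(packed), hex(plain))
```

The two results differ in the top byte: an ordinary 32-bit add lets the carry out of the 0xFF lane spill into its neighbor, whereas the packed add keeps each 8-bit lane independent.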

Zilog Z893x1/Z893x3

Device has an accumulator-based, modified Harvard architecture. Device does no hardware-repeat looping. Features include an external I/O bus, a high-speed SPI, three counter/timers, and as many as three 8-bit ports.

Zilog built the Z893xx family's architecture around a single-cycle multiply-accumulate (MAC) unit, which includes a 24-bit product register and a 24-bit accumulator and arithmetic-logic unit with no guard bits. The MAC unit includes a 16×16- to 16×24-bit multiplier with automatic truncation. Two internal bus sets (a program-address/data-bus set and a data-address/data-bus set) allow the processor to access program and data concurrently with a MAC operation. The architecture's two RAM blocks can hold coefficients and data samples, which automatically feed directly into the MAC's input registers each cycle. RAM-block addressing automatically increments or decrements the address, which eliminates the need for data-address-generation code for each MAC cycle. Results of the MAC operation land in the product register and the 24-bit accumulator during each cycle. You can treat the product register as a general-purpose register when it is not performing multiplies. Although the Z893xx lacks a barrel shifter, a shifter between the product register and ALU allows you to shift the product result right by 3 bits before adding it to the accumulator.

The DSP runs from a 4k- or 8k-word ROM or one-time-programmable (OTP) program ROM. Running code from external memory takes one additional cycle for each instruction. It also has a PLL-driven system clock that drives the DSP to operate as fast as 20 MHz from a 32-kHz watch crystal.

You can use the external I/O bus to access external peripheral devices. An external read/write takes one cycle; the chip reads the data in one cycle, but the data is unavailable for processing until the next instruction cycle. You can insert one wait state using software control to access slow peripherals; you can use the Wait pin for additional wait states. You can adapt many general-purpose peripherals, such as an ADC, to this interface.

Z893x1 devices have a codec interface that is compatible with 8-bit PCMs, 16-bit codecs, and 64-bit stereo sigma-delta codecs, including 8- and 16-bit ADCs and DACs. You can also use the interface as a high-speed serial port or general-purpose counter. The Z893x1 chips also have one 13-bit timer for the codec interface and one 13-bit timer for general-purpose uses. You can concatenate these timers for extended timing. The Z893x3 has an 8-bit half-flash ADC.

Special instructions—The Z893xx performs compare register to accumulator, bit manipulation, and conditional branching and subroutine calls. It provides one-cycle conditional execution of certain instructions.

Addressing modes—The Z893xx supports memory-direct addressing for as many as 512 RAM-based words; short-form direct addressing using 16-bit data registers in RAM; and external-peripheral addressing, treating the peripheral as a register. It also supports register-indirect addressing to RAM or ROM with pointer registers and immediate. Modulo-addressing options include Modulo 2 to 256 for data access.

Support—Zilog offers its Zilog Developer Studio, which comprises a macroassembler, a linker/loader, a simulator, and a source-level debugger. Also available is the 3xx Compass/IDE, which comprises a C compiler, an assembler, a linker, and application libraries. Zilog offers emulators and evaluation boards, OTP programming adapters, and target emulation pods supporting a design-in environment. Check out www.zilog.com/products/dspapp.html for additional information.
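
A small numeric sketch shows why a 3-bit right shift ahead of the accumulator matters on a 24-bit accumulator with no guard bits, as described for the Z893xx. This is an illustration under stated assumptions, not a model of Zilog's datapath: products are treated as ready-made signed 24-bit values, overflow is assumed to wrap in two's complement, and the names `wrap_signed` and `accumulate` are hypothetical.

```python
ACC_BITS = 24  # accumulator width from the text; no guard bits


def wrap_signed(value, bits=ACC_BITS):
    """Wrap an integer to a signed two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def accumulate(products, pre_shift=0):
    """Sum signed 24-bit products, shifting each right by pre_shift bits first.

    Wraparound on overflow is an assumption of this sketch.
    """
    acc = 0
    for product in products:
        acc = wrap_signed(acc + (product >> pre_shift))
    return acc


# Eight full-scale positive products: the worst case for an 8-tap sum.
full_scale = [0x7FFFFF] * 8
unshifted = accumulate(full_scale)
shifted = accumulate(full_scale, pre_shift=3)
print(unshifted, shifted)
```

Without the shift, the eight-term sum overflows the 24-bit range and wraps to a small negative number. Shifting each product right by 3 bits divides it by 8, so as many as eight worst-case products fit, at the cost of 3 bits of precision per product.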

dspdirectory: 24 bits

Motorola DSP563xx

Architecture is register-based. Device has a seven-stage pipeline, comprising two fetches, one decode, two address generations, and two executions. Device has conditional-ALU instructions. Filter coprocessor operates in parallel with the core. Six-channel DMA operates concurrently with core's execution units. Most devices operate at 3.3V and have 5V-tolerant I/O; some operate at 1.8V and have 3.3V-tolerant I/O.

The 563xx is Motorola's highest performance fixed-point DSP architecture. The 563xx is binary-code-compatible with the 56000. The newest members of the 56300 family are the 56307 and 56311. These devices include an on-chip enhanced filter coprocessor (EFCOP) that processes filter algorithms in parallel with the DSP's core operation. It operates in modes to perform real- and complex-FIR, infinite-impulse-response filtering, adaptive-FIR filtering, and multichannel-FIR filtering. The EFCOP has its own access to memory, as well as a port into DMA. The EFCOP provides performance improvements for tasks such as voice coding and echo cancellation.

When the processor executes a single-cycle multiply-accumulate (MAC) operation, the first execution stage does the multiply, and the second stage does the accumulate. This approach permits execution to "catch up" with data dependencies. The 563xx uses an interlocking mechanism that automatically inserts a nonoperation (NOP) instruction into the pipe to avoid stalls, achieving single-cycle instruction execution. Although a branch penalty is three cycles, the 563xx supports conditional ALU instructions, which often avoid the need to change program flow. The device can conditionally execute parallel ALU instructions based on all possible condition codes. If the test condition is false, the processor executes an NOP instruction.

Unlike the DSP56000, which has a 16-location stack limit, the DSP563xx implements an overflow mechanism for the on-chip hardware stack to off-chip data memory. Although the mechanism prevents unrecoverable stack overflows, the chip takes a two-clock penalty when externally dumping stack entries.

You can program the size of the program RAM, instruction cache, X-data RAM, and Y-data RAM. The DMA has separate address and data buses. The DMA transfers data among memories (P, X, and Y) or among memory and peripherals or the external host buses (PCI or ISA).

The static core operates from dc to 80 MHz and uses a PLL with a built-in prescaler that allows dynamic clock throttling. For additional power savings, the core automatically powers down unused memories, peripherals, and core logic on every instruction.

Special instructions—The 563xx's barrel shifter supports multibit-shift instructions in both directions and by any number of bits. The shifter also supports instructions for bit-stream parsing and generation. The 563xx performs 16-bit arithmetic that is useful for handling various compression algorithms, such as LD-CELP (low-delay code-excited linear prediction). Normally, when using a 24-bit architecture for 16-bit arithmetic, performance degrades, because you have to round the 24-bit numbers in software.

Addressing modes—The 563xx supports register-direct, address-register-indirect, PC-relative, and immediate absolute addressing, but the 563xx also supports addressing modes that include address-register program-counter (PC) relative. This mode is useful for multitasking and position-independent code, which lets a programmer deliver and relocate object modules without relinking to the original code. Motorola expanded addressing on the 563xx to the full 24 bits, up from 16 bits on the 56000 family.

Support—Motorola backs the 563xx family with a host of development tools. Motorola provides the Suite56 hardware- and software-development tools for the DSP563xx family. A range of third-party tools complements these tools. The 563xx's JTAG-based OnCE port allows you to examine all internal buses in real time and record the last 12 change-of-flow instructions. You can use an application-development system to evaluate the chip and debug target systems. The system comes with an application-development module, a command converter, a host-interface card, software, a simulator, an assembler, and a C compiler. The Motorola software tools are available on CD-ROM, or you can download them from www.motorola.com/SPS/WIRELESS/dsptools/index.htm. Metrowerks, now part of Motorola, will unify the look and feel of development tools for new processors under the Metrowerks Code Warrior style. Third-party tools include a compiler and debugger from Tasking (www.tasking.com) and a debugger from Domain Technologies (www.domaitec.com).
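
The software rounding step that the 563xx entry says slows 16-bit arithmetic on an ordinary 24-bit architecture can be sketched as follows. This is a generic illustration, not Motorola's algorithm: the round-half-up rule, the saturation limits, and the function name `round_24_to_16` are assumptions.

```python
def round_24_to_16(value):
    """Round a signed 24-bit value to a signed 16-bit value.

    Assumptions of this sketch: round half up by adding half of the
    discarded range (0x80) before dropping the low 8 bits, then saturate
    to the 16-bit signed range.
    """
    rounded = (value + 0x80) >> 8
    return max(-0x8000, min(0x7FFF, rounded))


print(hex(round_24_to_16(0x123480)))   # low byte 0x80 rounds up
print(hex(round_24_to_16(0x12347F)))   # low byte 0x7F rounds down
print(hex(round_24_to_16(0x7FFFFF)))   # would overflow, so it saturates
```

Each result needs an add, a shift, and a range check in software, which is the per-sample overhead that hardware support for 16-bit arithmetic avoids.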

dspdirectory: 32 bits

Analog Devices SHARC DSP

SHARC architecture is known for having large integrated SRAMs. The new Hammerhead operates on SIMD. DSP supports fixed and floating point.

The 32-bit fixed- and floating-point SHARC DSPs, or ADSP-2106x and the second-generation ADSP-2116x, integrate four internal buses, a large on-chip memory, and an I/O controller to offload I/O. Both single-instruction-multiple-data (SIMD) and single-instruction-single-data (SISD) versions are available. Within the SISD CPU core, the ALU, multiplier, and shifter operate in parallel to perform multifunction, single-cycle instructions. The SIMD core adds a second compute block that includes an additional parallel ALU, a multiplier, a shifter, and a register file. This arrangement allows both computation blocks to process the same instruction but operating on different data. For example, the ADSP-2106x DSPs can conditionally execute a multiply, an add, a subtract, and a branch in one instruction; the ADSP-21160 duplicates this action.

The SHARC DSP uses a general-purpose, 10-port, 32-register data-register file to transfer data between the computation units and the data buses and to store intermediate results. The 48-bit instruction word accommodates a variety of parallel operations for concise programming.

SHARC DSPs feature an enhanced Harvard architecture in which the data-memory bus transfers data and the program-memory bus transfers both instructions and data. With its separate program- and data-memory buses and on-chip instruction cache, the processor can simultaneously fetch two operands and an instruction from cache in one cycle. The 32-entry, 48-bit-wide instruction cache is selective, caching only the instructions whose fetches conflict with accesses to program-memory data. The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory or cache and permits I/O from on- or off-chip peripherals, all in one cycle.

Some SHARC chips contain as much as 512 kbytes of on-chip memory organized into two banks of dual-port RAM. You can use this memory to store a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for code and data, data memory for data, and an off-chip load using the chip's I/O controller. The dual-ported, dual-banked nature of the memory, combined with the I/O processor, allows the core and the DMA to simultaneously access internal SRAM. SHARCs offer a unified address space using a 32-bit address bus and a 32- or 48-bit data bus. The SHARC's CPU executes using on- or off-chip memory. For a 100-MHz clock, the chip supports a 10-nsec access time with zero-wait-state memory. The lowest priced SHARC DSP, the ADSP-21065, also provides a synchronous DRAM (SDRAM) interface that transfers data to and from SDRAM as fast as 240 Mbytes/sec, or twice the clock frequency. The glueless SDRAM interface can access 16- or 64-Mbyte SDRAMs and enables you to connect to any one of four external memory banks.

SHARC DSPs feature two data-address generators (DAGs), which implement circular data buffers. The DAGs automatically handle address-pointer wraparound. These DAGs contain sufficient registers to allow you to create as many as 16 primary and 16 secondary circular buffers, which may start and end at any memory location.

SHARC's I/O controller executes I/O transfers in parallel with CPU execution. The I/O controller manages all DMA channels, transferring data among internal and external memory and all peripherals, such as the host port, as many as eight serial ports, and six link ports. The I/O controller separates these transfers from mainstream DSP operations. All DMA operations generally do not interrupt or delay core thread execution. The I/O controller offloads reads and writes between on- and off-chip memory, peripherals, and a host processor. Transfers pass through the I/O ports to and from internal memory, peripherals, or a host processor. The DMA controller allows you to dynamically control the external-memory-bus width.

SHARC chips have two high-speed serial ports and a host/parallel port, providing a direct interface to off-chip memory, peripherals, or a host processor. The synchronous serial ports support time-division-multiplexed serial streams and hardware companding and can transfer data as fast as 40 Mbps. A parallel port serves as a direct interface to off-chip memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. The host treats the SHARC as a memory-mapped device with direct writes or reads to internal memory. As many as six SHARCs can share this interface with a host processor.

The 21160, 21060, and 21062 provide six communication ports for array multiprocessing. In all but the ADSP-21065L, the six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 600 Mbytes/sec. These ports feed through the I/O controller and let you create meshes of DSPs that can access each other's memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports. Link ports facilitate interprocessor communication and bus arbitration among as many as six ADSP-2106x chips.

SHARC supports IEEE-754 single-precision, floating-point (23-bit data, 8-bit exponent, and sign bit) and a 40-bit extended IEEE format for additional accuracy (32-bit data).

Special instructions—SHARC provides bit manipulation; division iteration; reciprocal of square-root seed; fixed- and floating-point compare; single and block repeat with zero-overhead looping; conditional subroutine call; and conditional execution of most instructions.

Addressing modes—SHARC offers immediate, indexed, circular-modulo, bit-reversed, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)

Support—Analog Devices' software- and hardware-development tools include the company's VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices based the SHARC assembly language on an algebraic syntax. An EZ-Kit Lite consists of an evaluation board and limited but full-featured VisualDSP. Analog Devices' emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms.
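
The address-pointer wraparound that the SHARC data-address generators handle automatically can be mimicked in software to show what the hardware saves. The Python below is a sketch only, not Analog Devices code: the function name `advance_circular` and the sample base address are hypothetical, and the sketch assumes the step size never exceeds the buffer length.

```python
def advance_circular(index, modify, base, length):
    """Post-modify an address pointer and wrap it inside [base, base + length).

    Assumption of this sketch: abs(modify) <= length, so one correction
    is always enough to bring the pointer back into the buffer.
    """
    index += modify
    if index >= base + length:
        index -= length
    elif index < base:
        index += length
    return index


# A 5-word buffer that starts at an arbitrary (hypothetical) address.
base, length = 0x1003, 5
addr = base
offsets = []
for _ in range(7):
    offsets.append(addr - base)
    addr = advance_circular(addr, 2, base, length)
print(offsets)
```

Because the base need not be aligned to a power of two, the buffer can start and end at any memory location, which matches the entry's description. In hardware the compare-and-correct step costs no instruction cycles.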
DSP directory, 32 bits
Texas Instruments TMS320C3x

Device features 32-bit, floating-point architecture. Harvard-like architecture has a von Neumann-like programming environment.

TI’s TMS320C3x integrates a 32-bit, floating-point DSP multiply-accumulate (MAC) core in a parallel, flexible architecture. On the µP side, the C3x supports a unified, 24-bit address space (16M 32-bit words). On the DSP side, the C3x processor performs single-cycle MAC processing. The processor receives the next instruction while accessing two data values for the current instruction’s MAC cycle.

The TMS320C3x DSP comprises memory/access, central-core, and I/O subsystems. The memory/access subsystem comprises separate program, data, and DMA buses, which allow parallel program fetches, data reads and writes, and DMA operations. Two address generators in the subsystem generate the addresses to access the data memories. The data-address buses share a data bus that can make two sequential RAM accesses in one cycle because the buses run at twice the speed of the processor core. The two 4-kbyte dual-access RAM blocks hold parameters and constants for sum-of-products MAC processing, and a 32-kbyte ROM can hold code or coefficients for MAC processing (C30 only). The new TMS320VC33 has two additional, 16k 32-bit RAM blocks. One 64-word, lockable, on-chip program cache automatically loads as the DSP accesses instructions from external memory.

The central core has its own set of buses to move data and results. These buses move data among internal registers, an integer/floating-point multiplier, a 32-bit barrel shifter/ALU, and the memory subsystem. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file of 32 registers. The core stores results in extended-precision or auxiliary registers. The C3x uses a software stack to support context switching.

The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals. This internal busing scheme enables programs to access the next instruction and two data values simultaneously and to transfer data to or from the I/O subsystem in one cycle.

The C3x family does not support IEEE floating-point formats to help reduce core and code size. The C3x format uses an implied sign bit to increase precision. In most applications, the difference in data format is relevant only if you are passing the data to another processor. However, you can convert between the C3x and IEEE formats if necessary. The C3x also performs fixed-point math based on 24-bit-wide data, or the lower 24-bit mantissa of the floating-point registers. Although most designers use the C3x for its floating-point capability, fixed-point math is occasionally useful for functions such as clipping of image data.

Addressing modes—The C3x supports register-direct, register-indirect, paged-memory-direct, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.

Special instructions—The C3x performs single- or block-instruction hardware looping (supports nestable block repeats but lacks automatic save and restore of status), standard branches, which empty the pipe, delayed branches, which wait three program cycles before changing the program counter, computed gotos (dynamic subroutine calls), interlocked access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked), and conversion of floating-point to integer and vice versa. You can also specify instructions to execute in parallel.

Support—TI supplies full-speed in-circuit emulators, evaluation modules, and starter kits. The C3x, except for the C33, lacks JTAG support but has a proprietary five-pin emulation interface. TI sells a tool set that includes a C compiler, an assembler/linker, a simulator, a source-level debugger, a code profiler, and an application library. Third-party tools include C and ADA compilers, filter-design packages, multiple OS products, advanced graphical-design tools, and hardware tools. Check out www.ti.com/sc/docs/tools/dsp/index.html for more information.

www.ednmag.com | March 30, 2000 | edn 95
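The circular and bit-reversed addressing that the C3x entry mentions can be pictured with a short software model. The Python sketch below is only an illustration of the two index patterns; it does not describe how the C3x address generators are built, and the function names (bit_reverse, bit_reversed_order, circular_step) are hypothetical and do not come from TI's tools.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list:
    """Index order used to shuffle data for an n-point radix-2 FFT.

    Assumes n is a power of two.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


def circular_step(offset: int, step: int, block_size: int) -> int:
    """Advance an offset inside a circular buffer, wrapping at block_size."""
    return (offset + step) % block_size


# Example: an 8-point FFT visits indices in this order.
print(bit_reversed_order(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
# Example: stepping by 3 from offset 6 in an 8-word buffer wraps to 1.
print(circular_step(6, 3, 8))  # 1
```

In hardware, the same wraparound and bit reversal happen as part of address generation, so they cost no extra instructions; the model above only shows which addresses result.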
DSP directory, 32 bits
Texas Instruments TMS320C4x

DSP has 32-bit, floating-point architecture. Device features seven internal buses. Family members have five-port register files. Device features a 128-word cache.

The C4x has seven internal buses and on-chip memories that help deliver single-cycle execution when walking through X and Y memories for a series of multiply-accumulate (MAC) operations. Rather than time-sharing a single bus system, the C4x features separate buses for program and two data fetches. Additionally, the C4x has a floating-point-unit multiplier, an ALU, and a barrel shifter for parallel operations. The C4x also performs 32-bit, fixed-point math based on either 32-bit memory values or the 32-bit mantissa of its 40-bit floating-point registers.

Program and data occupy a unified address space that you can configure according to your memory requirements. The local and global buses have different memory block assignments within each memory space.

Six 8-bit independent communications ports support point-to-point communications with networks of C40s and peripherals. (The C44 has only four ports.) Each port comprises eight data pins and four handshake signals. These ports free the 32-bit local and global external-memory buses for program or data accesses to the processor’s 4G-word address space (C40 and C44 only). A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU’s sequential threads. Such data movements do not overload the DSP with servicing overhead, although some data contention for memory may slow CPU execution. I/O can also use the external buses.

A 128-word cache enables the processor to deliver single-cycle pipelined execution and still use slower external memory. The CPU accesses an instruction from external memory and automatically loads the instruction into cache, which is divided into four 32-word segments or lines. (It does not use the cache with internal memory.) Key inner routines fill the cache as they run. The CPU uses a least recently used algorithm to select the cache segment for the new instructions. You can freeze a segment in the cache by setting cache-freeze bits in the CPU-status register.

Addressing modes—The C4x supports register-direct, register-indirect, paged-memory-direct, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.

Special instructions—The C4x performs single or block instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard/delayed branches, interlocked access for multiprocessing (load/store integer or floating-point value and signal interlocked), reciprocal and reciprocal square-root seed, conversion of floating point to integer and vice versa, and conversion to and from IEEE floating-point formats. You can also specify some instructions to execute in parallel.

Support—Development system includes scan-based emulation via the C4x’s JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can string multiple C4x chips on a JTAG circuit for parallel debugging, and you can single-step them all in lock step. One processor breakpoint can halt execution in an array of C4x chips. Software tools include a C compiler, an assembler/linker, a source-level debugger for parallel debugging, and a simulator. TI offers an application library, as well as a variety of hardware tools. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Third-party support includes the Spox, Virtuoso, Parallel C, and Helios OSs.
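The cache behavior that the C4x entry describes (four 32-word segments, loaded as instructions arrive from external memory, with least-recently-used replacement and a freeze option) can be modeled in a few lines. The Python sketch below is a simplified software model under stated assumptions, not TI's implementation: the class name SegmentedLruCache is hypothetical, a segment is identified simply by address divided by 32, and a single `frozen` flag stands in for the cache-freeze bits.

```python
from collections import OrderedDict

SEGMENTS = 4            # four segments, per the directory entry
WORDS_PER_SEGMENT = 32  # 32 words each, 128 words in total


class SegmentedLruCache:
    """Toy model of a 128-word instruction cache with LRU segments."""

    def __init__(self):
        # Maps segment tag -> set of cached word offsets.
        # Insertion order doubles as recency: oldest entry first.
        self.segments = OrderedDict()
        self.frozen = False  # assumption: one flag models the freeze bits
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> bool:
        """Fetch one instruction word; return True on a cache hit."""
        tag, offset = divmod(address, WORDS_PER_SEGMENT)
        words = self.segments.get(tag)
        if words is not None and offset in words:
            self.segments.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if self.frozen:
            return False  # frozen cache: contents stay as they are
        if words is None:
            if len(self.segments) == SEGMENTS:
                self.segments.popitem(last=False)  # evict the LRU segment
            words = self.segments[tag] = set()
        words.add(offset)  # the word loads as it arrives from memory
        self.segments.move_to_end(tag)
        return False


# A 40-instruction inner loop misses on its first pass, then runs from cache.
cache = SegmentedLruCache()
for _ in range(2):
    for address in range(40):
        cache.fetch(address)
print(cache.misses, cache.hits)  # 40 40
```

The usage example shows why key inner routines benefit: after the first pass fills the cache, later passes no longer wait on slower external memory.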
DSP directory, 64 bits
Module Research Center’s NeuroMatrix NM6403 DSP

Processor is dual-core and superscalar. Device supports vector-to-vector, vector-to-matrix, and matrix-to-matrix multiplication. Processor includes the 64-bit Vector coprocessor. Device provides on-chip saturation.

The NeuroMatrix NM6403 DSP is a dual-core, superscalar processor that includes the 64-bit Vector coprocessor. It provides scalable performance, a programmable operand width of 1 to 32 bits, and operation as fast as 50 MHz. It provides on-chip saturation and supports vector-vector, vector-matrix, and matrix-matrix multiplication.

Module’s NM6403 is the first “dual-core,” application-specific DSP processor based on the NeuroMatrix architecture. The NM6403 processor comprises a 32-bit RISC core and a patent-pending, 64-bit Vector coprocessor that supports vector operations with elements of variable bit length. The RISC core has a five-stage pipeline that operates with 32- and 64-bit-wide instructions. (Each instruction usually executes two operations.) Two 64-bit interfaces support SRAM, DRAM, and extended-data-out DRAM and comprise two separate address-generation units that can access as much as 16 Gbytes of address space. Two identical programmable interfaces work with any memory types. Each interface supports two memory banks and can function in a “shared-memory” mode. Two DMA coprocessors transfer data between high-speed I/O communication ports and external memory, and two communication ports are hardware-compatible with TI’s TMS320C4x. This approach allows you to build multiprocessor systems.

The Vector coprocessor’s core looks like an array multiplier, which the company calls the NeuroMatrix engine. The structure comprises cells that include a 1-bit memory (flip-flop) surrounded by several logical elements. You can combine the cells into several macrocells by using two 64-bit programmable registers. These registers define the borders between rows and columns with macrocells. Each macrocell performs the multiplication on variable input words using preloaded coefficients and accumulates the result from the macrocells in the column above it. The columns simultaneously calculate the results in one processor cycle. The engine’s configuration can change dynamically during the calculations. This flexibility allows designers to trade precision with performance to suit their applications. You can start the application with maximum precision and minimum performance and then dynamically increase the performance by reducing the data-word lengths.

The Vector coprocessor works on packed integer data comprising 64-bit blocks in the form of variable 1- to 64-bit words. The number of multiplications/accumulations depends on the length and number of words packaged into a 64-bit block. For 8-bit data and coefficients, the Vector coprocessor performs 24 multiplication/accumulations with 21-bit results in one 20-nsec processor cycle. To avoid arithmetic overflow, the NM6403 uses two types of saturation functions with user-programmable saturation boundaries.

Addressing modes—The NM6403 supports immediate with 32 bits, base, indexed, and relative addressing modes.

Special instructions—The processor uses vector instructions for efficient work on packets of as many as 32 64-bit data words. This type of instruction may define such operations as matrix-matrix, matrix-vector, or vector-vector multiplication, vector-vector addition/subtraction with saturation of results, and block moving. The hardware also supports vector-matrix or matrix-matrix multiplication. The NM6403 has conditional branch, call, and return instructions.

Support—The development tools for PCs cost $1995 and include an ANSI X3J16/95-0029 preliminary-standard-compatible C++ compiler, an assembler, a linker, a source-level debugger, an instruction-level simulator, a cycle-accurate simulator, a load/exchange library, and a set of application-specific vector-matrix libraries. The vector-matrix library simplifies C-language programming and allows you to design DSP applications such as FFT, Sobel, and Hadamard Transform. RC Module also offers a dual CPU PCI evaluation/development board for $1495. Module also provides an NM6403 Verilog model for Sun host platforms for system-level simulation and pc-board design.
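To make the packed-data and saturation ideas in the NM6403 entry concrete, the Python sketch below packs variable-width signed words into one 64-bit block, clamps values to programmable boundaries, and works out the peak rate implied by the quoted figures (24 multiply-accumulates per 20-nsec cycle). It is an illustration under assumptions, not Module's data format: the field layout (first word in the least-significant bits), the two's-complement encoding, and the function names are all hypothetical.

```python
def saturate(value: int, low: int, high: int) -> int:
    """Clamp value to the programmable boundaries [low, high]."""
    return max(low, min(high, value))


def field_range(width: int):
    """Signed two's-complement range of a field `width` bits wide."""
    return -(1 << (width - 1)), (1 << (width - 1)) - 1


def pack(values, widths) -> int:
    """Pack signed values into one 64-bit block, saturating each field.

    Assumed layout: the first value occupies the least-significant bits.
    """
    if len(values) != len(widths) or sum(widths) != 64:
        raise ValueError("widths must match values and total 64 bits")
    word, shift = 0, 0
    for value, width in zip(values, widths):
        low, high = field_range(width)
        clamped = saturate(value, low, high)
        word |= (clamped & ((1 << width) - 1)) << shift
        shift += width
    return word


def unpack(word: int, widths) -> list:
    """Recover the signed fields from a packed 64-bit block."""
    values = []
    for width in widths:
        field = word & ((1 << width) - 1)
        if field >= 1 << (width - 1):  # sign bit set: negative value
            field -= 1 << width
        values.append(field)
        word >>= width
    return values


def peak_macs_per_second(macs_per_cycle: int, cycle_ns: float) -> float:
    """Peak multiply-accumulate rate for a given cycle time."""
    return macs_per_cycle * 1e9 / cycle_ns


# Eight 8-bit words fit in one block; 300 saturates to 127.
block = pack([127, -128, 5, -1, 0, 1, -2, 300], [8] * 8)
print(unpack(block, [8] * 8))        # [127, -128, 5, -1, 0, 1, -2, 127]
# 24 MACs in a 20-nsec cycle is 1.2 billion MACs per second.
print(peak_macs_per_second(24, 20))  # 1200000000.0
```

Narrower fields let more words share a block, which is the precision-for-performance trade the entry describes: halving the word width doubles the number of words per 64-bit block.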