
High Performance DSP Architectures

CHAPTER 1
EVOLUTION OF DSP PROCESSORS
INTRODUCTION

Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks. These devices have seen tremendous growth in the last decade, finding use in everything from cellular telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing.

DSP has become a key component in many consumer, communications, medical, and
industrial products. These products use a variety of hardware approaches to implement
DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate
arrays (FPGAs) to custom integrated circuits (ICs). Programmable “DSP processors,” a
class of microprocessors optimized for DSP, are a popular solution for several reasons.

In comparison to fixed-function solutions, they have the advantage of potentially being reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-effective than custom hardware, particularly for low-volume applications, where the development cost of custom ICs may be prohibitive. Compared with general-purpose microprocessors, DSP processors also often have an advantage in terms of speed, cost, and energy efficiency.

DSP ALGORITHMS MOULD DSP ARCHITECTURES

From the outset, DSP processor architectures have been moulded by DSP algorithms. For
nearly every feature found in a DSP processor, there are associated DSP algorithms whose
computation is in some way eased by inclusion of this feature. Therefore, perhaps the best
way to understand the evolution of DSP architectures is to examine typical DSP
algorithms and identify how their computational requirements have influenced the
architectures of DSP processors.

FAST MULTIPLIERS

The FIR filter is mathematically expressed as a dot product between a vector of input data and a vector of filter coefficients: for each "tap" of the filter, a data sample is multiplied by a filter coefficient, and the result is added to a running sum over all of the taps. Hence, the main component of the FIR filter algorithm is the multiply-accumulate: multiply and add, multiply and add. These operations are not unique to the FIR filter; in fact, multiplication is one of the most common operations performed in signal processing, and convolution, IIR filtering, and Fourier transforms all involve heavy use of multiply-accumulate operations. Originally, microprocessors implemented multiplications as a series of shift and add operations, each of which consumed one or more clock cycles. As might be expected, faster multiplication hardware yields faster performance in many DSP algorithms, and for this reason all modern DSP processors include at least one dedicated single-cycle multiplier or combined multiply-accumulate (MAC) unit.
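The tap-by-tap multiply-accumulate described above can be sketched as follows (a plain Python illustration; the function and variable names are ours, not from any particular DSP's instruction set):

```python
def fir_filter(samples, coeffs):
    """Compute one FIR output per input sample.

    For each output, every filter tap performs one
    multiply-accumulate: acc += sample * coefficient.
    """
    n_taps = len(coeffs)
    outputs = []
    for i in range(n_taps - 1, len(samples)):
        acc = 0
        for k in range(n_taps):
            # This inner multiply-add is the operation a DSP's
            # single-cycle MAC unit is built to execute.
            acc += samples[i - k] * coeffs[k]
        outputs.append(acc)
    return outputs

# 3-tap filter with unit coefficients: y[i] = x[i] + x[i-1] + x[i-2]
print(fir_filter([1, 2, 3, 4, 5], [1, 1, 1]))  # [6, 9, 12]
```

On a DSP processor, the entire inner loop body would ideally retire in one cycle per tap; on a processor without a hardware multiplier, each multiplication alone could cost many cycles.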

Department Of Electronics & Communication Engineering, GEC Thrissur. 1


High Performance DSP Architectures

MULTIPLE EXECUTION UNITS

DSP applications typically have very high computational requirements in comparison to other types of computing tasks, since they often must execute DSP algorithms in real time on lengthy segments of signals sampled at 10-100 kHz or higher. Hence, DSP processors often include several independent execution units that are capable of operating in parallel; for example, in addition to the MAC unit, they typically contain an arithmetic-logic unit (ALU) and a shifter.

EFFICIENT MEMORY ACCESSES

Executing a MAC in every clock cycle requires more than just a single-cycle MAC unit. It also requires the ability to fetch the MAC instruction, a data sample, and a filter coefficient from memory in a single cycle. To address the need for increased memory bandwidth, early DSP processors developed memory architectures that could support multiple memory accesses per cycle. Often, instructions were stored in one memory bank, while data was stored in another. With this arrangement, the processor could fetch an instruction and a data operand in parallel in every cycle.

Since many DSP algorithms consume two data operands per instruction, a further optimization commonly used is to include a small bank of RAM near the processor core that is used as an instruction cache. When a small group of instructions is executed repeatedly, the cache is loaded with those instructions, freeing the instruction bus to be used for data fetches instead of instruction fetches and thus enabling the processor to execute a MAC in a single cycle. High memory bandwidth requirements are often further supported by dedicated hardware for calculating memory addresses. These address generation units operate in parallel with the DSP processor's main execution units, enabling the processor to access data at new locations in memory without pausing to calculate the new address.

Memory accesses in DSP algorithms tend to exhibit very predictable patterns; in an FIR filter, for example, the filter coefficients are accessed sequentially from start to finish for each sample, and the accesses then start over from the beginning of the coefficient vector when the next input sample is processed. DSP processor address generation units take advantage of this predictability by supporting specialized addressing modes that enable the processor to efficiently access data in the patterns commonly found in DSP algorithms. The most common of these modes is register-indirect addressing with post-increment, which automatically increments the address pointer and is used for algorithms that perform repetitive computations on a series of data stored sequentially in memory. Many DSP processors also support "circular addressing," which allows the processor to access a block of data sequentially and then automatically wrap around to the beginning address, exactly the pattern used to access coefficients in FIR filtering. Circular addressing is also very helpful in implementing first-in, first-out buffers, commonly used for I/O and for FIR filter delay lines.
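A software model of these two addressing modes might look like the following (a hedged sketch: real address generation units do this in hardware, in parallel with the datapath, and the class and method names here are invented for illustration):

```python
class AddressGenerator:
    """Models post-increment and circular (modulo) addressing."""

    def __init__(self, base, length):
        self.base = base      # start address of the buffer
        self.length = length  # buffer length in words
        self.ptr = base       # current address pointer

    def post_increment(self):
        """Return the current address, then advance the pointer."""
        addr = self.ptr
        self.ptr += 1
        return addr

    def circular_increment(self):
        """Return the current address, then advance with wrap-around,
        as used for FIR coefficient access and FIFO delay lines."""
        addr = self.ptr
        self.ptr = self.base + (self.ptr - self.base + 1) % self.length
        return addr

agu = AddressGenerator(base=0x100, length=4)
addrs = [agu.circular_increment() for _ in range(6)]
print([hex(a) for a in addrs])  # wraps after 0x103 back to 0x100
```

The modulo arithmetic that implements the wrap-around is exactly what the hardware performs for free on every access, with no extra cycles spent by the execution units.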


DATA FORMAT

DSP applications typically must pay careful attention to numeric fidelity. Since numeric
fidelity is far more easily maintained using a floating point format, it may seem surprising
that most DSP processors use a fixed-point format. DSP processors tend to use the shortest
data word that will provide adequate accuracy in their target applications. Most fixed-point
DSP processors use 16-bit data words, because that data word width is sufficient for many
DSP applications. A few fixed-point DSP processors use 20, 24, or even 32 bits to enable
better accuracy in applications that are difficult to implement well with 16-bit data, such as
high-fidelity audio processing.

To ensure adequate signal quality while using fixed-point data, DSP processors typically
include specialized hardware to help programmers maintain numeric fidelity throughout a
series of computations. For example, most DSP processors include one or more
“accumulator” registers to hold the results of summing several multiplication products.
Accumulator registers are typically wider than other registers; they often provide extra bits,
called “guard bits,” to extend the range of values that can be represented and thus avoid
overflow. In addition, DSP processors usually include good support for saturation
arithmetic, rounding, and shifting, all of which are useful for maintaining numeric fidelity.
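The guard-bit and saturation behaviour described above can be modelled in software. The sketch below assumes a 16-bit data word and a 40-bit accumulator (a common arrangement: 32 product bits plus 8 guard bits); the constants and function names are ours, for illustration:

```python
INT16_MAX, INT16_MIN = 2**15 - 1, -2**15
ACC_MAX, ACC_MIN = 2**39 - 1, -2**39   # 40-bit accumulator range

def mac(acc, x, h):
    """Multiply two 16-bit operands and add to the accumulator.

    The 8 guard bits let hundreds of full-scale 32-bit products
    be summed before the 40-bit accumulator can overflow.
    """
    acc += x * h
    assert ACC_MIN <= acc <= ACC_MAX, "accumulator overflow"
    return acc

def saturate16(value):
    """Clamp a result to the 16-bit output range instead of
    letting it wrap around (saturation arithmetic)."""
    return max(INT16_MIN, min(INT16_MAX, value))

# The sum of products overflows 16 bits long before it threatens the
# 40-bit accumulator; saturation then pins the result at full scale.
acc = 0
for _ in range(4):
    acc = mac(acc, 30000, 30000)
print(acc, saturate16(acc))  # 3600000000 32767
```

Without guard bits, the running sum would have wrapped after the first product; without saturation, writing the final result back to a 16-bit word would silently produce a wildly wrong value rather than a clipped one.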

ZERO-OVERHEAD LOOPING

DSP algorithms typically spend the vast majority of their processing time in relatively
small sections of software that are executed repeatedly; i.e., in loops. Hence, most DSP
processors provide special support for efficient looping. Often, a special loop or repeat
instruction is provided which allows the programmer to implement a for-next loop without
expending any clock cycles for updating and testing the loop counter or branching back to
the top of the loop. This feature is often referred to as “Zero-overhead looping.”

STREAMLINED I/O

Finally, to allow low-cost, high-performance input and output, most DSP processors
incorporate one or more specialized serial or parallel I/O interfaces, and streamlined I/O
handling mechanisms, such as low-overhead interrupts and direct memory access (DMA),
to allow data transfers to proceed with little or no intervention from the processor's
computational units.

SPECIALIZED INSTRUCTION SETS

DSP processor instruction sets have traditionally been designed with two goals in mind.
The first is to make maximum use of the processor's underlying hardware, thus increasing
efficiency. The second goal is to minimize the amount of memory space required to store
DSP programs, since DSP applications are often quite cost-sensitive and the cost of
memory contributes substantially to overall chip and/or system cost. To accomplish the
first goal, conventional DSP processor instruction sets generally allow the programmer to
specify several parallel operations in a single instruction, typically including one or two
data fetches from memory in parallel with the main arithmetic operation. With the second
goal in mind, instructions are kept short by restricting which registers can be used with
which operations, and restricting which operations can be combined in an instruction.


CHAPTER 2
TRADITIONAL SOLUTIONS FOR REAL TIME PROCESSING
DSP architecture designs have traditionally focused on meeting real-time constraints. Advanced signal processing algorithms, such as those in base station receivers, present difficulties to the designer due to the complexity of the algorithms, higher data rates and the desire for more channels per hardware module. A key constraint from the manufacturing point of view is attaining a high channel density.

Traditionally, real-time architecture designs employ a mix of DSPs, co-processors, FPGAs, ASICs and Application Specific Standard Parts (ASSPs) for meeting real-time requirements in high-performance applications. The chip-rate processing is handled by the ASSP, ASIC or FPGA, while the DSPs handle the symbol-rate processing and use co-processors for decoding. The DSP can also implement parts of the MAC layers and control protocols, or can be assisted by a RISC processor.

However, dynamic variations in the system workload, such as variations in the number of users in wireless base stations, require a dynamic re-partitioning of the algorithms, which may not be possible to implement in traditional FPGAs and ASICs in real time.

LIMITATIONS OF SINGLE PROCESSOR DSP ARCHITECTURES


Single-processor DSPs can have only a limited number of arithmetic units and cannot directly extend their architectures to hundreds of arithmetic units. This is because, as the number of arithmetic units in an architecture increases, the size of the register files and the port interconnections start to dominate the architecture.

PROGRAMMABLE MULTIPROCESSOR DSP ARCHITECTURES


Multiprocessor architectures can be classified into Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) architectures. Data-parallel DSPs exploit data parallelism, instruction-level parallelism and subword parallelism. Other levels of parallelism, such as thread-level parallelism, exist and can be considered after this architecture space has been fully studied and explored.


MULTI-CHIP MIMD PROCESSORS


Each processor in a loosely coupled system has a set of I/O devices and a large local memory. Processors communicate by exchanging messages using some form of message-transfer system. Loosely coupled systems are efficient when interaction between tasks is minimal. The tradeoffs of this processor design have been the increase in programming complexity and the need for high I/O bandwidth and inter-processor support. Such MIMD solutions are also difficult to scale with the number of processors, e.g., the TI TMS320C4x.

(Figure: register file explosion in traditional DSPs with centralized register files.)

The disadvantages of the multi-chip MIMD model and architectures are the following:
1. Load balancing for such MIMD architectures is not straightforward, as in heterogeneous systems. This makes it difficult to partition algorithms on this architecture model, especially when the workload changes dynamically.
2. The loosely coupled model is not scalable with the number of processors, due to interconnection and I/O bandwidth issues.
3. I/O impacts the real-time performance and power consumption of the architecture.
4. Designing a compiler for a MIMD model on a loosely coupled architecture is difficult, and the burden of deciding how to partition algorithms across the multiprocessor is left to the programmer.

SINGLE-CHIP MIMD PROCESSORS


Single-chip MIMD processors can be classified into three categories: single-threaded chip multiprocessors (CMPs), multi-threaded multiprocessors (MTs) and clustered VLIW architectures. A CMP integrates two or more complete processors on a single chip; every unit of a processor is therefore duplicated and used independently of its copies.


In contrast, a multi-threaded processor interleaves the execution of instructions from various threads of control in the same pipeline. Multiple program counters are therefore available in the fetch unit, and multiple contexts are stored in multiple register sets on the chip. Multi-threading increases the instruction-level parallelism available to the arithmetic units by providing access to more than a single independent instruction stream. The programmer is responsible for scheduling the threads of the application.

Clustered VLIW architectures solve the register file explosion problem by employing clusters of functional units and register files. Clustering improves cycle time in two ways: by reducing the distance signals have to travel within a cycle and by reducing the load on the bus. Clustering is beneficial for applications which have limited inter-cluster communication. However, compiling for clustered VLIW architectures can be difficult, since the compiler must schedule across the various clusters while minimizing inter-cluster operations and their latency.

Although single chip MIMD architectures eliminate the I/O bottleneck between multiple
processors, the load balancing and architecture scaling issues still remain. The availability
of data parallelism in signal processing applications is not utilized efficiently in MIMD
architectures.

SIMD ARRAY PROCESSORS


SIMD processing refers to a set of identical processing elements that execute the same instruction but work on different sets of data in parallel. An SIMD array processor refers to processor designs targeted toward the implementation of arrays or matrices. Various interconnection methodologies are used for array processors, such as linear arrays (vectors), rings, stars, trees, meshes, systolic arrays and hypercubes; examples include the Illiac-IV and the Burroughs Scientific Processor (BSP). Although vector processors have been the most popular version of array processors, mesh-based processors are still being used in scientific computing.

SIMD VECTOR PROCESSORS


Data parallelism allows vector processors to approach the performance and power efficiency of custom designs, while simultaneously providing the flexibility of a programmable processor. Vector machines were the first attempt at building supercomputers, starting from the Cray-1. These processors executed vector instructions, such as vector adds and multiplications, out of a vector register file. The number of memory banks is equal to the number of processors, so that all processors can access memory in parallel.


DATA-PARALLEL DSPS
Data-parallel DSPs are architectures that exploit instruction-level parallelism. Stream processors are state-of-the-art programmable architectures aimed at media-processing applications. Stream processors enhance data-parallel DSPs by providing a bandwidth hierarchy for the data flow in signal processing applications, which enables support for hundreds of arithmetic units in the data-parallel DSP.

PIPELINING MULTIPLE PROCESSORS


An alternative method to attain high data rates is to provide multiple processors that are pipelined. Such processors can take advantage of the streaming flow of data through the system. The disadvantages of such a design are that the architecture must be carefully designed to match the system throughput and is not flexible enough to adapt to changes in the system workload. Such a pipelined system would also be difficult to program and would suffer from I/O bottlenecks unless implemented as a SoC. However, this is the only way to provide the desired system performance if the amount of parallelism that can be exploited otherwise does not meet the system requirements.


CHAPTER 3
CURRENT DSP LANDSCAPE

CONVENTIONAL DSP PROCESSORS

The performance and price range among DSP processors is very wide. In the low-cost, low-performance range are the industry workhorses, which are based on conventional DSP architectures. They issue and execute one instruction per clock cycle, and use the complex, multi-operation type of instructions described earlier. These processors typically include a single multiplier or MAC unit and an ALU, but few additional execution units, if any. Included in this group are Analog Devices' ADSP-21xx family, Texas Instruments' TMS320C2xx family, and Motorola's DSP560xx family. These processors generally operate at around 20-50 MHz, and provide good DSP performance while maintaining very modest power consumption and memory usage. Midrange DSP processors achieve higher performance than the low-cost DSPs described above through a combination of increased clock speeds and somewhat more sophisticated architectures.

ENHANCED CONVENTIONAL DSP PROCESSORS

DSP processor architects improved performance by extending conventional DSP architectures with parallel execution units, typically a second multiplier and adder. These hardware enhancements are combined with an extended instruction set that takes advantage of the additional hardware by allowing more operations to be encoded in a single instruction and executed in parallel. We refer to this type of processor as an "enhanced-conventional DSP processor," because it is based on the conventional DSP processor architectural style rather than being an entirely new approach. With this increased parallelism, enhanced-conventional DSP processors can execute significantly more work per clock cycle; for example, two MACs per cycle instead of one.
Enhanced-conventional DSP processors typically have wider data buses to allow them to retrieve more data words per clock cycle in order to keep the additional execution units fed. They may also use wider instruction words to accommodate the specification of additional parallel operations within a single instruction.

MULTI-ISSUE ARCHITECTURES

With the goals of achieving high performance and creating an architecture that lends itself
to the use of compilers, some newer DSP processors use a “multi-issue” approach.


In contrast to conventional and enhanced-conventional processors, multi-issue processors use very simple instructions that typically encode a single operation. These processors achieve a high level of parallelism by issuing and executing instructions in parallel groups rather than one at a time. Using simple instructions simplifies instruction decoding and execution, allowing multi-issue processors to execute at higher clock rates than conventional or enhanced-conventional DSP processors (e.g., the TMS320C62xx). The two classes of architectures that execute multiple instructions in parallel are referred to as VLIW and superscalar. These architectures are quite similar, differing mainly in how instructions are grouped for parallel execution.

VLIW and superscalar architectures provide many execution units, each of which executes its own instruction. VLIW DSP processors typically issue a maximum of between four and eight instructions per clock cycle, which are fetched and issued as part of one long super-instruction, hence the name "Very Long Instruction Word." Superscalar processors typically
issue and execute fewer instructions per cycle, usually between two and four. In a VLIW
architecture, the assembly language programmer specifies which instructions will be
executed in parallel. Hence, instructions are grouped at the time the program is assembled,
and the grouping does not change during program execution. Superscalar processors, in
contrast, contain specialized hardware that determines which instructions will be executed
in parallel based on data dependencies and resource contention, shifting the burden of
scheduling parallel instructions from the programmer to the processor. The processor may
group the same set of instructions differently at different times in the program's execution;
for example, it may group instructions one way the first time it executes a loop, then group
them differently for subsequent iterations. The difference in the way these two types of
architectures schedule instructions for parallel execution is important in the context of
using them in real-time DSP applications. Because superscalar processors dynamically
schedule parallel operations, it may be difficult for the programmer to predict exactly how
long a given segment of software will take to execute. The execution time may vary based
on the particular data accessed, whether the processor is executing a loop for the first time
or the third, or whether it has just finished processing an interrupt, for example. Dynamic
features also complicate software optimization. As a rule, DSP processors have
traditionally avoided dynamic features for just these reasons; this may be why there is
currently only one example of a commercially available superscalar DSP processor.

In VLIW architectures, a wide instruction word may be required in order to specify information about which functional unit will execute the instruction. Wider instructions allow the use of larger, more uniform register sets, which in turn enables higher performance. There are disadvantages, however, to using wide, simple instructions. Since each VLIW instruction is simpler than a conventional DSP processor instruction, VLIW processors tend to require many more instructions to perform a given task. Combined with the fact that the instruction words are typically wider than those found on conventional DSP processors, this characteristic results in relatively high program memory usage. High program memory usage, in turn, may result in higher chip or system cost because of the need for additional ROM or RAM.

VLIW processors typically use either wide buses or a large number of buses to access data
memory and keep the multiple execution units fed with data. The architectures of VLIW
DSP processors are in some ways more like those of general-purpose processors than like
those of the highly specialized conventional DSP architectures.


VLIW and superscalar processors often suffer from high energy consumption relative to conventional DSP processors; in general, multi-issue processors are designed with an emphasis on increased speed rather than energy efficiency. These processors often have more execution units active in parallel than conventional DSP processors, and they require wide on-chip buses and memory banks to accommodate multiple parallel instructions and to keep the multiple execution units supplied with data, all of which contribute to increased energy consumption.

Because they often have high memory usage and energy consumption, VLIW
and superscalar processors have mainly targeted applications which have very demanding
computational requirements but are not very sensitive to cost or energy efficiency. For
example, a VLIW processor might be used in a cellular base station, but not in a portable
cellular phone.

On DSP processors with SIMD capabilities, the underlying hardware that supports SIMD operations varies widely. Analog Devices, for example, modified its basic conventional floating-point DSP architecture, the ADSP-2106x, by adding a second set of execution units that exactly duplicates the original set. The augmented architecture can issue a single instruction and execute it in parallel in both sets of execution units using different data, effectively doubling performance in some algorithms.

In contrast, instead of having multiple sets of the same execution units, some DSP processors can logically split their execution units into multiple sub-units that process narrower operands. These processors treat operands in long registers as multiple short operands. Perhaps the most extensive SIMD capabilities seen in a DSP processor to date are found in Analog Devices' TigerSHARC processor. TigerSHARC is a VLIW architecture, and combines the two types of SIMD: one instruction can control execution of the processor's two sets of execution units, and this instruction can specify a split-execution-unit operation that will be executed in each set. Using this hierarchical SIMD capability, TigerSHARC can execute eight 16-bit multiplications per cycle. SIMD is only effective in algorithms that can process data in parallel; for algorithms that are inherently serial, SIMD is generally not of use.
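Split-execution-unit (sub-word) SIMD can be modelled as follows. The sketch packs two 16-bit operands into one 32-bit word and adds both lanes with a single operation; it is a generic illustration, not the instruction set of any particular processor:

```python
MASK16 = 0xFFFF

def pack2x16(hi, lo):
    """Pack two unsigned 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add2x16(a, b):
    """Add two packed words lane by lane: one 'instruction',
    two independent 16-bit additions. Each lane wraps on its own,
    so a carry never leaks from the low lane into the high lane."""
    hi = ((a >> 16) + (b >> 16)) & MASK16
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    return (hi << 16) | lo

a = pack2x16(1000, 2000)
b = pack2x16(30, 40)
result = simd_add2x16(a, b)
print(result >> 16, result & MASK16)  # 1030 2040
```

In hardware, the per-lane masking is implemented simply by breaking the carry chain of a wide adder at the lane boundary, which is why sub-word SIMD is nearly free in silicon area.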


CHAPTER 4
DIVERGING ARCHITECTURES
Until recently, DSP processor designs were improved primarily by incremental enhancements; new DSPs tended to maintain a close resemblance to their predecessors. In the last couple of years, however, DSP architectures have become much more interesting, with a number of vendors announcing new architectures that are completely different from preceding generations.

HIGH-PERFORMANCE DSPS

Processor designers who want higher DSP performance than can be squeezed out of
traditional architectures have come up with a variety of performance-boosting strategies.
The main idea is that if you want to improve performance beyond the increase afforded by
faster clock speeds, you need to increase the amount of useful work that gets done every
clock cycle. This is accomplished by increasing the number of operations that are
performed in parallel, which can be implemented in two main ways: by increasing the
number of operations performed by each instruction, or by increasing the number of
instructions that are executed in every instruction cycle.

INCREASING THE WORK PERFORMED BY EACH INSTRUCTION

Traditionally, DSP processors have used complex, compound instructions that allow the
programmer to encode multiple operations in a single instruction. In addition, DSP
processors traditionally issue and execute only one instruction per instruction cycle. This
single-issue, complex-instruction approach allows DSP processors to achieve very strong
DSP performance without requiring a large amount of program memory.

One method of increasing the amount of work performed by each instruction, while maintaining the basics of the traditional DSP architecture and instruction set described above, is to augment the data path with extra execution units. We refer to processors that follow this approach as "enhanced conventional DSPs"; their basic architecture is similar to previous generations of DSPs, but has been enhanced by adding execution units.

Lucent Technologies' DSP16000 architecture is based on that of the earlier DSP1600, but Lucent added a second multiplier, an adder, and a bit manipulation unit. To support more parallel operations and keep the processor from starving for data, Lucent also increased the data bus widths to 32 bits. The net result is a processor that is able to sustain a throughput of two multiply-accumulates per instruction cycle.

EXECUTING MULTIPLE INSTRUCTIONS / CYCLE


A few designers have opted for a more RISC-like instruction set coupled with an architecture that supports execution of multiple instructions in every instruction cycle, e.g., the TMS320C62xx family. In TI's version, the processor fetches a 256-bit instruction "packet," parses the packet into eight 32-bit instructions, and routes them to its eight independent execution units.
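The packet-parsing step can be illustrated with a few lines of bit manipulation. This is a simplified model; the real 'C62xx additionally uses bits within each instruction to mark which ones may execute in parallel:

```python
def split_packet(packet_256bit):
    """Split a 256-bit fetch packet into eight 32-bit instruction
    words, most-significant instruction first."""
    return [(packet_256bit >> (32 * (7 - i))) & 0xFFFFFFFF
            for i in range(8)]

# Build a packet from eight dummy 32-bit instruction words...
words = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80]
packet = 0
for w in words:
    packet = (packet << 32) | w

# ...and recover them again by slicing the wide word.
print(split_packet(packet) == words)  # True
```

The hardware dispatcher does this slicing in a single cycle, routing each 32-bit slice to one of the eight execution units.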

VLIW processors typically suffer from high program memory requirements and high
power consumption. Like VLIW processors, superscalar processors issue and execute
multiple instructions in parallel. Unlike VLIW processors, in which the programmer
explicitly specifies which instructions will be executed in parallel, superscalar processors


use dynamic instruction scheduling to determine "on the fly" which instructions will be executed concurrently, based on the processor's available resources, on data dependencies, and on a variety of other factors. Superscalar architectures have long been used in high-performance general-purpose processors such as the Pentium and PowerPC.

CIRCULAR BUFFERING

In off-line processing, the entire input signal resides in the computer at the same time. The key point is that all of the information is simultaneously available to the processing program. This is common in scientific research and engineering, but not in consumer products. Off-line processing is the realm of personal computers and mainframes.

In real-time processing, the output signal is produced at the same time that the input signal is being acquired. To calculate an output FIR sample, we must have access to a certain number of the most recent samples from the input. When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead. Circular buffers are efficient because only one value needs to be changed when a new sample is acquired.

Four parameters are needed to manage a circular buffer. First, there must be a pointer that indicates the start of the circular buffer in memory. Second, there must be a pointer indicating the end of the array, or a variable that holds its length. Third, the step size of the memory addressing must be specified. These three values define the size and configuration of the circular buffer, and will not change during program operation. The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired. In other words, there must be program logic that controls how this fourth value is updated based on the first three values.
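The four parameters above can be sketched directly as code (the class and method names are ours, and memory is modelled as a plain dictionary for clarity):

```python
class CircularBuffer:
    """FIR delay line managed by the four parameters described
    above: start address, length, step size, and the pointer to
    the newest sample. Only the fourth changes at run time."""

    def __init__(self, start, length, step=1):
        self.start = start            # 1. start of the buffer
        self.length = length          # 2. buffer length (or end pointer)
        self.step = step              # 3. addressing step size
        self.newest = start           # 4. pointer to the newest sample
        self.mem = {start + i: 0 for i in range(length)}

    def acquire(self, sample):
        """Overwrite the oldest sample and advance the pointer,
        wrapping back to the start when the end is reached."""
        offset = (self.newest - self.start + self.step) % self.length
        self.newest = self.start + offset
        self.mem[self.newest] = sample

buf = CircularBuffer(start=0x200, length=3)
for s in [10, 20, 30, 40]:
    buf.acquire(s)
# After four samples in a 3-word buffer, 40 has overwritten 10.
print(sorted(buf.mem.values()))  # [20, 30, 40]
```

Note that storing a new sample touches exactly one memory location, which is the efficiency argument made above; a linear buffer would instead shift every stored sample by one position.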

DSP/MICROCONTROLLER HYBRIDS
Many applications require a mixture of control-oriented software and DSP software. A
prime example is the digital cellular phone, which must implement both supervisory tasks
and voice-processing tasks. In general, microcontrollers provide good performance in
controller tasks and poor performance in DSP tasks; dedicated DSP processors have the
opposite characteristics. Hence, until recently, combination controller/signal processing
applications were typically implemented using two separate processors: a microcontroller
and a DSP.

Department Of Electronics & Communication Engineering, GEC Thrissur. 12


In the past couple of years, however, a number of microcontroller vendors have begun to
offer DSP-enhanced versions of their microcontrollers as an alternative to the dual-
processor solution.

Using a single processor to implement both types of software is attractive, because it can
potentially:

• simplify the design task
• save board space
• reduce total power consumption
• reduce overall system cost

Microcontroller vendors like Hitachi offer DSP-enhanced versions of their
microcontrollers. Hitachi's version of the SH-2, called the SH-DSP, adds a complete 16-bit
fixed-point DSP data path to the SH-2 core. ARM took a different approach and developed
a DSP co-processor, ``Piccolo,'' that is meant to be used as an add-on to their ARM7
microcontroller. Each processor has its own instruction set and processes its own
instruction stream, so it is possible for the two to operate in parallel, with the caveat that
Piccolo relies on the ARM7 to perform all data transfers.

RECONFIGURABLE ARCHITECTURES
Reconfigurable architectures are programmable architectures that change the hardware or
the interconnections dynamically so as to provide flexibility, with simultaneous benefits in
execution time due to the reconfiguration, as opposed to simply turning off units to
conserve power. There have been various approaches to providing and using this
reconfigurability in programmable architectures. The first is the 'FPGA+' approach, which
adds a number of high-level configurable functional blocks to a general-purpose device to
optimize it for a specific purpose such as wireless. The second is to develop a
reconfigurable system around a programmable ASSP. The third is based on a parallel array
of processors on a single die, connected by a reconfigurable fabric. These kinds of
architectures are still in the initial stages of their evolution.


CHAPTER 5
NOVEL DSP ARCHITECTURES
"POST-HARVARD" TECHNOLOGY

After remaining unchanged for more than a decade, DSP architectures have started to
evolve. They are even trying to encompass control operations. Conventional DSP
architecture typically uses Harvard-style architecture, with separate data and instruction
buses. Their main processing elements are a multiplier, an arithmetic logic unit (ALU), and
an accumulation register, allowing creation of a multiply-accumulate (MAC) unit that
accepts two operands. Depending on the processor, the operands may be 16-, 24-, 32-, or
48-bit words in either fixed-point or floating-point format. Whatever the word width, these
conventional DSPs offer fixed-width instructions, executing one instruction per clock
cycle.

Figure: The conventional DSP architecture uses separate data and memory buses and
features fixed-width instructions, executing one instruction per clock cycle.

The instructions themselves can be fairly complex. A single instruction may embody two
data moves, a MAC operation, and two pointer updates. These complex instructions help
the conventional DSP offer a high degree of code density when performing repeated
mathematical operations on arrays of numbers. As control devices, however, they leave
something to be desired. The fixed-width instructions are inefficient when tasked with
performing simple counter increments as part of a control loop, for instance. Even if the
counter is only going as high as 10, the processor needs to use the full word width for the
values. Conventional DSPs are also weak at bit-level data manipulation beyond bit shifting.
Still, because of their number-crunching proficiency, conventional DSPs soon gained
popularity in communications and media applications. The communications devices,
including modem and telephony processors, needed the computational power for echo
canceling, voice coding, and filtering. Media applications, including digital audio, video,
and imaging, needed computational power for compression and filtering along with
program flexibility to track evolving standards. DSPs also found a home in disk-drive and
other servo-motor-control applications.

ENHANCED DSPS EMERGE


As semiconductor process technology evolved, conventional DSPs began to acquire a
number of on-chip peripherals such as local memory, I/O ports, timers, and DMA
controllers. Their basic architecture, however, didn't change for more than a decade.
Eventually, though, the relative weakness in bit-level manipulation began to catch up with
conventional DSPs, as did the incessant demand for greater performance.


In response, vendors introduced enhanced DSPs. One common feature of these enhanced
DSPs is the presence of a second MAC, which allows for some parallelism in computation.
In many cases, this parallelism extends to
other elements in the DSP, allowing the device to perform single-instruction, multiple-data
(SIMD) operations. Often this is accomplished with data packing, which allows registers,
data paths, and the like to handle two half-word operands each clock cycle. Along with data
packing, many enhanced DSPs allow the instructions themselves to use fractional word
widths, which allows multiple instructions to launch simultaneously.
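The data-packing idea can be illustrated in portable C: two independent 16-bit additions carried out inside one 32-bit operation, with masking standing in for the lane separation a SIMD ALU provides in hardware (an illustrative sketch, not a vendor intrinsic):

```c
#include <stdint.h>

/* Add two pairs of packed 16-bit values in a single 32-bit operation.
   Masking keeps a carry out of the low half-word from corrupting the
   high half-word -- the lane isolation a SIMD ALU does in hardware. */
uint32_t packed_add16(uint32_t a, uint32_t b)
{
    uint32_t low  = (a + b) & 0x0000FFFFu;  /* low lane; carry into bit 16 discarded */
    uint32_t high = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;
    return high | low;
}
```

A dual-MAC enhanced DSP performs exactly this kind of split, treating each 32-bit register as two 16-bit operands per cycle.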

The enhanced DSPs also tend to incorporate features that speed execution of algorithms in
a specific application space as well as add special-purpose peripherals and memory. The
exact nature of the specialization varies with the application an enhanced DSP targets,
which makes direct comparisons difficult. Many include hardware accelerators for
frequently-used operations as well as provide specialized addressing modes and augmented
instruction sets that target the application space. The augmented instruction sets may
include both special DSP instructions and RISC-like instructions for improved control
operation.

Consider, for instance, the Blackfin DSP family from Analog Devices. This family targets
voice, video, and data communications signal processing along with control operations.
The core includes dual 16-bit MACs, dual 40-bit arithmetic logic units (ALUs), a 40-bit
barrel shifter, and quad 8-bit ALUs for video operations. Because the architecture allows
data packing, the 40-bit ALUs can handle two 40-bit numbers or four 16-bit numbers. In
addition, a control unit handles sequencing of instructions so that a mix of 16-bit control
and 32-bit DSP instructions can pack for simultaneous execution. Data can be in 8-, 16-, or
32-bit format.

Figure: Analog Devices Blackfin DSP architecture handles multi-width data words
and can simultaneously execute 16-bit control and 32-bit DSP instructions.

The core also includes two data address generators (DAGs) to simplify both DSP and
control operations. DSP addressing operations include circular buffering, for matrix
operations, and bit-reversal, for unscrambling FFT results. Control operations include auto-
increment, auto-decrement, and base-plus-immediate-offset addressing modes not found in
conventional DSPs.
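Bit-reversed addressing replaces an explicit index computation like the following, which a conventional processor would have to perform in software to unscramble FFT results (a minimal sketch):

```c
#include <stdint.h>

/* Reverse the low `bits` bits of an index -- the mapping between the
   scrambled output order of an in-place 2^bits-point FFT and natural
   order, which a DSP's bit-reversal addressing mode applies for free. */
uint32_t bit_reverse(uint32_t idx, unsigned bits)
{
    uint32_t rev = 0;
    for (unsigned i = 0; i < bits; i++) {
        rev = (rev << 1) | (idx & 1u);  /* shift in the next LSB */
        idx >>= 1;
    }
    return rev;
}
```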


INSTRUCTION SETS TARGET APPLICATIONS


The instruction set of the Blackfin core includes both general DSP instructions and RISC-
like control instructions. In addition, the core has complex instructions geared toward the
needs of the intended applications. For Huffman coding, used in communications
algorithms, there is a "Field Deposit/Extract" command. For the Discrete Cosine
Transform, used in imaging and video, an IEEE 1180 rounding operation is available.
Video compression algorithms can take advantage of the "Sum Absolute Difference"
instruction.
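What the "Sum Absolute Difference" instruction computes in hardware is, in effect, the following loop, which motion-estimation code must execute for every candidate block (a plain-C sketch):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over two blocks of 8-bit pixels -- the
   block-matching metric used by video compression. A SAD instruction
   collapses this whole loop body into one operation per packed word. */
uint32_t sad(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}
```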

These specialty instructions are one way that the Blackfin family targets applications. The
other way is the peripheral mix each family member offers. The ADSP-21532, for
example, aims at low-cost consumer multimedia applications by including peripherals
supporting surround-sound and video-specific operating modes. The ADSP-21535 goes
after high-performance communications applications with USB and PCI interfaces as well
as substantial amounts of on-chip SRAM.

The range and variety of variations within the Blackfin family as well as the nature of its
specialized instructions mirror the diversity of enhanced conventional DSPs, available from
companies such as Cirrus Logic, Motorola, and Texas Instruments. But for all the
enhancements, these DSPs follow basically the same programming model as the
conventional device.

Other DSP architectures have emerged that follow a different programming model. In
search of the highest performance levels, these architectures allow the DSP to launch
multiple instructions at the same time for parallel execution. While these approaches result
in greater code execution speed, they also make software more difficult to optimize. They
require careful instruction ordering to avoid needing simultaneous access to the same data.
They also need to avoid attempting simultaneous execution of instructions where one
instruction depends on the results of the other for its operands. Not all DSP application
software has a structure suitable for multiple-launch execution, but when it does, these
DSPs offer the highest performance.

PARALLELISM ARISES
Two different forms of multiple-launch DSPs have arisen: very long instruction word
(VLIW) and superscalar architectures. Both have multiple execution units configured to
operate in parallel and use RISC-like instruction sets. The instructions of a VLIW
architecture are explicitly parallel, being composed of several sub-instructions that control
different resources. The superscalar architectures, on the other hand, load instructions in
bulk, then use hardware run-time scheduling to identify instructions that can run in parallel
and map them to the execution units.
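The kind of code that benefits from multiple-launch execution can be sketched in C. Writing a dot product with two independent accumulators removes the dependence between successive MACs, so a dual-MAC VLIW or superscalar DSP can schedule them in the same cycle (an illustrative source-level sketch; optimizing compilers often perform this transformation themselves):

```c
#include <stddef.h>

/* Dot product with two independent partial sums. Because acc0 and acc1
   share no data dependence, the two MACs in each iteration can issue in
   parallel on a multi-launch DSP; a single-accumulator version forces
   the MACs to serialize. */
long dot_product(const short *x, const short *h, size_t n)
{
    long acc0 = 0, acc1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += (long)x[i]     * h[i];      /* maps to MAC unit 0 */
        acc1 += (long)x[i + 1] * h[i + 1];  /* maps to MAC unit 1 */
    }
    if (i < n)                              /* odd-length tail */
        acc0 += (long)x[i] * h[i];
    return acc0 + acc1;
}
```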

Of the multi-launch architectures, VLIW designs are the most common. Devices from
Adelante Technologies, Equator Technologies, Siroyan, and Texas Instruments fall into
this category, although they vary considerably with the type and number of parallel
execution units they offer. The TI TMS320C64xx processors, for instance, have eight
execution units that can handle both 8- and 16-bit SIMD operations. The Siroyan OneDSP,
on the other hand, is scalable from two to 32 clusters, each with several execution units.

The Adelante Saturn DSP core as shown in the following figure demonstrates the essence
of the VLIW approach. It uses multiple data buses in a dual-Harvard configuration to
deliver data and 96-bit wide instructions to an array of execution units simultaneously.
These units include two multipliers (MPY), four 16-bit ALUs that can combine to form
two 20-bit ALUs, a barrel shifter with saturation logic (SST/BRS), program (PCU) and
loop (LCU) controllers, address controllers (ACU), and an ability for design teams to add
application-specific execution units (AXU) to speed processing.

Figure: Adelante's Saturn DSP core handles VLIW instructions that can comprise
several sub-instructions that control different resources. The core also handles
application-specific execution units (AXUs) to accelerate processing.

The Saturn core uses a unique approach to get around one of the problems the wide word
widths of VLIW architectures cause. Accessing external memory is a challenge for these
DSPs, because of their need to work with buses that can be as wide as 128 bits. The Saturn
core uses 16-bit program memory, which it maps into the 96-bit instruction word it uses
internally. Adelante developed this mapping after analyzing millions of lines of code for
common applications. However, the core also allows developers to create their own
application-specific instructions that map into the VLIW.

SUPERSCALAR DSPS
While the 16-bit external instruction width of the Saturn processor is unusual for VLIW
architectures, it is typical for superscalar architectures. These devices pull in several
instructions at a time and dynamically map them to the execution units. Internally the effect
is much the same as a VLIW architecture in that execution units are operating in parallel.
But from the software development viewpoint the approach reduces programming
complexity. With hardware handling the sequencing and arranging of instructions, the
developer is free to work with the more manageable short instructions.

The figure below shows the structure of a sample superscalar DSP, the LSI Logic ZSP600.
Because it is a core, its memory interface isn't constrained, making it look like a VLIW
architecture. But the
presence of the instruction-sequencing unit (ISU) and the pipeline control unit betray its
superscalar nature. The ZSP600 fetches eight instructions at a time, and can execute as
many as six, using its four MAC and two ALU execution units simultaneously. Data
packing allows the units to perform 16- or 32-bit operations. The architecture also allows
for the addition of coprocessors to speed specific DSP functions.


Figure: Superscalar DSPs, such as LSI Logic's ZSP600, fetch several instructions
simultaneously and dynamically map these instructions to the execution units.

This ability to add coprocessors is becoming a common feature of high-performance DSP
cores. In many cases the core's creators have also created coprocessors for functions such
as DES (data encryption standard) and Viterbi coding. If a pre-designed coprocessor isn't
available, however, creating your own can be a major design challenge.

A recently-introduced DSP architecture, the PulseDSP from Systolix, might make the task
easier. Similar to an FPGA, the PulseDSP offers a massively parallel, repetitive structure. It
is designed as a systolic array, which means that all data transfers occur synchronously on a
clock edge. Each processing element in the array has selectable I/O paths, local data
memory, and an ALU. Both the I/O and the ALU are programmable, and the array has a
programming bus running through it. The combination makes the array reprogrammable,
either statically or dynamically. The array structure is intended to handle low-complexity
but high-speed processing tasks using 16- to 64-bit arithmetic, which makes it suitable as a
coprocessor.

Figure: Systolix's PulseDSP is a systolic array that can run as a coprocessor or as a
standalone unit for applications such as filters and FFTs. The array is programmable,
with each processing element having its own selectable I/O paths, local data memory,
and an ALU.


The array can also be used as a stand-alone processor for some types of algorithms, such as
filters and FFTs. One of the commercial implementations of the array, in fact, is to provide
filtering in an Analog Devices data acquisition part, the AD7725. The device combines the
PulseDSP with a sigma-delta A/D converter to provide post-processing of the acquired
data. The DSP array implements various filter algorithms.

Innovations such as the PulseDSP as well as the proliferation within the other DSP
architectures are a strong indicator of how important these once-arcane processors have
become. In many applications, especially communications, they share the spotlight with the
RISC processor. The DSP handles the data and the RISC handles the protocols. There are
problems with the two-processor approach, of course, including increased cost and
software development complexity. One reason many DSPs are adding RISC-like
instructions to their set is to be able to edge out the other processor in such applications.
The same thing is happening with some RISC processors. Extensible cores, such as the
Tensilica Xtensa and the ARC International ARCtangent, are offering DSP enhancements
so that communications applications need only one processor. These enhancements follow
the architecture of the conventional DSP, but merge the DSP functions into the instruction
set of the RISC core.

The ARCtangent, shown in the figure below, demonstrates how the two get blended. The DSP instruction decode and
processing elements both connect with the rest of the core, allowing them to use the core's
resources as well as their own. The extensions have full access to registers and operate in
the same instruction stream as the RISC core. ARC's DSP offerings include MACs in
varying widths, saturation arithmetic, and X-Y memory for DSP data. The extensions also
support DSP addressing modes such as bit-reversal.

Figure: The ARCtangent core from ARC International blends DSP functionality into a
RISC processor. Both DSP instruction-decode and processing elements connect with the
rest of the core, allowing these elements to use the core's resources as well as their own.

These extended RISC processors, enhanced conventional DSPs, and high-performance
architectures have all proliferated in the last few years, a sure sign of the importance DSPs
have acquired. Furthermore, that proliferation is likely to continue. With process
technology allowing integration of multiple peripherals with DSP cores and instruction sets
extending to match application needs, DSPs are headed the way of the microcontroller.
From obscure, specialized parts, they are evolving to become a fundamental building block
for virtually any system.


CHAPTER 6
ARCHITECTURE OF LATEST DSP PROCESSORS
TEXAS INSTRUMENTS TMS320C67xx FAMILY

OVERVIEW

The TMS320C67xx family comprises the highest-performance floating-point DSPs in the
TMS320 line. It is based on the advanced VelociTI very-long-instruction-word (VLIW)
architecture, which allows it to execute up to eight RISC-like instructions per clock cycle,
making this DSP an excellent choice for multichannel and multifunction applications. The
family adds support for floating-point arithmetic and 64-bit data. It delivers up to 1 giga
floating-point operations per second (GFLOPS) at a clock rate of 167 MHz, uses a
1.8-volt core supply, and executes up to 334 million MACs per second at 167 MHz. The
TMS320C67xx's two data paths extend hardware support for 64-bit data and IEEE-754
32-bit single-precision and 64-bit double-precision floating-point arithmetic. Each data
path includes a set of four execution units, a general-purpose register file, and paths for
moving data between memory and registers.

The four execution units in each data path comprise two ALUs, a multiplier, and an
adder/subtractor which is used for address generation. The ALUs support both integer and
floating point operations, and the multipliers can perform both 16x16-bit and 32x32-bit
integer multiplies and 32-bit and 64-bit floating point multiplies. The two register files each
contain sixteen 32-bit general-purpose registers. These registers can be used for storing
addresses or data. To support 64-bit floating point arithmetic, pairs of adjacent registers can
be used to hold 64-bit data.

The C6701 DSP possesses the operational flexibility of high-speed controllers and the
numerical capability of array processors. This processor has 32 general-purpose registers of
32-bit word length and eight highly independent functional units. The eight functional units
provide four floating-/fixed-point ALUs, two fixed-point ALUs, and two floating-/fixed-
point multipliers. Program memory consists of a 64K-byte block that is user-configurable
as cache or memory-mapped program space. Data memory consists of two 32K-byte
blocks of RAM. The peripheral set includes two multichannel buffered serial ports
(McBSPs), two general-purpose timers, a host-port interface (HPI), and a glueless external
memory interface (EMIF) capable of interfacing to SDRAM or SBSRAM and
asynchronous peripherals.

The on-chip memory system of the TMS320C67xx implements a modified Harvard
architecture, providing separate address spaces for program and data memory.
Program memory has a 32-bit address bus and a 256-bit data bus. Each of the two data
paths is connected to data memory by a 32-bit address bus and two 32-bit data buses. Since
there are two 32-bit data buses for each data path, the TMS320C67xx can load two 64-bit
words per instruction cycle. TMS320C6701 has 64 Kbytes of 32-bit on-chip program RAM
and 64 Kbytes of 16-bit on-chip data RAM.

The TMS320C6701 has one external memory interface, which provides a 23-bit address
bus and a 32-bit data bus. These buses are multiplexed between program and data memory
accesses. Addressing modes supported include register-direct, register-indirect, indexed
register-indirect, and modulo addressing. Immediate data is also supported.


The TMS320C67xx does not support hardware looping, and hence all loops must be
implemented in software. However, the parallel architecture of the processor allows the
implementation of software loops with virtually no overhead.
The peripherals on the TMS320C6701 include a host port, a four-channel DMA controller,
two TDM-capable buffered serial ports, and two 32-bit timers.

CPU ARCHITECTURE


CPU DESCRIPTION

Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The
variable-length execute packets are a key memory-saving feature, distinguishing the C67x
CPU from other VLIW architectures.

The CPU features two sets of functional units. Each set contains four units and a register
file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units
.D2, .M2, .S2, and .L2. The two register files contain sixteen 32-bit registers each, for a
total of 32 general-purpose registers. The two sets of functional units, along with the two
register files, compose sides A and B of the CPU.

The four functional units on each side of the CPU can freely share the 16 registers
belonging to that side. Additionally, each side features a single data bus connected to all
registers on the other side, by which the two sets of functional units can access data from
the register files on opposite sides.

In addition to the C62x DSP fixed-point instructions, six out of the eight functional units
(.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining
two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64
bits per CPU side for a total of 128 bits per cycle.

Another key feature of the C67x CPU is the load/store architecture, where all instructions
operate on registers. Two data-addressing units (.D1 and .D2) are responsible for all
data transfers between the register files and the memory. The data address driven by the .D
units allows data addresses generated from one register file to be used to load or store data
to or from the other register file. The C67x CPU supports a variety of indirect-addressing
modes using either linear- or circular-addressing modes with 5- or 15-bit offsets. All
instructions are conditional, and most can access any one of the 32 registers. Some
registers, however, are singled out to support specific addressing or to hold the condition
for conditional instructions. The two .M functional units are dedicated multipliers.

The two .S and .L functional units perform a general set of arithmetic, logical, and branch
functions with results available every clock cycle. The processing flow begins when a 256-
bit-wide instruction fetch packet is fetched from a program memory. The 32-bit
instructions destined for the individual functional units are “linked” together by “1” bits in
the least significant bit (LSB) position of the instructions. The instructions that are
“chained” together for simultaneous execution compose an execute packet. A “0” in the
LSB of an instruction breaks the chain, effectively placing the instructions that follow it in
the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits
wide), the assembler places it in the next fetch packet, while the remainder of the current
fetch packet is padded with NOP instructions. The number of execute packets within a
fetch packet can vary from one to eight. Execute packets are dispatched to their respective
functional units at the rate of one per clock cycle and the next 256-bit fetch packet is not
fetched until all the execute packets from the current fetch packet have been dispatched.
After decoding, the instructions simultaneously drive all active functional units for a
maximum execution rate of eight instructions every clock cycle. While most results are
stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-
words as well. All load and store instructions are byte-, half-word, or word-addressable.
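The grouping rule described above can be sketched in C: counting the zero p-bits in the eight 32-bit slots of a fetch packet gives the number of execute packets it contains (an illustrative model, not TI code):

```c
#include <stdint.h>

/* Model of C67x execute-packet grouping. A fetch packet holds eight
   32-bit instruction slots. A "1" in a slot's LSB chains the following
   slot into the same execute packet; a "0" ends the current packet.
   Counting the zeros therefore counts the execute packets. */
int count_execute_packets(const uint32_t fetch_packet[8])
{
    int packets = 0;
    for (int slot = 0; slot < 8; slot++)
        if ((fetch_packet[slot] & 1u) == 0)  /* p-bit of 0 breaks the chain */
            packets++;
    return packets;
}
```

With all p-bits clear the code is fully serial (eight execute packets); with the first seven set, all eight instructions issue together as one packet.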


ANALOG DEVICES ADSP-21XX FAMILY

OVERVIEW

The ADSP-21xx is the first single-chip DSP processor family from Analog Devices. The
family consists of a large number of processors based on a common 16-bit fixed-point
architecture core with a 24-bit instruction word. Each processor combines the core DSP
architecture (computation units, data address generators, and program sequencer) with
differentiating features such as on-chip program and data RAM, a programmable timer,
and one or two serial ports.

The fastest members of the family operate at 75 MIPS at 2.5 volts, 52 MIPS at 3.3 volts,
and 40 MIPS at 5.0 volts. Analog Devices has recently announced the ADSP-219x series,
which offers projected speeds of up to 300 MIPS, as well as architectural enhancements.
ADSP-21xx processors are targeted at modem, audio, PC multimedia, and digital cellular
applications.

Fabricated in a high-speed, submicron, double-layer metal CMOS process, early
ADSP-21xx processors operate at 25 MHz with a 40 ns instruction cycle time.
Every instruction can execute in a single cycle. Fabrication in CMOS results in low power
dissipation. The ADSP-2100 Family’s flexible architecture and comprehensive instruction
set support a high degree of parallelism.

The ADSP-21xx data path consists of three separate arithmetic execution units: an
arithmetic/logic unit (ALU), a multiplier/accumulator (MAC), and a barrel shifter. Each
unit is capable of single-cycle execution, but only one of these units can be active during a
single instruction cycle. The ALU operates on 16-bit data. In addition to the usual ALU
operations, the ALU provides increment/decrement, absolute value, and add-with-carry
functions. ALU results are saturated upon overflow if the appropriate configuration bit is
set by the programmer. The MAC unit includes a 16x16->32-bit multiplier, four input
registers, a feedback register, a 40-bit adder, and a single 40-bit result register/accumulator
providing eight guard bits. Besides signed operands, the multiplier can operate on
unsigned/unsigned or on signed/unsigned operands, thus supporting multi-precision
arithmetic. The barrel shifter shifts 16-bit inputs from an input register or from the
ALU/MAC/barrel shifter result registers into a 32-bit result register. Logical and arithmetic
shifts are supported left or right up to 32 bits. The barrel shifter also supports block
floating-point arithmetic with block exponent detect (which determines a maximum
exponent of a block of data), single-word exponent detect, normalize, and exponent adjust
instructions.
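The mixed-sign multiplies are what make multi-precision arithmetic work. As an illustrative sketch, a full 32x32-bit signed product can be assembled from 16x16-bit partial products, with the high halves treated as signed and the low halves as unsigned:

```c
#include <stdint.h>

/* Build a 64-bit signed product from four 16x16 partial products -- the
   multi-precision scheme a 16-bit MAC with signed/unsigned multiply
   modes supports. High halves are signed, low halves unsigned. */
int64_t mul32x32(int32_t a, int32_t b)
{
    int16_t  ah = (int16_t)(a >> 16);   /* signed high halves   */
    int16_t  bh = (int16_t)(b >> 16);
    uint16_t al = (uint16_t)a;          /* unsigned low halves  */
    uint16_t bl = (uint16_t)b;

    int64_t  hh = (int64_t)ah * bh;     /* signed   x signed    */
    int64_t  hl = (int64_t)ah * bl;     /* signed   x unsigned  */
    int64_t  lh = (int64_t)al * bh;     /* unsigned x signed    */
    uint64_t ll = (uint64_t)al * bl;    /* unsigned x unsigned  */

    /* a*b = hh*2^32 + (hl + lh)*2^16 + ll */
    return (hh << 32) + ((hl + lh) << 16) + (int64_t)ll;
}
```

On the ADSP-21xx the four partial products accumulate in the 40-bit MAC result register, with the guard bits absorbing intermediate overflow.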

ADSP-21xx processors use a modified Harvard architecture with separate memory spaces
and on-chip bus sets for program and data. All processors in the ADSP-21xx family
include on-chip program RAM or ROM and on-chip data RAM.

On-chip program memory can be used for both instructions and data, and it can be
accessed via a 14-bit address bus and a 24-bit data bus. On-chip program memory is dual-
ported to allow the processor to fetch both a data operand and the next instruction in a
single instruction cycle. The on-chip data memory can be accessed via a 14-bit address bus
and a 16-bit data bus. One access to the on-chip data memory can be performed in a single
instruction cycle. Three memory accesses (one instruction and two data operands) can be
performed in one instruction cycle.


Both of the on-chip memory spaces can be extended off-chip. All ADSP-21xx processors
have one external memory interface, providing a 14-bit address bus and a 24-bit data bus.
This external interface is multiplexed between program and data memory accesses.

The ADSP-21xx supports register-direct, memory-direct and register-indirect addressing


modes. Immediate data is also supported. The ADSP-21xx provides zero-overhead
program looping through its DO instruction. Any length sequence of instructions can be
contained in a hardware loop, and up to 16,384 repetitions are supported.

ARCHITECTURE OVERVIEW

The processors contain three independent computational units: the ALU, the
multiplier/accumulator (MAC), and the shifter. The ALU performs a standard set of
arithmetic and logic operations; division primitives are also supported. The MAC performs
single-cycle multiply, multiply/add, and multiply/subtract operations. The shifter performs
logical and arithmetic shifts, normalization, denormalization, and derive-exponent
operations. The shifter can be used to efficiently implement numeric format control,
including multiword floating-point representations. The internal result (R) bus directly
connects the computational units so that the output of any unit may be used as the input of
any unit on the next cycle. A powerful program sequencer and two dedicated data address
generators ensure efficient use of these computational units. The sequencer supports
conditional jumps, subroutine calls, and returns in a single cycle. With internal loop
counters and loop stacks, the ADSP-21xx executes looped code with zero overhead; no
explicit jump instructions are required to maintain the loop. Two data address generators
(DAGs) provide addresses for simultaneous dual operand fetches (from data memory and
program memory). Each DAG maintains and updates four address pointers. Whenever the
pointer is used to access data (indirect addressing), it is post-modified by the value of one
of four modify registers. A length value may be associated with each pointer to implement
automatic modulo addressing for circular buffers. The circular buffering feature is also
used by the serial ports for automatic data transfers to On chip memory. Efficient data
transfer is achieved with the use of five internal buses namely : Program Memory Address
(PMA) Bus , Program Memory Data (PMD) Bus, Data Memory Address (DMA) Bus,
Data Memory Data (DMD) Bus and the Result (R) Bus.
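A DAG pointer's post-modify-with-wrap behavior can be modeled in C as follows. The I, M, and L names follow the family's register naming; the explicit base field B is an illustrative simplification, since on the actual parts the buffer base is implied by the buffer's placement in memory:

```c
/* Model of one ADSP-21xx DAG address pointer. On every indirect access
   the pointer I is post-modified by M; a nonzero length L makes the
   update wrap automatically, implementing modulo (circular) addressing. */
typedef struct {
    int I;  /* index (address) register, used then post-modified */
    int M;  /* modify register, added after each access          */
    int L;  /* buffer length; 0 disables wrapping                */
    int B;  /* buffer base address (illustrative field)          */
} dag_ptr;

/* Return the address to use for this access, then post-modify. */
int dag_access(dag_ptr *p)
{
    int addr = p->I;
    p->I += p->M;
    if (p->L > 0) {                       /* automatic modulo addressing */
        if (p->I >= p->B + p->L) p->I -= p->L;
        else if (p->I < p->B)    p->I += p->L;
    }
    return addr;
}
```

This single-wrap correction is valid when |M| is less than L, which is how the hardware modulo logic is specified to be used.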


The two address buses (PMA, DMA) share a single external address bus, allowing memory
to be expanded off-chip, and the two data buses (PMD, DMD) share a single external data
bus. The BMS, DMS, and PMS signals indicate which memory space is using the external
buses. Program memory can store both instructions and data, permitting the ADSP-21xx to
fetch two operands in a single cycle, one from program memory and one from data
memory. The processor can fetch an operand from on-chip program memory and the next
instruction in the same cycle. The memory interface supports slow memories and
memory-mapped peripherals with programmable wait-state generation. External devices can
gain control of the processor’s buses with the use of the bus request/grant signals.

One bus grant execution mode (GO Mode) allows the ADSP-21xx to continue running
from internal memory. A second execution mode requires the processor to halt while buses
are granted. Each ADSP-21xx processor can respond to several different interrupts. There
can be up to three external interrupts, configured as edge- or level-sensitive. Internal
interrupts can be generated by the timer, serial ports, and, on the ADSP-2111, the host
interface port. There is also a master RESET signal. Booting circuitry provides for loading
on-chip program memory automatically from byte-wide external memory. After reset, three
wait states are automatically generated. This allows, for example, a 60 ns ADSP-2101 to
use a 200 ns EPROM as external boot memory. Multiple programs can be selected and
loaded from the EPROM with no additional hardware. The data receive and transmit pins
on SPORT1 (Serial Port 1) can be alternatively configured as a general-purpose input flag
and output flag. You can use these pins for event signalling to and from an external device.
A programmable interval timer can generate periodic interrupts. A 16-bit count register
(TCOUNT) is decremented every n cycles, where n–1 is a scaling value stored in an 8-bit
register (TSCALE). When the value of the count register reaches zero, an interrupt is
generated and the count register is reloaded from a 16-bit period register (TPERIOD).
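The timer arithmetic above can be made concrete. TCOUNT is decremented once every (TSCALE + 1) cycles and an interrupt fires when it reaches zero, so the interrupt period in processor cycles is (TPERIOD + 1) × (TSCALE + 1); this formula is assumed from the decrement/reload behaviour described above, so consult the data sheet for exact reload timing.

```python
# Sketch of the interval-timer arithmetic: interrupts-per-second for given
# TPERIOD/TSCALE register values (assumed reload semantics, see text).

def timer_interrupt_rate(clock_hz, tperiod, tscale):
    """Periodic interrupt rate in Hz for the given register settings."""
    period_cycles = (tperiod + 1) * (tscale + 1)  # cycles between interrupts
    return clock_hz / period_cycles

# Example: a 20 MHz clock with TPERIOD = 19999 and TSCALE = 0
# gives a 1 kHz periodic interrupt.
rate = timer_interrupt_rate(20_000_000, 19_999, 0)
# rate == 1000.0
```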


BLACKFIN PROCESSOR

Blackfin Processors are a new breed of embedded media processor. Based on the Micro
Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin Processors
combine a 32-bit RISC-like instruction set and dual 16-bit multiply accumulate (MAC)
signal processing functionality with the ease-of-use attributes found in general-purpose
microcontrollers. This combination of processing attributes enables Blackfin Processors to
perform equally well in both signal processing and control processing applications, in many
cases eliminating the need for separate heterogeneous processors.

This processor family also offers industry-leading power efficiency, as low as
0.15 mW/MMAC at 0.8 V. This combination of high performance and low power is
essential in meeting the needs of today's and future signal processing applications including
broadband wireless, audio/video capable Internet appliances, and mobile communications.

HIGH PERFORMANCE SIGNAL PROCESSING

The core architecture employs a fully interlocked instruction pipeline, multiple parallel
computational blocks, efficient DMA capability, and instruction set enhancements
designed to accelerate video processing.

FULLY INTERLOCKED INSTRUCTION PIPELINE

All Blackfin Processors utilize a multi-stage fully interlocked pipeline that guarantees code
is executed as you would expect and that all data hazards are hidden from the programmer.
This type of pipeline guarantees result accuracy by stalling when necessary to achieve
proper results.

HIGHLY PARALLEL COMPUTATIONAL BLOCKS

The basis of the Blackfin Processor architecture is the Data Arithmetic Unit that includes
two 16-bit Multiplier Accumulators (MACs), two 40-bit Arithmetic Logic Units (ALUs),
four 8-bit video ALUs, and a single 40-bit barrel shifter. Each MAC can perform a 16-bit
by 16-bit multiply every cycle; together, the two MACs draw on four independent 16-bit
data operands per cycle. The 40-bit ALUs can
accumulate either two 40-bit numbers or four 16-bit numbers. With this architecture, 8-,
16- and 32-bit data word sizes can be processed natively for maximum efficiency.

Two Data Address Generators (DAGs) are complex load/store units designed to generate
addresses to support sophisticated DSP filtering operations. For DSP addressing, bit-
reversed addressing and circular buffering are supported. The DAGs also include two loop
counters for nested zero overhead looping and hardware support for on-the-fly saturation
and clipping.
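The bit-reversed addressing mode mentioned above exists to reorder FFT data without explicit shuffling. A DAG performs this in hardware; the following is a plain-Python behavioural model for a power-of-two buffer.

```python
# Sketch of bit-reversed addressing for an N-point (power-of-two) FFT buffer.

def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)  # shift the next LSB into the result
        index >>= 1
    return out

# For an 8-point FFT (3 address bits), sequential indices map to the
# classic bit-reversed access order:
order = [bit_reverse(k, 3) for k in range(8)]
# order == [0, 4, 2, 6, 1, 5, 3, 7]
```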

HIGH BANDWIDTH DMA CAPABILITY

All Blackfin Processors have multiple, independent DMA controllers that support
automated data transfers with minimal overhead from the processor core. DMA transfers
can occur between the internal memories and any of the many DMA-capable peripherals.

VIDEO INSTRUCTIONS

In addition to native support for 8-bit data, the word size common to many pixel processing
algorithms, the Blackfin Processor architecture includes instructions specifically defined to


enhance performance in video processing applications. These enhanced instructions were
defined with common video compression algorithms in mind.

EFFICIENT CONTROL PROCESSING

Blackfin Processors also offer control processing capabilities similar to those of RISC
control processors. These features include a hierarchical memory architecture, superior
code density, and a variety of
microcontroller-style peripherals including a watch-dog timer, real-time clock, and an
integrated SDRAM controller. The L1 memory is connected directly to the processor core,
runs at full system clock speed, and offers maximum system performance for time critical
algorithm segments. The L2 memory is a larger, bulk memory storage block that offers
slightly reduced performance but is still faster than off-chip memory.

The L1 memory structure has been implemented to provide the performance needed for
signal processing while offering the programming ease found in general purpose
microcontrollers. By supporting both SRAM and cache programming models, system
designers can allocate critical DSP data sets that require high bandwidth and low latency
into SRAM, while maintaining the simple programming model of the data cache for
operating system (OS) and microcontroller code.

The Memory Management Unit provides for a memory protection format that can support a
full OS Kernel. The OS Kernel runs in Supervisor mode and partitions blocks of memory
and other system resources for the actual application software to run in User mode. This is
a unique and powerful feature not present on traditional DSPs.

SUPERIOR CODE DENSITY

The Blackfin Processor architecture supports multi-length instruction encoding. Very
frequently used control-type instructions are encoded as compact 16-bit words, with more
mathematically intensive DSP instructions encoded as 32-bit values.

DYNAMIC POWER MANAGEMENT

Blackfin Processors employ multiple power-saving techniques. They are based on a gated
clock core design that selectively powers down functional units on an instruction-by-
instruction basis. They also support multiple power-down modes for periods where little or
no CPU activity is required. Lastly, and probably most importantly, Blackfin Processors
support a dynamic power management scheme whereby the operating frequency AND
voltage can be tailored to meet the performance requirements of the algorithm currently
being executed.

BLACKFIN PROCESSOR CORE BASICS

The Blackfin Processor core is a load-store architecture consisting of a Data Arithmetic
Unit, an Address Arithmetic Unit, and a sequencer unit. Blackfin Processors combine a
high-performance, dual-MAC DSP architecture with the programming ease of a RISC
MCU in a single instruction set architecture.


GENERAL PURPOSE REGISTER FILES

The Blackfin Processor core includes an 8-entry by 32-bit data register file for general use
by the computational units. Supported data types include 8-, 16-, or 32-bit signed or
unsigned integer and 16- or 32-bit signed fractional. In every clock cycle, this multiported
register file supports two 32-bit reads AND two 32-bit writes. It can also be accessed as a
16-entry by 16-bit data register file.

The address register file provides a general purpose addressing mechanism in addition to
supporting circular buffering and stack maintenance. This register file consists of 8 entries
and includes a frame pointer and a stack pointer. The frame pointer is useful for subroutine
parameter passing, while the stack pointer is useful for storing the return address from
subroutine calls.

DATA ARITHMETIC UNIT

It contains:

• Two 16-bit MACs
• Two 40-bit ALUs
• Four 8-bit video ALUs
• A single 40-bit barrel shifter

All computational resources can process 8-, 16-, or 32-bit operands from the data register
file (R0 through R7). Each register can be accessed as a 32-bit register or as its 16-bit
high or low half.

In a single clock cycle, the dual data paths can read AND write up to two 32-bit values.
However, since the high and low halves of the R0 through R7 registers are individually
addressable (Rx, Rx.H, or Rx.L), each computational block can choose from either two 32-
bit input values or four 16-bit input values with no restrictions on input data. The results of
the computation can be written back into the register file as either a 32-bit entity or as the
high or low 16-bit half of the register. Additionally, the method of accumulation can vary
between data paths.
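The Rx / Rx.H / Rx.L register views described above can be modelled with simple masking and shifting: each 32-bit register is simultaneously addressable as two independent 16-bit halves. The function names below are illustrative, not actual instruction mnemonics.

```python
# Behavioural sketch of 32-bit registers with independently addressable
# 16-bit halves (Rx, Rx.H, Rx.L).

def read_high(r):      # Rx.H -- upper 16 bits
    return (r >> 16) & 0xFFFF

def read_low(r):       # Rx.L -- lower 16 bits
    return r & 0xFFFF

def write_high(r, h):  # replace the upper half, preserve the lower half
    return ((h & 0xFFFF) << 16) | (r & 0xFFFF)

r0 = 0x1234_5678
hi, lo = read_high(r0), read_low(r0)   # 0x1234, 0x5678
r0 = write_high(r0, 0xABCD)            # r0 == 0xABCD5678
```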


Both accumulators are 40 bits in length, providing 8 bits of extended precision. Similar to
the general purpose registers, both accumulators can be accessed in 16-, 32-, or 40-bit
increments. The Blackfin architecture also supports a combined add/subtract instruction
that can generate two 16-, 32-, or 40-bit results or four 16-bit results. In the case where four
16-bit results are desired, the high and low half results can be interchanged. This is a very
powerful capability and significantly improves, for instance, the FFT benchmark results.
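The combined add/subtract instruction maps directly onto the radix-2 FFT butterfly, which needs exactly one sum and one difference of the same operand pair. A behavioural sketch follows; plain Python complex arithmetic stands in for the paired 16-bit datapaths.

```python
# Radix-2 decimation-in-time butterfly: one sum and one difference of the
# same operand pair, which a combined add/subtract produces in one step.

def butterfly(a, b, w):
    """Return (a + w*b, a - w*b) for twiddle factor w."""
    t = w * b              # the shared multiply
    return a + t, a - t    # the combined add/subtract pair

# Example with the trivial twiddle factor w = 1:
top, bot = butterfly(3 + 0j, 1 + 0j, 1 + 0j)
# top == (4+0j), bot == (2+0j)
```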

ADDRESS ARITHMETIC UNIT

Two data address generators (DAGs) provide addresses for simultaneous dual operand
fetches from memory. The DAGs share a register file that contains four sets of 32-bit index
(I), length (L), base (B), and modify (M) registers. There are also eight additional 32-bit
address registers—P0 through P5, frame pointer, and stack pointer that can be used as
pointers for general indexing of variables and stack locations.

The four sets of I, L, B, and M registers are useful for implementing circular buffering.
Used together, each set of index, length, and base registers can implement a unique circular
buffer in internal or external memory. The Blackfin architecture also supports a variety of
addressing modes, including indirect, auto increment and decrement, indexed, and bit
reversed. Last, all address registers are 32 bits in length, supporting the full 4 Gbyte
address range of the Blackfin Processor architecture.

PROGRAM SEQUENCER UNIT

The program sequencer controls the flow of instruction execution and supports conditional
jumps and subroutine calls, as well as nested zero-overhead looping. A multistage fully
interlocked pipeline guarantees code is executed as expected and that all data hazards are
hidden from the programmer. This type of pipeline guarantees result accuracy by stalling
when necessary to achieve proper results.

The Blackfin architecture supports 16- and 32-bit instruction lengths in addition to limited
multi-issue 64-bit instruction packets. This ensures maximum code density by encoding the
most frequently used control instructions as compact 16-bit words and the more
challenging math operations as 32-bit double words.


LSI LOGIC ZSP600-QUAD MAC SUPERSCALAR CORE


OVERVIEW

The ZSP600 is a quad MAC superscalar DSP core that addresses the high performance
data throughput and signal processing requirements of emerging communications
platforms. The ZSP600 supports up to six-instruction-per-cycle (6 IPC) DSP performance
at a peak 300 MHz clock rate. It includes quad MAC and quad ALU computational
resources, a high-performance
load/store memory architecture, and dedicated co-processor interfaces, combined with
state-of-the-art power reduction techniques. These attributes make the ZSP600 core an
ideal solution for a variety of embedded DSP algorithms, including those required for
wireless infrastructure, mobile (3G), IAD/home gateway, central office, and
access/network applications. ZSP600 instruction parallelism is supported by user-
transparent instruction grouping and pipeline control, delivering superscalar DSP
performance while the code is written with a RISC instruction set.

The ZSP600 is a fully synthesizable, single-phase, clocked architecture, with all core I/Os
registered for ease-of-process migration and design flexibility. The ZSP600 provides
extensive computational resources, including four 16-bit multipliers/MACs, dual 40-bit
ALUs, and dual 16-bit ALUs, all capable of supporting 16-and 32-bit operations. The
ZSP600 can perform four independent 16x16 MUL/MAC operations into four 16-bit or
two 40-bit results, two 32x32-bit MUL/MACs into a 32-bit result, or two Viterbi (add-
compare-select) results per cycle. The ZSP600 is based upon a high-bandwidth memory
architecture with a separate eight-instruction-per-cycle prefetch interface and dual 64-bit
data interfaces over a 24-bit address space. The instruction memory architecture allows
multi-instruction-per-cycle prefetch into an integrated instruction cache. The data memory architecture
incorporates dual independent 64-bit load/store units, with dedicated address generation,
allowing up to eight 16-bit word or four 32-bit word load/store operations per cycle. The
ZSP600 integrates a bi-directional co-processor interface to support hardware acceleration.
The memory subsystem (MSS) is decoupled from the DSP operations to provide increased
flexibility in support of different memory schemes. The core also includes instruction set
enhancements to the RISC architecture for improved broadband and wireless application
support.


A WORD ON SUPERSCALAR DSP

A superscalar architecture simply implies that the architecture is responsible for resolving
the operand and resource hazards and that it has the resources to achieve an instruction
throughput that is greater than one instruction per clock. Logic dedicated to pipeline control
is kept to a minimum by enforcing in-order execution and by isolating the control to a
single stage at the head of the pipeline. This stage issues sequential groups of instructions
that have no data dependencies or other resource conflicts. Once a group of instructions has
been issued, they advance through the pipeline in lock step.

A VLIW machine does not employ instruction scheduling or pipeline protection.
Instructions in a VLIW pipeline are statically issued, and it is the programmer’s
responsibility to prevent data hazards and resource conflicts. Superscalar architectures also
facilitate software compatibility, not only between implementations of the same
architecture but also from one generation of the architecture to the next, thus extending
software lifetime.

ARCHITECTURE OVERVIEW

The G2 architecture is scalable in terms of arithmetic resources, data bandwidth, and
pipeline capacity. This scalable nature allows the architecture to support multiple
implementations that target different application spaces.

All address and data I/O communication across the core boundary is registered. This
feature is highly desirable from a SOC system designer’s point of view for a number of
reasons, one being the removal of timing budget ambiguities between system logic and the
core.

The prefetch unit (PFU) is at the head of the instruction pipeline. The ZSP600 can prefetch
eight 16-bit words per cycle. It is responsible for maximizing the probability that the
instruction cache has the data required by the instruction sequencing unit (ISU) for any
given fetch cycle. The prefetch unit performs limited decoding to identify code
discontinuities and to apply static branch prediction when necessary. The ISU is
responsible for instruction fetch and decode, instruction grouping, and instruction issue.
Instruction grouping refers to the pipeline stage in which operand dependencies are
resolved. The ISU issues groups of in-order instructions that will not cause any operand
conflicts. This is the only unit (and only stage in the execution pipeline) that enforces
pipeline protection. Isolating the pipeline protection logic in this manner simplifies pipeline
control logic significantly.

The ZSP600 ISU can issue up to six instructions per cycle, one to each of the six primary
datapaths: two address generation units (AGUs), two arithmetic logic units (ALUs), and
two multiply/accumulate/arithmetic units (MAUs) that are capable of performing up to four
MAC operations per cycle. The pipeline control unit (PCU) stages control associated with
each of the primary data paths and the bypass logic. The PCU is also responsible for
managing interrupt control, the co-processor interface, the debug interface, and the on-core
timers. The bypass unit (BYP) handles all the data forwarding between execution units.


PIPELINE

The pipeline of the G2 architecture is an eight-stage pipeline. The existing architecture uses
a data prefetch mechanism, called data linking, to efficiently sustain required data
bandwidth for its dual-MAC datapath. All pipeline protection and resource allocation is performed
during the grouping stage. Instruction groups are issued by the grouping stage and advance
in lock step down the remainder of the pipeline.
Data address generation is performed in the AG stage. This stage is also responsible for
enforcing the boundaries of the circular buffers. A load or store that straddles a boundary of
the circular buffer is split by the AGU into two sequential accesses. Stages M0 and M1 are
allocated for data memory loads. They are optimized for systems using synchronous RAM.
M0 is allocated for address decode and M1 for data access and return. Load and store
requests are registered and issued to the memory subsystem in M0. The memory interface
is stallable. If the MSS determines that it cannot return requested data during M1, it stalls
the core until the data is ready.
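The AG-stage behaviour described above, splitting a load or store that straddles a circular buffer boundary into two sequential accesses, can be sketched as follows. Addresses and word counts are illustrative only.

```python
# Sketch: split a multi-word circular-buffer access at the buffer boundary,
# as the AGU does for a load/store that straddles the end of the buffer.

def split_circular_access(addr, nwords, base, length):
    """Return the list of (start_address, word_count) accesses issued."""
    end = base + length
    if addr + nwords <= end:
        return [(addr, nwords)]            # fits -- a single access
    first = end - addr                     # words up to the buffer boundary
    return [(addr, first), (base, nwords - first)]  # then wrap to base

# A 4-word load starting 2 words before the end of a 16-word buffer
# at address 0x200 is issued as two sequential accesses:
accesses = split_circular_access(0x20E, 4, 0x200, 16)
# accesses == [(0x20E, 2), (0x200, 2)]
```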

ARITHMETIC RESOURCES

By adding two AGUs, along with dedicated address registers, the arithmetic throughput of
G2 demonstrates an immediate improvement. The two AGUs allow the core to issue any
combination of two loads or stores per cycle. The data size of the load/store is
implementation specific. Each data port in the ZSP600 is 64-bits wide, allowing a total of
128-bits (8 words) of data to be loaded per cycle. The AGUs have dedicated hardware to
support four circular buffers and reverse-carry addressing. The circular buffer support has
been enhanced in functionality to support load/store operations with positive and negative
offsets and signed indexes. Circular buffer logic also applies to address arithmetic and has
no alignment restrictions.


REGISTER RESOURCES

With the 32-bit address registers, the architecture allows implementations of the core to
remain flexible in defining the physical linear address space. The actual address register
remains a 32-bit register to ensure pointer sizes remain the same from one implementation
to the next. This also allows the address registers to be used as temporary registers for the
GPRs. Dedicated address registers simplify the instruction decoder and issue logic as it can
now identify address related operations and assign the datapath resources appropriately.
The primary operand resource of the AGUs is the address register file, allowing the
general-purpose register file to be physically optimized for data moving to and from the
ALUs and MAUs. The current generation defines two 32-bit registers together with a 16-bit
register whose low and high bytes serve as the upper (guard) byte of each accumulator,
respectively, yielding 40-bit accumulators. A guard byte is thus available for each
of the eight extended 32-bit registers of the GPRs. Accumulators are also recognized in the
programming model by providing associated instruction set support for 40-bit arithmetic
and data loads and stores.

INSTRUCTION SET ENHANCEMENTS

A powerful enhancement of the new architecture is the ability to conditionally execute
instructions. The programming model for G2 allows programmers to define packets of
instructions that are predicated on a specified condition. The programmer defines a
bracketed set of up to eight instructions that will be predicated in the execution pipeline
based on that condition. A packet of instructions can be issued over multiple cycles, using
the same operand and resource rules enforced by the grouping stage, but it is atomic in the
sense that a packet of instructions cannot be interrupted. Interrupts can occur between
successive packets of instructions.

Due to the inclusion of stack-based operations in combination with the quad-word data
support, all general purpose registers, address registers, and index registers can be pushed
or popped in eleven clock cycles. The enhanced instruction set also includes new bit field
insert and extract operations, instructions to support 40-bit arithmetic,
multiply/accumulator instructions that accept both signed and unsigned operands, and
a division-assist instruction that returns a 16-bit quotient and remainder in 16 cycles.
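The 16-cycle figure for the division assist reflects one quotient bit resolved per cycle. The following generic restoring shift-subtract sketch (not the exact ZSP600 instruction semantics) shows why 16 iterations suffice for a 16-bit quotient and remainder.

```python
# Generic restoring division: one quotient bit per iteration ("cycle"),
# so a 16-bit result takes 16 iterations.

def divide_16(dividend, divisor):
    """Unsigned 16-bit restoring division: returns (quotient, remainder)."""
    rem, quot = 0, 0
    for i in range(15, -1, -1):          # resolve one bit per cycle
        rem = (rem << 1) | ((dividend >> i) & 1)  # bring down the next bit
        if rem >= divisor:               # conditional subtract
            rem -= divisor
            quot |= 1 << i               # set the quotient bit
    return quot, rem

q, r = divide_16(50_000, 7)
# q == 7142, r == 6   (7142 * 7 + 6 == 50000)
```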

POWER REDUCTION

The core implements a multi-tiered power-saving scheme. At the highest level, the core’s
power consumption can be controlled via instructions that idle the core when desired. This
feature, which is common among DSPs, allows the core to effectively shut down if it is not
being used. An interrupt is used to wake the core when needed. The second level of power
savings comes from an internal unit that dynamically controls clocks of other units on a
clock-by-clock basis.

PERFORMANCE

The pipeline of the G2 architecture has been designed to achieve 300MHz operation in a
0.13 µm technology. Performance modeling suggests an average improvement of roughly
three times that of the existing architecture.


SHARC DSP FAMILY

ARCHITECTURE OVERVIEW

The Von Neumann architecture contains a single memory and a single bus for transferring
data into and out of the central processing unit (CPU). Multiplying two numbers requires at
least three clock cycles, one to transfer each of the three numbers over the bus from the
memory to the CPU. The Harvard architecture uses separate memories for data and
program instructions, with separate buses for each. Since the buses operate independently,
program instructions and data can be fetched at the same time, improving speed over the
single-bus design.

The name SHARC® is a contraction of the longer term Super Harvard ARChitecture. The
idea is to build upon the Harvard architecture by adding features that improve throughput;
two of these additions are an instruction cache and an I/O controller. The instruction cache
improves the performance of the Harvard architecture. A handicap of the basic Harvard
design is that the data memory bus is busier than the program memory bus: when two
numbers are multiplied, two binary values must be passed over the data memory bus, while
only one binary value is passed over the program memory bus. DSP algorithms generally
spend most of their execution time in loops, which means that the same set of program
instructions will continually pass from program memory to the CPU. The Super
Harvard architecture takes advantage of this situation by including an instruction cache in
the CPU. This is a small memory that contains about 32 of the most recent program
instructions. The first time through a loop, the program instructions must be passed over
the program memory bus. On additional executions of the loop, the program instructions
can be pulled from the instruction cache. This means that all of the memory to CPU
information transfers can be accomplished in a single cycle: the sample from the input
signal comes over the data memory bus, the coefficient comes over the program memory
bus, and the program instruction comes from the instruction cache. In the jargon of the
field, this efficient transfer of data is called a “high memory access bandwidth”.
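The dual-bus-plus-cache arrangement above is what lets the FIR inner loop sustain one filter tap per cycle: the sample arrives on the data bus, the coefficient on the program memory bus, and the looped instruction comes from the cache. Functionally, each such cycle performs one multiply-accumulate:

```python
# Functional sketch of the FIR inner loop: each iteration corresponds to
# one single-cycle MAC (sample from the data bus, coefficient from the
# program memory bus, instruction from the cache).

def fir_output(samples, coeffs):
    """One FIR output sample: the sum of coefficient * sample products."""
    acc = 0
    for x, h in zip(samples, coeffs):   # each iteration ~ one MAC cycle
        acc += x * h
    return acc

y = fir_output([1, 2, 3], [4, 5, 6])
# y == 1*4 + 2*5 + 3*6 == 32
```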

The SHARC DSPs provide both serial and parallel communications ports. These are
extremely high-speed connections; the six parallel ports each provide a 40 Mbytes/second
data transfer, and when all six are used together the data transfer rate reaches an incredible
240 Mbytes/second. This type of high-speed I/O is characteristic of DSPs.


At the top of the diagram are two blocks labeled Data Address Generator (DAG), one for
each of the two memories. These control the addresses sent to the program and data
memories, specifying where the information is to be read from or written to. In SHARC
DSPs, each of the two DAGs can control eight circular buffers. This means that each DAG
holds 32 variables (four per buffer), plus the required logic. In addition, an abundance of
circular buffers
greatly simplifies DSP code generation, both for the human programmer and for high-
level language compilers, such as C. The data register section of the CPU is used in the
same way as in traditional microprocessors. In the ADSP-210XX SHARC DSPs, there are
16 general purpose registers of 40 bits each. These can hold intermediate calculations,
prepare data for the math processor, serve as a buffer for data transfer, hold flags for
program control, and so on. If needed, these registers can also be used to control loops and
counters; however, the SHARC DSPs have extra hardware registers to carry out many of
these functions.

The math processing is broken into three sections, a multiplier, an arithmetic logic unit
(ALU), and a barrel shifter. The multiplier takes the values from two registers, multiplies
them, and places the result into another register. The ALU performs addition, subtraction,
absolute value, logical operations (AND, OR, XOR, NOT), conversion between fixed and
floating point formats, and similar functions. Elementary binary operations are carried out
by the barrel shifter, such as shifting, rotating, extracting and depositing segments, and so
on. A powerful feature of the SHARC family is that the multiplier and the ALU can be
accessed in parallel. In a single clock cycle, data from registers 0-7 can be passed to the
multiplier, data from registers 8-15 can be passed to the ALU, and the two results returned
to any of the 16 registers.

Another feature is the use of shadow registers for all the CPU's key registers. These
are duplicate registers that can be switched with their counterparts in a single clock cycle.
They are used for fast context switching, the ability to handle interrupts quickly. When an
interrupt occurs in traditional microprocessors, all the internal data must be saved before
the interrupt can be handled. This usually involves pushing all of the occupied registers
onto the stack, one at a time. In comparison, an interrupt in the SHARC family is handled
by moving the internal data into the shadow registers in a single clock cycle. When the
interrupt routine is completed, the registers are just as quickly restored.
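The contrast between the two interrupt-entry strategies above can be sketched as a bank swap versus a register-by-register save. The dictionaries and register names below are illustrative only; the point is that swapping to the shadow bank is a single constant-time step.

```python
# Sketch: shadow-register context switch as a one-step bank swap.

primary = {"r0": 10, "r1": 20}   # live register state before the interrupt
shadow  = {"r0": 0,  "r1": 0}    # duplicate (shadow) bank

def enter_interrupt(active_bank, shadow_bank):
    """SHARC-style entry: the whole bank swaps at once (one 'cycle')."""
    return shadow_bank, active_bank   # the handler runs on the shadow bank

active, saved = enter_interrupt(primary, shadow)
active["r0"] = 99                 # the handler clobbers registers freely...
active, saved = saved, active     # ...and a second swap restores state
# active == {"r0": 10, "r1": 20}  -- original state back in one step
```

A traditional processor would instead push each occupied register to the stack on entry and pop it on exit, one register at a time.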


Because of its highly parallel nature, the SHARC DSP can simultaneously carry out all of
these tasks. Specifically, within a single clock cycle, it can perform a multiply, an addition,
two data moves, update two circular buffer pointers, and control the loop. There will be
extra clock cycles associated with beginning and ending the loop; however, these tasks are
also handled very efficiently. If the loop is executed more than a few times, this overhead
will be negligible. The important idea is
that the fixed point programmer must understand dozens of ways to carry out the very basic
task of multiplication. In contrast, the floating point programmer can spend his time
concentrating on the algorithm.

The SHARC family can represent numbers in 32-bit fixed point, a mode that is common in
digital audio applications. This spaces the 2^32 quantization levels uniformly over a
relatively small range, say, between -1 and 1. In comparison, floating point notation places
the 2^32 quantization levels logarithmically over a huge range, typically ±3.4×10^38. This
gives 32-bit fixed point better precision; that is, the quantization error on any one sample
will be lower. However, 32-bit floating point has a higher dynamic range, meaning there is
a greater difference between the largest number and the smallest number that can be
represented.
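This trade-off can be quantified with a short numeric sketch, assuming the formats the figures above imply: a 1.31 fixed-point format over [-1, 1) and IEEE-754 single precision with its 24-bit significand.

```python
import math

# 32-bit fixed point over [-1, 1): a uniform quantization step everywhere.
fixed_step = 2.0 / 2**32                 # == 2**-31

# 32-bit IEEE float: the step grows with magnitude (24-bit significand).
def float_step(x):
    """Spacing between adjacent 32-bit floats near a power-of-two x > 0."""
    return 2.0 ** (math.floor(math.log2(abs(x))) - 23)

# Near full scale (x ~ 1.0), fixed point is much finer...
ratio = float_step(1.0) / fixed_step     # == 256.0
# ...but float still represents values up to ~3.4e38, far beyond the
# fixed format's [-1, 1) range.
```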

To handle these high-power tasks, several DSPs can be combined into a single system. This
is called multiprocessing or parallel processing. The SHARC DSPs were designed with this
type of multiprocessing in mind, and include special features to make it as easy as possible.
For instance, no external hardware logic is required to connect the external busses of
multiple SHARC DSPs together; all of the bus arbitration logic is already contained within
each device. As an alternative, the link ports (4 bit, parallel) can be used to connect
multiple processors in various configurations.


CONCLUSION

Many different DSP architectures have evolved over the past few years, and
most of them have already marked their presence in their respective areas. Recent
developments in DSP architectures are mainly marked by the introduction of VLIW and
superscalar techniques as well as the Super Harvard Architecture. The embedded field has
also been refined by the introduction of DSP hybrid devices. The current trend is focused
on reducing power consumption while increasing performance. From the detailed analysis
of the architectures of currently available DSPs, we conclude that there is no common
platform for evaluating the performance of these devices; each of them is competitive
enough within its own product space.

