Architectures for Programmable DSPs


Dr. Deepa Kundur
Texas A&M

Reference: Chapter 4 of Singh and S. Srinivasan, Digital Signal Processing: Implementation Using DSP Microprocessors with Examples from TMS320C54XX, Brooks/Cole, Belmont, California, 2004.

Fall 2010

Basic Architectural Features


A DSP is a specialized microprocessor for the purpose of real-time DSP computing. DSP applications commonly share the following characteristics:
- Algorithms are mathematically intensive; common algorithms require many multiply and accumulate operations.
- Algorithms must run in real time; processing of a data block must complete before the next block arrives.
- Algorithms are under constant development; DSP systems should be flexible to support changes and improvements in the state of the art.

Programmable DSPs should provide the instructions that one would find in most general microprocessors. The basic instruction capabilities (provided with dedicated high-speed hardware) should include:
- arithmetic operations: add, subtract and multiply
- logic operations: AND, OR, XOR, and NOT
- multiply and accumulate (MAC) operation
- signal scaling operations before and/or after digital signal processing

Support architecture should include:
- RAM; i.e., on-chip memories for signal samples
- ROM; on-chip program memory for programs and algorithm parameters such as filter coefficients
- on-chip registers for storage of intermediate results
DSP Computational Building Blocks

- Multiplier
- Shifter
- Multiply and accumulate (MAC) unit
- Arithmetic logic unit

Multiplier

The following specifications are important when designing a multiplier:
- speed: decided by the architecture, which trades off with circuit complexity and power dissipation
- accuracy: decided by format representations (number of bits and fixed/floating point)
- dynamic range: decided by format representations


Parallel Multiplier

Advances in speed and size in VLSI technology have made hardware implementation of parallel or array multipliers possible. Parallel multipliers implement a complete multiplication of two binary numbers to generate the product within a single processor cycle!

Parallel Multiplier: Bit Expansion

Consider the multiplication of two unsigned fixed-point integer numbers $A$ and $B$, where $A$ is $m$ bits $(A_{m-1}, A_{m-2}, \dots, A_0)$ and $B$ is $n$ bits $(B_{n-1}, B_{n-2}, \dots, B_0)$:

\[ A = \sum_{i=0}^{m-1} A_i 2^i; \quad 0 \le A \le 2^m - 1, \quad A_i \in \{0, 1\} \]
\[ B = \sum_{j=0}^{n-1} B_j 2^j; \quad 0 \le B \le 2^n - 1, \quad B_j \in \{0, 1\} \]

Generally, we will require $r$ bits, where $r > \max(m, n)$, to represent the product $P = A \times B$; this is known as bit expansion.



Q: How many bits are required to represent $P = A \times B$?

Let the minimum number of bits needed to represent the range of $P$ be $r$. An $r$-bit unsigned fixed-point integer number can represent values between $0$ and $2^r - 1$. Therefore, $0 \le P \le 2^r - 1$.

\[ P_{\min} = A_{\min} \times B_{\min} = 0 \times 0 = 0 \]
\[ P_{\max} = A_{\max} \times B_{\max} = (2^m - 1)(2^n - 1) = 2^{n+m} - 2^m - 2^n + 1 \]

Rephrased Q: What is the minimum $r \in \mathbb{Z}^+$ such that $(2^{n+m} - 2^m - 2^n + 1) \le (2^r - 1)$?

A: We must find an $r$ in terms of $m$ and $n$ such that $r - 1$ bits are insufficient to represent $P_{\max}$, but $r$ bits are sufficient:

\[ \underbrace{(2^{r-1} - 1)}_{\text{max w/ } r-1 \text{ bits}} < \underbrace{(2^{n+m} - 2^m - 2^n + 1)}_{P_{\max}} \le \underbrace{(2^r - 1)}_{\text{max w/ } r \text{ bits}} \]


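Recall: we need to find the minimum $r$ such that $(2^{n+m} - 2^m - 2^n + 1) \le (2^r - 1)$.

Consider $m, n \in \mathbb{Z}^+$ and let $r = m + n$; then:

\[ (2^{n+m} \underbrace{- 2^m - 2^n + 1}_{< -1}) < (2^{n+m} - 1) \]

so $r = m + n$ bits are sufficient.

Is $r = m + n$ the minimum required to represent $P_{\max}$? Consider $m, n$ large; then:

\[ (2^{n+m} - 2^m - 2^n + 1) \approx 2^{n+m} \]
\[ (2^{m+n-1} - 1) < 2^{n+m} \approx (2^{n+m} - 2^m - 2^n + 1) \]

Exactly, $P_{\max} - (2^{m+n-1} - 1) = 2(2^{m-1} - 1)(2^{n-1} - 1)$, which is positive whenever $m, n \ge 2$.

Therefore, for $r = n + m$ and large $m, n$ (of practical value):

\[ (2^{r-1} - 1) < (2^{n+m} - 2^m - 2^n + 1) \le (2^r - 1) \]

Thus, $r = n + m$ is the minimum number of bits required to fully represent the range of $P$.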
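The bit-expansion result can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `product_bits` is an illustrative assumption. It computes $P_{\max}$ directly and asks Python how many bits it occupies.

```python
def product_bits(m, n):
    """Minimum number of bits needed to hold the largest product of an
    m-bit and an n-bit unsigned integer."""
    p_max = (2**m - 1) * (2**n - 1)   # P_max = A_max * B_max
    return p_max.bit_length()         # smallest r with P_max <= 2**r - 1


for m, n in [(4, 4), (8, 8), (16, 16), (1, 4)]:
    print(f"m={m}, n={n}: r={product_bits(m, n)}")
```

For operand widths of two bits or more the result is always m + n; the (1, 4) case shows the degenerate situation where a one-bit operand needs only m + n - 1 bits.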

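Parallel Multiplier: n = m = 4

Example: $m = n = 4$; note: $r = m + n = 4 + 4 = 8$.

\[
\begin{aligned}
P = A \times B &= \left(\sum_{i=0}^{4-1} A_i 2^i\right)\left(\sum_{i=0}^{4-1} B_i 2^i\right) \\
&= (A_0 2^0 + A_1 2^1 + A_2 2^2 + A_3 2^3)(B_0 2^0 + B_1 2^1 + B_2 2^2 + B_3 2^3) \\
&= A_0 B_0 2^0 + (A_0 B_1 + A_1 B_0) 2^1 + (A_0 B_2 + A_1 B_1 + A_2 B_0) 2^2 \\
&\quad + (A_0 B_3 + A_1 B_2 + A_2 B_1 + A_3 B_0) 2^3 + (A_1 B_3 + A_2 B_2 + A_3 B_1) 2^4 \\
&\quad + (A_2 B_3 + A_3 B_2) 2^5 + (A_3 B_3) 2^6 \\
&= P_0 2^0 + P_1 2^1 + P_2 2^2 + P_3 2^3 + P_4 2^4 + P_5 2^5 + P_6 2^6 + P_7 2^7
\end{aligned}
\]

Recall base-10 addition. Example: (3785 + 6584)

\[
\begin{array}{rrrrrl}
  & 1 & 1 & 0 & 0 & \text{CARRY-OVER} \\
  & 3 & 7 & 8 & 5 & \\
+ & 6 & 5 & 8 & 4 & \\
\hline
1 & 0 & 3 & 6 & 9 &
\end{array}
\]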


Need to compensate for carry-over bits!

\[
\begin{aligned}
P_0 &= A_0 B_0 \\
P_1 &= A_0 B_1 + A_1 B_0 + \text{PREV CARRY OVER} \\
P_2 &= A_0 B_2 + A_1 B_1 + A_2 B_0 + \text{PREV CARRY OVER} \\
&\;\;\vdots \\
P_6 &= A_3 B_3 + \text{PREV CARRY OVER} \\
P_7 &= \text{PREV CARRY OVER}
\end{aligned}
\]

Each $P_k$ keeps only the least significant bit of its column sum; the remainder is carried over into the next column.
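The column-by-column expansion with carry-over can be sketched in software as follows. This is an illustrative model only, not the slides' hardware structure; the helper names `multiply_columns` and `bits_to_int` are assumptions introduced here.

```python
def multiply_columns(a, b, m=4, n=4):
    """Return the product bits [P0, P1, ..., P(m+n-1)], least significant first."""
    a_bits = [(a >> i) & 1 for i in range(m)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    p_bits = []
    carry = 0
    for k in range(m + n):
        # Sum all partial products A_i * B_j with i + j == k, plus the carry-in.
        column = carry + sum(
            a_bits[i] * b_bits[k - i]
            for i in range(m)
            if 0 <= k - i < n
        )
        p_bits.append(column & 1)   # P_k is the low bit of the column sum
        carry = column >> 1         # the rest carries over into column k + 1
    return p_bits


def bits_to_int(bits):
    """Reassemble an integer from bits given least significant first."""
    return sum(bit << k for k, bit in enumerate(bits))


print(multiply_columns(13, 11))               # bits of 13 * 11, LSB first
print(bits_to_int(multiply_columns(13, 11)))  # 143
```

The last column (k = 7) contains no partial products, so $P_7$ is purely the carry from column 6, matching the final line of the expansion above.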


Parallel Multiplier: Braun Multiplier


Example: structure of a 4 × 4 Braun multiplier; i.e., m = n = 4

[Figure: 4 × 4 Braun multiplier array. Each adder cell (+) takes two operands and a carry-in and produces a sum and a carry-out.]

Speed: for a parallel multiplier, the multiplication time is only the longest path delay through the gates and adders (well within one processor cycle).

Note: additional hardware before and after the Braun multiplier is required to deal with signed numbers represented in two's complement form.


Parallel Multiplier

Bus widths: a straightforward implementation requires two buses of width n bits and a third bus of width 2n bits, which is expensive to implement.

[Figure: multiplier block with its operand and product buses.]

To avoid complex bus implementations:
- the program bus can be reused after the multiplication instruction is fetched
- the bus for X can be used for Z by discarding the lower n bits of Z or by saving Z at two successive memory locations

Shifter

Required to scale down or scale up operands and results to avoid errors resulting from overflows and underflows during computations.

Q: When is scaling useful?
- Computing the sum of N numbers, each represented by n bits; the overall sum will have n + log2 N bits
  - to avoid overflow, can scale down each of the N numbers by log2 N bits before conducting the sum
  - to obtain the actual sum, scale up the result by log2 N bits when required
  - trade-off between overflow prevention and accuracy


Q: When is scaling useful?
- Conducting floating-point additions, where each operand should be normalized to the same exponent prior to addition
  - one of the operands can be shifted by the required number of bit positions to equalize the exponents

Barrel Shifter

Shifting in conventional microprocessors is implemented by an operation similar to that of a shift register, taking one clock cycle for every single-bit shift. Many shifts are often required, creating a latency of multiple clock cycles.

Barrel shifters allow shifting by multiple bit positions within one clock cycle, reducing latency for real-time DSP computations.

[Figure: shifter block with an input, an output, and control inputs selecting left/right and the number of bit positions for the shift.]


Implementation of a 4-bit shift-right barrel shifter:

[Figure: switch matrix connecting the input bits to the output bits; a switch closes when its control signal is ON. Note: only one control signal can be on at a time.]

Input (A3 A2 A1 A0)   Shift (Switch)   Output (B3 B2 B1 B0)
A3 A2 A1 A0           0 (S0)           A3 A2 A1 A0
A3 A2 A1 A0           1 (S1)           A3 A3 A2 A1
A3 A2 A1 A0           2 (S2)           A3 A3 A3 A2
A3 A2 A1 A0           3 (S3)           A3 A3 A3 A3

- logic circuit takes a fraction of a clock cycle to execute
- majority of the delay is in decoding the control lines and setting up the path from the input lines to the output lines
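The input/output mapping of the 4-bit shift-right barrel shifter can be reproduced with a small behavioural model. This sketch only mimics the table (vacated positions are filled with copies of A3); it does not model the switch matrix or its timing, and the function name is an assumption.

```python
def barrel_shift_right(bits, shift):
    """bits = [A3, A2, A1, A0] (most significant first).
    Returns [B3, B2, B1, B0] after shifting right by `shift` positions,
    filling vacated positions with copies of A3."""
    width = len(bits)
    if not 0 <= shift < width:
        raise ValueError("shift must be between 0 and width - 1")
    # Output position i takes input position i - shift, or A3 if that runs off the top.
    return [bits[max(i - shift, 0)] for i in range(width)]


labels = ["A3", "A2", "A1", "A0"]
for s in range(4):
    print(s, barrel_shift_right(labels, s))
```

Every shift amount is produced by a single selection step, which is the behaviour that lets the hardware finish within one clock cycle.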


Multiply and Accumulate

- multiply and accumulate (MAC) unit performs the accumulation of a series of successively generated products
- common operation in DSP applications such as filtering


MAC Unit Configuration

[Figure: MAC unit. A multiplier feeds a product register, whose output goes to an ADD/SUB unit that updates the accumulator.]

- can implement A + BC operations
- clearing the accumulator at the right time (e.g., as an initialization to zero) provides the appropriate sum of products
- multiplication and accumulation each require a separate instruction execution cycle; however, they can work in parallel
  - when the multiplier is working on the current product, the accumulator works on adding the previous product


- if N products are to be accumulated, N − 1 multiplies can overlap with the accumulation
  - during the first multiply, the accumulator is idle
  - during the last accumulate, the multiplier is idle since all N products have been computed
- to compute a MAC for N products, N + 1 instruction execution cycles are required
- for N ≫ 1, this works out to almost one MAC operation per instruction cycle

MAC Unit

Q: If a sum of 256 products is to be computed using a pipelined MAC unit and if the MAC execution time of the unit is 100 ns, what is the total time required to compute the operation?

A: For 256 MAC operations, need 257 execution cycles. Total time required = 257 × 100 × 10⁻⁹ s = 25.7 μs
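The N + 1 cycle count of a pipelined MAC unit can be traced with a simple cycle-by-cycle simulation. This is a sketch under the assumption of a two-stage pipeline (multiply, then accumulate); the function name `pipelined_mac` and the test data are illustrative.

```python
def pipelined_mac(xs, hs):
    """Simulate a two-stage MAC: a product is formed in one cycle and
    added to the accumulator in the next. Returns (sum, cycles)."""
    assert len(xs) == len(hs)
    acc = 0                    # accumulator cleared before starting
    product_reg = None         # product register, empty at first
    cycles = 0
    for k in range(len(xs) + 1):
        # Accumulator stage: add the product formed in the previous cycle
        # (idle during the very first cycle).
        if product_reg is not None:
            acc += product_reg
        # Multiplier stage: form the current product (idle during the last cycle).
        product_reg = xs[k] * hs[k] if k < len(xs) else None
        cycles += 1
    return acc, cycles


xs = list(range(256))
hs = [2] * 256
total, cycles = pipelined_mac(xs, hs)
print(cycles)                  # 257 cycles for 256 products
print(cycles * 100 / 1000)     # total time in microseconds at 100 ns per cycle
```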


MAC Unit Overflow and Underflow

[Figure: MAC unit with multiplier, product register, ADD/SUB unit and accumulator.]

Strategies to address overflow or underflow:
- accumulator guard bits (i.e., extra bits for the accumulator) added; implication: size of the ADD/SUB unit will increase
- barrel shifters at the input and output of the MAC unit needed to normalize values
- saturation logic used to assign the largest (smallest) values to the accumulator when overflow (underflow) occurs

Arithmetic and Logic Unit

The arithmetic logic unit (ALU) carries out additional arithmetic and logic operations required for a DSP:
- add, subtract, increment, decrement, negate
- AND, OR, NOT, XOR, compare
- shift, multiply (uncommon to general microprocessors)

with additional features common to general microprocessors:
- status flags for sign, zero, carry and overflow
- overflow management via saturation logic
- register files for storing intermediate results
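Saturation logic and accumulator guard bits can be compared with a small numerical sketch. It assumes signed two's complement values and a 16-bit data width with 4 guard bits; these widths, the function names and the product values are illustrative assumptions, not taken from the slides.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed two's complement word of `bits` bits."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def accumulate(products, acc_bits):
    """Accumulate products in an acc_bits-wide accumulator with saturation logic."""
    acc = 0
    for p in products:
        acc = saturate(acc + p, acc_bits)
    return acc


products = [30000, 30000, -40000]        # true sum is 20000
print(accumulate(products, 16))          # saturates mid-way at 32767, ends at -7233
print(accumulate(products, 16 + 4))      # 4 guard bits absorb the peak: 20000
```

Saturation limits the damage of an overflow, while guard bits let an intermediate overflow pass harmlessly when the final result fits the data width.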


Bus Architecture and Memory

Bus architecture and memory play a significant role in dictating the cost, speed and size of DSPs. Common architectures include the von Neumann and Harvard architectures.

von Neumann Architecture

[Figure: processor connected to a single memory through one address bus and one data bus.]

- program and data reside in the same memory
- a single bus is used to access both

Implications:
- slows down program execution since the processor has to wait for data even after the instruction is made available


Harvard Architecture

[Figure: processor connected to a program memory and a data memory, each through its own address and data buses.]

- program and data reside in separate memories with two independent buses

Implications:
- faster program execution because of simultaneous memory access capability

What if there are two operands?

Another Possible DSP Bus Structure

[Figure: processor connected to a program memory and two data memories, each through its own address and data buses.]

- suited for operations with two operands (e.g., multiplication)

Implications:
- requires additional hardware and interconnections, increasing cost
  - hardware complexity-speed trade-off needed!


On-Chip Memory

- help in running the DSP algorithms faster than when memory is off-chip
  - dedicated address and data buses are available
- speed: on-chip memories should match the speeds of the ALU operations
- size: the more chip area memory takes, the less area is available for other DSP functions

Wish List:
- all memory should reside on-chip!
- separate on-chip program and data spaces
- on-chip data space partitioned further into areas for data samples, coefficients and results

Implication: the area dedicated to memory will be so large that basic DSP functions may not be implementable on-chip!


Practical Organization of On-Chip Memory

for DSP algorithms requiring repeated executions of a single instruction (e.g., MAC):
- instructions can be placed in external memory and, once fetched, can be placed in the instruction cache
- the result is normally saved at the end, so external memory can be employed
- only the two data memories for operands need to be placed on-chip

using dual-access on-chip memories (can be accessed twice per instruction cycle):
- can get away with only two on-chip memories for instructions such as multiply
- instruction fetch + two operand fetches + memory access to save the result can all be done in one clock cycle
- can configure on-chip memory for different uses at different times


Data Addressing

Data Addressing Capabilities

- an efficient way of accessing data (signal samples and filter coefficients) can significantly improve implementation performance
- flexible ways to access data help in writing efficient programs
- data addressing modes enhance DSP implementations

DSP Addressing Modes

- immediate
- register
- direct
- indirect
- special addressing modes:
  - circular
  - bit-reversed


Immediate Addressing Mode

- operand is explicitly known in value
- capability to include data as part of the instruction

Instruction   Operation
ADD #imm      #imm + A → A

#imm: value represented by imm (a fixed number, such as a filter coefficient, that is known ahead of time)
A: accumulator register

Register Addressing Mode

- operand is always in processor register reg
- capability to reference data through its register

Instruction   Operation
ADD reg       reg + A → A

reg: processor register that provides the operand
A: accumulator register


Direct Addressing Mode

- operand is always in memory location mem
- capability to reference data by giving its memory location directly

Instruction   Operation
ADD mem       mem + A → A

mem: specified memory location that provides the operand (e.g., memory could hold an input signal value)
A: accumulator register

Indirect Addressing Mode

- operand memory location is variable
- operand address is given by the value of register addrreg
- operand accessed using pointer addrreg

Instruction   Operation
ADD addrreg   (operand at the address held in addrreg) + A → A

addrreg: needs to be loaded with the operand's memory location before use
A: accumulator register
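The four basic addressing modes differ only in where the ADD instruction finds its operand. The toy model below makes that explicit; the memory contents, register values and the `add` helper are hypothetical and do not correspond to any real DSP instruction set.

```python
memory = {0x10: 7, 0x11: 9}                 # hypothetical data memory: address -> value
registers = {"reg": 5, "addrreg": 0x11}     # hypothetical processor registers


def add(acc, mode, operand):
    """Return acc + operand value, where the value is located according to `mode`."""
    if mode == "immediate":                 # ADD #imm: value is part of the instruction
        value = operand
    elif mode == "register":                # ADD reg: value is held in a register
        value = registers[operand]
    elif mode == "direct":                  # ADD mem: instruction names the memory location
        value = memory[operand]
    elif mode == "indirect":                # ADD addrreg: register holds the operand's address
        value = memory[registers[operand]]
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    return acc + value


A = 0                                       # accumulator
A = add(A, "immediate", 3)                  # 0 + 3
A = add(A, "register", "reg")               # + 5
A = add(A, "direct", 0x10)                  # + memory[0x10] = 7
A = add(A, "indirect", "addrreg")           # + memory[0x11] = 9
print(A)                                    # 24
```

Changing `registers["addrreg"]` between calls lets the same indirect instruction step through a block of samples, which is why indirect addressing suits loops over signal data.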


Special Addressing Modes

provided to implement real-time digital signal processing and FFT algorithms

- Circular Addressing Mode: a circular buffer allows one to handle a continuous stream of incoming data samples; once the end of the buffer is reached, samples are added to the beginning again
- Bit-Reversed Addressing Mode: the address generation unit can be provided with the capability of providing bit-reversed indices

Bit-Reversed Addressing:

Input Index   Output Index
000 = 0       000 = 0
001 = 1       100 = 4
010 = 2       010 = 2
011 = 3       110 = 6
100 = 4       001 = 1
101 = 5       101 = 5
110 = 6       011 = 3
111 = 7       111 = 7
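Both special addressing modes amount to simple index arithmetic, which the following sketch performs in software. On a DSP the address generation unit would do this in hardware; the function names and the buffer start address are assumptions made for illustration.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index into result
        index >>= 1
    return result


def circular_next(pointer, start, length):
    """Advance a pointer by one inside a circular buffer occupying
    addresses start .. start + length - 1, wrapping at the end."""
    return start + (pointer - start + 1) % length


print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]

pointer = 100                                  # hypothetical buffer start address
trace = []
for _ in range(6):
    trace.append(pointer)
    pointer = circular_next(pointer, start=100, length=4)
print(trace)                                   # [100, 101, 102, 103, 100, 101]
```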


Speed Issues

fast execution of algorithms is the most important requirement of a DSP architecture
- high-speed instruction operation
- large throughputs

facilitated by advances in VLSI technology and design innovations

Hardware Architecture

- dedicated hardware support for multiplications, scaling, loops and repeats, and special addressing modes is essential for fast DSP implementations
- Harvard architecture significantly improves program execution time compared to von Neumann
- on-chip memories aid speed of program execution considerably


Parallelism

Parallelism means: provision of function units, which may operate in parallel to increase throughput
- multiple memories
- different ALUs for computations on data and address computation

advantage: algorithms can perform more than one operation at a time, increasing speed
disadvantage: complex hardware required to control units, and housekeeping required to make sure instructions and data can be fetched simultaneously

Pipelining

architectural feature in which an instruction is broken into a number of steps
- a separate unit performs each step; the units operate at the same time, usually working on different stages of data

advantage: if repeated use of the instruction is required, then after an initial latency the output throughput becomes one instruction per unit time
disadvantages: pipeline latency, having to break instructions up into equally-timed units


Pipelining Example

Five steps:
- Step 1: instruction fetch
- Step 2: instruction decode
- Step 3: operand fetch
- Step 4: execute
- Step 5: save

Time Slot   Step 1   Step 2   Step 3   Step 4   Step 5   Result
t0          Inst 1
t1          Inst 2   Inst 1
t2          Inst 3   Inst 2   Inst 1
t3          Inst 4   Inst 3   Inst 2   Inst 1
t4          Inst 5   Inst 4   Inst 3   Inst 2   Inst 1   Inst 1 complete
t5          Inst 6   Inst 5   Inst 4   Inst 3   Inst 2   Inst 2 complete
...         ...      ...      ...      ...      ...      ...

Simplifying assumption: all steps take equal time

System Level Parallelism and Pipelining

Consider an 8-tap FIR filter:

\[ y(n) = \sum_{i=0}^{7} h(i)\, x(n-i) = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + \dots + h(6)x(n-6) + h(7)x(n-7) \]

can be implemented in many ways depending on the number of multipliers and accumulators available
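As a reference point for the implementations that follow, the 8-tap FIR sum can be written as a plain multiply-accumulate loop. This is a minimal software sketch; the coefficient values and function names are made up for illustration.

```python
def fir_output(h, x_history):
    """y(n) = sum over i of h(i) * x(n - i), where x_history[i] holds x(n - i)."""
    acc = 0                                  # clear the accumulator first
    for coeff, sample in zip(h, x_history):
        acc += coeff * sample                # one MAC operation per tap
    return acc


def fir_filter(h, x):
    """Filter the sequence x, assuming all samples before the start are zero."""
    history = [0] * len(h)                   # delay line: x(n), x(n-1), ..., x(n-7)
    y = []
    for sample in x:
        history = [sample] + history[:-1]    # shift the new sample into the delay line
        y.append(fir_output(h, history))
    return y


h = [1, 2, 3, 4, 5, 6, 7, 8]                 # hypothetical filter coefficients
print(fir_filter(h, [1, 0, 0, 0, 0, 0, 0, 0, 0]))   # [1, 2, 3, 4, 5, 6, 7, 8, 0]
```

Feeding in a unit impulse returns the coefficients themselves, a quick sanity check that each tap multiplies the correctly delayed sample.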


Implementation Using a Single MAC Unit

[Figure: delay line holding x(n), x(n-1), ..., x(n-7) with 8T delays between taps; multiplexers select one sample and one coefficient h(0), ..., h(7) at a time for a single MAC unit, which produces y(n).]

- T: time taken to compute one product term and add it to the accumulator
- a new input sample can be processed every 8T time units

Pipelined Implementation: 8 Multipliers and 8 Accumulators

[Figure: delay line holding x(n), x(n-1), ..., x(n-7) with T delays between taps; each tap is multiplied by its coefficient h(0), ..., h(7) in its own MAC unit. The eight MAC units are chained, starting from an input of 0, and the last one produces y(n).]

- T: time taken to compute one product term and add it to the accumulator
- a new input sample can be processed every T time units (8 times faster!)

Parallel Implementation: Two MAC Units

[Figure: delay line holding x(n), x(n-1), ..., x(n-7) with 4T delays between taps; multiplexers route the samples and the coefficients h(0), ..., h(7) to two MAC units, whose outputs are added (+) to produce y(n).]

- T: time taken to compute one product term and add it to the accumulator
- a new input sample can be processed every 4T time units
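The two-MAC arrangement can be sketched by splitting the taps into two halves, accumulating each half separately and adding the partial sums. The tap assignment (first half to one unit, second half to the other) and the timing helper below are assumptions for illustration; the helper ignores the time of the final adder.

```python
def fir_two_mac(h, x_history):
    """One FIR output computed with the taps shared between two MAC units."""
    half = len(h) // 2
    acc0 = sum(c * s for c, s in zip(h[:half], x_history[:half]))   # MAC unit 0
    acc1 = sum(c * s for c, s in zip(h[half:], x_history[half:]))   # MAC unit 1
    return acc0 + acc1                                              # final adder


def sample_period(taps, mac_units, T=1):
    """Time between input samples when the taps are shared evenly
    among the MAC units, each tap costing T."""
    return -(-taps // mac_units) * T          # ceil(taps / mac_units) * T


h = [1, 2, 3, 4, 5, 6, 7, 8]                  # hypothetical coefficients
x_history = [3, 1, 4, 1, 5, 9, 2, 6]          # hypothetical x(n), ..., x(n-7)
print(fir_two_mac(h, x_history))              # same value as the single-MAC sum
print([sample_period(8, units) for units in (1, 2, 8)])   # [8, 4, 1]
```

The three periods correspond to the 8T, 4T and T figures quoted for the single-MAC, two-MAC and eight-MAC implementations.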