Lecture # 05

Dr. Rehan Hafiz

<rehan.hafiz@seecs.edu.pk>

Course Information

Course website: http://lms.nust.edu.pk/

Slides from the Advanced Digital System Design (Fall 2011) course: http://www.scribd.com/collections/3409162/Digital-System-Design-Lectures

Acknowledgement: Material from the following sources has been consulted/used in these slides:

1. [SHO] Digital Design of Signal Processing Systems, by Dr Shoab A Khan
2. [SAM] Samir Palnitkar, "Verilog HDL", Prentice Hall, ISBN: 0130449113, latest edition
3. [STV] Advanced FPGA Design, Steve Kilts
4. [PAR] VLSI Signal Processing Systems, Parhi
5. Some slides from: [ECEN 248 Dr Shi]

Material/slides from these slides CAN be used with the following citing reference: Dr. Rehan Hafiz: Advanced Digital System Design 2012. Creative Commons Attribution-ShareAlike 3.0 Unported License.

Lectures: Tuesday (1730-1920), Thursday (1830-1920)

Contact: By appointment/Email

Office: VISpro Lab above SEECS Library

• Introduction: Course Overview, Design Space Exploration, Digital Design Methodology
• Understanding FPGAs (Xilinx FPGA Architecture)
• Verilog Introduction: Combinational Building Blocks in Verilog
• Sequential Common Structures in Verilog (LFSR/CRC + Counters + RAMs)
• Synthesis of Blocking/Non-Blocking Statements
• Design Partitioning & Micro Architectures
• Controllers, Micro-Coded Controllers
• Understanding Throughput, Latency & Timing; Architecting Speed/Area in Digital System Design
• Representation of Non-Recursive DFGs & Optimizations for Non-Recursive DFGs
• FIR Implementations + Pipelining & Parallelism in Non-Recursive DFGs
• **Cross-Clock Domain Issues & RESET circuits**
• Arithmetic Operations: Review of Fixed-Point Representation
• Adders & Fast Adders, Multi-Operand Addition
• Multiplication, Multiplication by Constants + BOOTH Multipliers
• CORDIC (sine, cosine, magnitude, division, etc.)
• CORDIC Implementation in HW
• DFG Representation of Recursive DSP Algorithms
• Iteration Bound
• Retiming, Unfolding, Look-Ahead Transformations
• Hybrid Architectures / Kahn Process Networks

This Lecture

Understanding & Optimizing: Speed, Throughput, Timings

Reading Assignment: Chapter 1, Advanced FPGA Design, by Steve Kilts

Speed

Throughput
• Amount of data that is processed per clock cycle
• Metric: bits/sec

Latency
• Time between data input and processed data output
• Metric: number of cycles, or time

Timing
• Logic delays between sequential elements
• Metric: clock period or frequency

High Throughput Design

**A high-throughput design**

• Is more concerned with the steady-state data rate
• Is less concerned about the time any specific piece of data requires to propagate through the design (latency)

Technique: Pipelining

Throughput

[Figure: top-level entity with an 8-bit input and an 8-bit output; three D flip-flops separated by two blocks of combinational logic, clocked at 100 MHz. Timing diagram: input(0), input(1), input(2) and output(0), output(1), with 1 cycle between output samples.]

Throughput = (bits per output sample) / (time between consecutive output samples)

• Bits per output sample: in this example, 8 bits per output sample
• **Time between consecutive output samples: clock cycles between output(n) and output(n+1).** It can be measured in clock cycles, then translated to time. In this example, the time between consecutive output samples = 1 clock cycle = 10 ns
• Throughput = (8 bits per output sample) / (10 ns) = 0.8 bits/ns = 800 Mbits/s
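The throughput arithmetic of this slide can be sketched in a few lines of Python. The function name `throughput_bps` and its parameters are illustrative assumptions and do not come from the slides; the numbers are the slide's own (8-bit samples, 100 MHz clock).

```python
def throughput_bps(bits_per_sample, cycles_between_samples, clock_hz):
    """Throughput = bits per output sample / time between output samples."""
    time_between_samples = cycles_between_samples / clock_hz  # seconds
    return bits_per_sample / time_between_samples


# Slide example: one 8-bit output sample every clock cycle at 100 MHz.
every_cycle = throughput_bps(8, 1, 100e6)
# Same datapath, but a new output only every third cycle.
every_third = throughput_bps(8, 3, 100e6)

print(f"{every_cycle / 1e6:.0f} Mbits/s, {every_third / 1e6:.1f} Mbits/s")
```

Expressing the result in bits/sec rather than bits/cycle keeps designs with different clock periods comparable.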

An Example... [KIL]

Software code:

    XPower = 1;
    for (i = 0; i < 3; i++)
        XPower = X * XPower;

Digital implementation:

• Same register and computational resources are reused
• No new computations can begin until the previous computation has completed

Throughput: 8/3 = 2.7 bits/cycle (ideally stated in bits/sec)
Latency: 3 clk cycles
Timing: 1 multiplier delay

Coding an iterative algorithm <with dependency>

    module power3(
        output reg [7:0] XPower,
        output finished,
        input [7:0] X,
        input clk, start);

        reg [7:0] ncount;

        // The computation is complete once the iteration counter reaches zero
        assign finished = (ncount == 0);

        always @(posedge clk)
            if (start) begin
                // Load the operand and start counting down
                XPower <= X;
                ncount <= 2;
            end
            else if (!finished) begin
                // Reuse the same multiplier on every iteration
                ncount <= ncount - 1;
                XPower <= XPower * X;
            end
    endmodule

Loop Unrolling

    XPower = 1;
    for (i = 0; i < 3; i++)
        XPower = X * XPower;

[Figure: unrolled pipeline with registers X1, X2, XPower1 and XPower2]

Coding

    module power3(
        output reg [7:0] XPower,
        input clk,
        input [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X1, X2;

        always @(posedge clk) begin
            // Pipeline stage 1
            X1 <= X;
            XPower1 <= X;
            // Pipeline stage 2
            X2 <= X1;
            XPower2 <= XPower1 * X1;
            // Pipeline stage 3
            XPower <= XPower2 * X2;
        end
    endmodule

[Figure: pipelined datapath with registers X1, X2, XPower1 and XPower2]

Iterative implementation:

• **Throughput: 8/3 = 2.7 bits/cycle**
• Latency: 3 clk cycles
• Timing: 1 multiplier delay

Pipelined (unrolled) implementation, with registers X1, X2, XPower1 and XPower2:

• **Throughput: 8/1 = 8 bits/cycle**
• Latency: 3 clk cycles
• Timing: 1 multiplier delay

In general, if an algorithm requiring n iterative loops is “unrolled,” the pipelined implementation will exhibit a throughput performance increase of a factor of n. The penalty for unrolling an iterative loop is a proportional increase in area.
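The factor-of-n claim can be checked with a small cycle-counting model of the two X-cubed designs. This is a behavioural sketch in Python, not a replacement for the Verilog; the function names and the flush handling are assumptions made for illustration. Registers are updated from their old values, which mimics non-blocking assignment.

```python
def iterative_power3(samples):
    """One shared multiplier: load, multiply, multiply = 3 cycles per sample."""
    results, cycles = [], 0
    for x in samples:
        xpower = x                      # cycle 1: start, load X
        xpower = (xpower * x) & 0xFF    # cycle 2
        xpower = (xpower * x) & 0xFF    # cycle 3
        cycles += 3
        results.append(xpower)
    return results, cycles


def pipelined_power3(samples):
    """Unrolled pipeline: a new sample enters on every clock edge."""
    x1 = x2 = xp1 = xp2 = None          # pipeline registers, initially empty
    results, cycles = [], 0
    for x in list(samples) + [None, None]:   # two extra edges flush the pipe
        # All right-hand sides use the OLD register values.
        out = None if xp2 is None else (xp2 * x2) & 0xFF         # stage 3
        x2, xp2 = x1, (None if xp1 is None else (xp1 * x1) & 0xFF)  # stage 2
        x1, xp1 = x, x                                            # stage 1
        cycles += 1
        if out is not None:
            results.append(out)
    return results, cycles


data = [1, 2, 3, 4]
print(iterative_power3(data))   # same results, 3 cycles per sample
print(pipelined_power3(data))   # same results, about 1 cycle per sample
```

For a long input stream the pipelined model approaches one result per cycle, while the iterative model stays at one result per three cycles; the price is two extra multipliers and the additional registers.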

Decreasing Latency

A low-latency design is one that passes the data from the input to the output as quickly as possible by minimizing the intermediate processing delays.

Techniques

• Parallelism
• Removal of pipelining, and logical shortcuts that may reduce the throughput or the maximum clock speed of a design

Latency

[Figure: top-level entity with an 8-bit input and an 8-bit output; three D flip-flops separated by two blocks of combinational logic, clocked at 100 MHz. Timing diagram: input(0), input(1), input(2) and output(0), output(1).]

**Latency is the time between input(n) and output(n)**

• i.e. the time it takes from the first input to the first output, from the second input to the second output, etc.
• Also called input-to-output latency

**Count the number of rising edges after input**

• In this example, there are 2 rising edges, so the latency is 2 cycles

**Latency is measured in clock cycles (then translated to seconds)**

• In this example, if the clock period is 10 ns, the latency is 20 ns

Removal of pipelining

Throughput: 8/1 = 8 bits/cycle
Latency: 1 cycle
Timing: 2 multiplier delays

Penalty

• Penalty in timing
• Previous implementations could theoretically run the system clock period close to the delay of a single multiplier
• For the low-latency implementation, the clock period must be at least two multiplier delays

    module power3(
        output [7:0] XPower,
        input [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X1, X2;

        assign XPower = XPower2 * X2;

        always @* begin
            X1 = X;
            XPower1 = X;
        end

        always @* begin
            X2 = X1;
            XPower2 = XPower1 * X1;
        end
    endmodule

Understanding Timing

Timings

• Combinational logic: logic & routing delay
• Flip-flops: propagation delay tCLK2Q
• Some constraints: setup time, hold time

**Timing: Combinational Logic**

Classification: tLOGIC + trouting

• tLOGIC: propagation delay through logic components (e.g. LUTs)
• trouting: propagation delay through routing (wires)

[Figure: input and output waveforms of a combinational block, marking tcd and tLOGIC]

• The output remains unchanged for a time period equal to the contamination delay, tcd
• The new output value is guaranteed to be valid after a time period equal to the propagation delay, tLOGIC

**Timing: Flip Flops (Sequential Logic)**

[Timing diagram: clk, D and Q waveforms marking tS, tH and tCLK2Q. Input D must remain stable during the interval {tS + tH} around the rising clock edge; D can freely change outside this interval.]

• Setup time tS: minimum time the input has to be stable before the rising edge of the clock
• Hold time tH: minimum time the input has to be stable after the rising edge of the clock
• Propagation delay tCLK2Q: time to propagate input to output after the rising edge of the clock

**Timing: Path Delay**

A path is defined as a path from the output of one flip-flop (the launch flip-flop), through combinational logic, to the input of another flip-flop (the capture flip-flop).

[Timing diagram: one clock period T divided into tCLK2Q, tLOGIC, tROUTING and tS, which together make up tpath]

To avoid a setup time violation: tCLK2Q + tLOGIC + tROUTING < (T - tS)

Rewriting the equation: tCLK2Q + tLOGIC + tROUTING + tS < T

**Critical Path Delay**

• Path delay: tpath = tCLK2Q + tLOGIC + tROUTE + tS
• The largest of all the path delays in a circuit is called the critical path delay (tcritical_path)
• The associated path is called the critical path
• There can be millions of paths in a circuit; timing analysis CAD tools help to locate the critical path

Critical Path

[Figure: circuit with four register-to-register paths (PATH 1 to PATH 4), annotated with logic delays and with tCLK2Q = 0.4 ns and tS = 0.2 ns on the flip-flops]

• Path delays: tpath1 = 2.2 ns, tpath2 = 1.1 ns, tpath3 = 3.0 ns, tpath4 = 1.4 ns
• The critical path is path 3; the critical path delay is tcritical_path = tpath3 = 3.0 ns

**Critical Path: Example 2 (Max Frequency)**

[Figure: launch flip-flop (tCLK2Q = 0.4 ns), wire 1 (0.4 ns), combinational gate A (2.0 ns), wire 2 (0.2 ns), combinational gate B (1.2 ns), wire 3 (0.8 ns), capture flip-flop (tS = 0.2 ns). The timing diagram fits tCLK2Q, twire1, tgateA, twire2, tgateB, twire3 and tS into one clock period T.]

• **Critical path delay = tcritical_path = 5.2 ns**
• **The minimum period for this circuit to work is Tmin = 5.2 ns**
• If the clock period is smaller than Tmin, you will get a timing violation and the circuit will not operate correctly!
• This kind of timing violation is called a "setup time" violation (also known as a critical path violation)
• Maximum clock frequency = 1/Tmin = 192 MHz
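The path-delay sum and the resulting maximum frequency can be reproduced with a short calculation. The helper name `path_delay_ns` is an assumption for illustration; the delay values are those of Example 2 (tCLK2Q = 0.4 ns, wires 0.4/0.2/0.8 ns, gates 2.0/1.2 ns, tS = 0.2 ns).

```python
def path_delay_ns(t_clk2q, logic_and_routing, t_setup):
    """tpath = tCLK2Q + sum of logic and routing delays + tS (all in ns)."""
    return t_clk2q + sum(logic_and_routing) + t_setup


# wire1, gate A, wire2, gate B, wire3
t_min = path_delay_ns(0.4, [0.4, 2.0, 0.2, 1.2, 0.8], 0.2)
f_max_mhz = 1e3 / t_min   # a period in ns gives a frequency in MHz

print(f"Tmin = {t_min:.1f} ns, fmax = {f_max_mhz:.0f} MHz")
```

With several paths, the same function would be applied to each one and the largest result taken as the critical path delay.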

**Review – From Last Lecture**

Throughput
• Amount of data that is processed per clock cycle, OR the aggregate/average data processing rate
• Ideally, the average data rate IN to your system should be equal to the average data rate OUT of your system, OR you will miss data!
• Improved by: pipelining & loop unrolling
• Streaming applications are more concerned with throughput
• Metric: bits/sec (do not use bits/cycle as in the example that I described in class)

Latency
• Time between data input and processed data output
• Improved by parallelising the system
• Response time: important for time-critical signals, e.g. an interrupt-triggered operation processing an external signal of an avionics system
• Metric: time in seconds

Timing
• **Logic delays between sequential elements**
• Metric: clock period or frequency
• [tCLK2Q + tLOGIC + trouting + tS] < T

Normally a compromise!

Timing

• **Logic delays between sequential elements**
• Metric: clock period or frequency
• [tCLK2Q + tLOGIC + trouting + tS] < T
• The rising edge of the clock does not arrive at the clock inputs of all flip-flops at the same time: clock skew

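Clock Skew

**Delay often caused by wire routing delay**

Lag clock skew

[Figure: two cascaded flip-flops (in, out) with a delay in the clock path; waveforms of clk and clk' offset by tskew]

**Lead clock skew**

[Figure: two cascaded flip-flops (in, out) with a delay in the clock path; waveforms of clk and clk' offset by tskew]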

Positive slack

When the data arrives at the capture flip-flop before the capture clock edge less the setup time.

Negative slack

When the data arrives after the capture clock edge less the setup time. Negative slack is an issue.

Lead clock skew is bad because it may cause setup time violations

[Timing diagrams: launch and capture flip-flops separated by combinational logic. Without skew, tCLK2Q, tLOGIC + tROUTE and tS must fit into the clock period T. With skew, the capture clock clk' leads by tskew.]

WITHOUT SKEW: tCLK2Q + tLOGIC + tROUTE + tS < T to avoid a setup time violation

WITH SKEW: tCLK2Q + tLOGIC + tROUTE + tS < (T – tskew) to avoid a setup time violation

• There is less time to perform logic than you normally would have
• Solution: optimize / pipeline / use a faster speed grade
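A minimal sketch of the setup check with lead skew follows. The function name `meets_setup` and the example delays (in ns) are hypothetical and chosen only to show a path that passes without skew and fails once the capture clock leads.

```python
def meets_setup(t_clk2q, t_logic, t_route, t_setup, period, lead_skew=0):
    """True if tCLK2Q + tLOGIC + tROUTE + tS < T - tskew (lead skew)."""
    return t_clk2q + t_logic + t_route + t_setup < period - lead_skew


# Hypothetical path: 1 + 5 + 2 + 2 = 10 ns, clock period 11 ns.
no_skew = meets_setup(1, 5, 2, 2, period=11)
with_lead = meets_setup(1, 5, 2, 2, period=11, lead_skew=2)

print(no_skew, with_lead)
```

Because the period T appears in the inequality, slowing the clock is always one way out of a setup violation.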

**Lag clock skew is bad because it may cause hold time violations**

[Timing diagram: launch flip-flop on clk, combinational logic, capture flip-flop on clk', with clk' lagging clk by tskew; tCLK2Q + tLOGIC + tROUTE is compared against tskew + tH]

LAUNCH (tCLK2Q + tLOGIC + tROUTE) > CAPTURE (tskew + tH) to avoid a hold time violation

• If this is violated, data gets fed into the next register one cycle too early
• There is no clock period (T) in the equation; changing the clock period cannot help this problem!
• Solution: add dummy delay logic (e.g. NOT-NOT)
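The hold check can be sketched the same way. The function name `meets_hold` and the delay values (in ns) are hypothetical; the last call models the "dummy logic" fix by adding delay to the data path.

```python
def meets_hold(t_clk2q, t_logic, t_route, t_hold, lag_skew=0):
    """True if tCLK2Q + tLOGIC + tROUTE > tskew + tH (lag skew)."""
    return t_clk2q + t_logic + t_route > lag_skew + t_hold


ok_without_skew = meets_hold(1, 2, 0, t_hold=2)               # 3 > 2
fails_with_lag = meets_hold(1, 2, 0, t_hold=2, lag_skew=2)    # 3 > 4 is false
fixed_by_buffers = meets_hold(1, 2 + 2, 0, t_hold=2, lag_skew=2)  # 5 > 4

print(ok_without_skew, fails_with_lag, fixed_by_buffers)
```

The function takes no clock period argument, which mirrors the point made on the slide.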

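**Maximum Achievable Frequency**

Maximum-frequency equation (ignoring clock-to-clock jitter):

$$F_{max} = \frac{1}{t_{CLK2Q} + t_{LOGIC} + t_{routing} + t_{S} - t_{skew}}$$

tskew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop. Its sign depends on lead or lag: positive for lag skew, negative for lead skew.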

Reading Assignment


Some Examples

**Example 1: Analyzing Sequential Circuits**

[Figure: FFA (TClk-Q = 5 ns) drives X into combinational logic G (Tlogic+Route = 5 ns), whose output Y feeds FFB (TClk-Q = 5 ns, Ts = 2 ns); FFB outputs Z. Both flip-flops are clocked by CLK.]

° What is the minimum time between rising clock edges?

• Tmin = TCLK-Q (FFA) + TLogic (G) + TRoute (G) + Ts (FFB)

**Example 2: Hold Time Violation**

[Figure: FFA (Tclk2Q = 1 ns) drives X into combinational logic G (Tcd = 2 ns), whose output Y feeds FFB (Th = 2 ns); FFB outputs Z. Both flip-flops are clocked by CLK.]

° Will we get a hold time violation in this example?
° Make sure Y remains stable for the hold time (Th) after the rising clock edge
° Remember: the contamination delay ensures the signal does not change

• TCLK2Q(FFA) + Tcd(G) >= (Tskew + Th), where Tskew is positive for lag
• With no skew: (1 ns + 2 ns) > 2 ns, so there is no hold time violation

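Example 3 (Class Work)

[Figure: FFA (TClk-Q = 5 ns) and combinational logic F (Tlogic+Route = 4 ns, driven by FFB's output Z) both feed combinational logic H (Tlogic+Route = 5 ns), whose output Y drives FFB (TClk-Q = 4 ns, Ts = 2 ns). Both flip-flops are clocked by CLK.]

° What is the minimum clock period (Tmin) of this circuit?
° What if FFB has a clock skew: a lead of 1 ns?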

Solution

° Path FFA to FFB

• TClk-Q(FFA) + Tpd(H) + Ts(FFB) = 5ns + 5ns + 2ns = 12ns

° Path FFB to FFB

• TCLK-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4ns + 4ns + 5ns + 2ns = 15ns

The longer path sets the minimum clock period: Tmin = 15 ns.

**Solution (With Lead of 1 ns for FFB)**

° Path FFA to FFB

• TClk-Q(FFA) + Tpd(H) + Ts(FFB) + Tskew = 5ns + 5ns + 2ns + 1ns = 13ns

° Path FFB to FFB

• TCLK-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4ns + 4ns + 5ns + 2ns = 15ns

The FFB-to-FFB path launches and captures on the same clock, so the skew does not affect it. It remains the longer path, and Tmin is still 15 ns.
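The class-work example can be checked by listing each path and taking the worst one. The function name `min_period` and the tuple layout are assumptions for illustration; the delays are the ones given above (FFA to FFB: 5 + 5 + 2 ns, FFB to FFB: 4 + 4 + 5 + 2 ns). A lead skew on FFB is only charged to paths whose launch flip-flop uses a different clock.

```python
def min_period(paths, lead_skew=0):
    """paths: list of (name, delay_ns, crosses_skewed_clock).
    Returns (critical path name, minimum clock period in ns)."""
    def effective(path):
        name, delay, crosses = path
        return delay + (lead_skew if crosses else 0)
    worst = max(paths, key=effective)
    return worst[0], effective(worst)


paths = [
    ("FFA->FFB", 5 + 5 + 2, True),       # launch on CLK, capture on skewed FFB clock
    ("FFB->FFB", 4 + 4 + 5 + 2, False),  # same clock at both ends
]

print(min_period(paths))               # no skew
print(min_period(paths, lead_skew=1))  # FFB clock leads by 1 ns
```

A larger lead (more than 3 ns in this circuit) would make FFA to FFB the critical path instead.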

Example: Analyzing Sequential Circuits: Hold Time Violations

All paths must satisfy the requirements.

[Figure: FFA (Tclk2Q = 1 ns) and combinational logic F (Tlogic+Route = 1 ns, driven by FFB's output Z) both feed combinational logic H (Tlogic+Route = 2 ns), whose output Y drives FFB (Tclk2Q = 1 ns, Th = 2 ns). Both flip-flops are clocked by CLK.]

° Path FFA to FFB

• TClk2q(FFA) + Tlogic+Route(H) > Th(FFB): 1ns + 2ns > 2ns

° Path FFB to FFB

• TClk2q(FFB) + Tlogic+Route(F) + Tlogic+Route(H) > Th(FFB): 1ns + 1ns + 2ns > 2ns

QUIZ

Quiz No. 3

Name:

Reg. No.

• Que-1: Assuming no CLK skews, what is the minimum clock period at which this circuit can operate, and what is the critical path? (5 Marks)
• Que-2: What is the maximum achievable throughput (in bits/sec) for this circuit? (2 Marks)
• Que-3: What is the minimum clock period at which this circuit can operate IF CLK’ has a lead of 2 ns as compared to CLK? (3 Marks)

TClk-Q = 4 ns, Ts = 2 ns, Tlogic(F) = 4ns, Tlogic(G) = 5ns, Tlogic(H) = 1ns, Tlogic(I) = 2ns

[Figure: 8-bit input and 8-bit output; flip-flops FFA, FFB and FFC with combinational logic blocks F, G, H and I and signals X, Y, Z. FFA and FFB are clocked by CLK, FFC by CLK’.]


**Solution**

No-skew case, all paths:

• A to B: 4 + 4 + 2 = 10 ns
• B to C: 4 + 5 + 2 = 11 ns
• B to B: 4 + 1 + 4 + 2 = 11 ns
• C to B: 4 + 2 + 4 + 2 = 12 ns

The critical path is C to B, so Tmin = 12 ns.

Skew case, all paths:

• A to B: 4 + 4 + 2 = 10 ns
• B to C’: 4 + 5 + 2 + 2 = 13 ns (lead)
• B to B: 4 + 1 + 4 + 2 = 11 ns
• C’ to B: 4 + 2 + 4 + 2 - 2 = 10 ns (lag case)

The critical path is now B to C’, so Tmin = 13 ns.

Optimizing Timing

**(Reading Assignment: Steve Kilts, Section 1.3)**

• FIR
• Pipelining
• Parallel Architecture
• Register Balancing
• Logic Flattening

**Consider an FIR Filter**

The equation for the computation of an L-tap FIR filter is:

$$y[n] = \sum_{k=0}^{L-1} h_k \, x[n-k]$$

If L = 5:

y[0] = h0 x[0] + h1 x[-1] + h2 x[-2] + h3 x[-3] + h4 x[-4]
y[1] = h0 x[1] + h1 x[0] + h2 x[-1] + h3 x[-2] + h4 x[-3]
y[2] = h0 x[2] + h1 x[1] + h2 x[0] + h3 x[-1] + h4 x[-2]
y[3] = h0 x[3] + h1 x[2] + h2 x[1] + h3 x[0] + h4 x[-1]
y[4] = h0 x[4] + h1 x[3] + h2 x[2] + h3 x[1] + h4 x[0]
y[5] = h0 x[5] + h1 x[4] + h2 x[3] + h3 x[2] + h4 x[1]
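As a software reference for the filter equation y[n] = h0 x[n] + h1 x[n-1] + ... + h(L-1) x[n-L+1], a direct evaluation might look like the sketch below. The function name `fir` and the choice to treat samples before n = 0 as zero are assumptions for illustration.

```python
def fir(h, x):
    """Direct evaluation of y[n] = sum over k of h[k] * x[n-k].
    Samples before n = 0 are taken as zero (an assumption of this sketch)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


h = [1, 2, 3, 4, 5]           # L = 5 taps
impulse = [1, 0, 0, 0, 0, 0]
print(fir(h, impulse))         # an impulse input reads the taps back out
```

A model like this is useful as a golden reference when checking the output of an HDL implementation sample by sample.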

**Parallel FIR Implementation**

Critical path?

    module fir(
        output reg [7:0] Y,
        input [7:0] A, B, C, X,
        input clk,
        input validsample);

        reg [7:0] X1, X2;

        always @(posedge clk)
            if (validsample) begin
                X1 <= X;
                X2 <= X1;
                // Three multipliers and two adders sit in one register-to-register path
                Y <= A * X + B * X1 + C * X2;
            end
    endmodule

Technique 1: Pipelining <Reducing TLOGIC+PROPAGATION>

Code

    module fir(
        output reg [7:0] Y,
        input [7:0] A, B, C, X,
        input clk,
        input validsample);

        reg [7:0] X1, X2;
        reg [7:0] prod1, prod2, prod3;

        always @(posedge clk) begin
            if (validsample) begin
                X1 <= X;
                X2 <= X1;
                // Extra register layer after the multipliers
                prod1 <= A * X;
                prod2 <= B * X1;
                prod3 <= C * X2;
            end
            Y <= prod1 + prod2 + prod3;
        end
    endmodule

**Technique 2: Increasing Parallelism <Speeding up the logic process>**

• Optimize the critical path such that logic structures can be implemented in parallel
• Example: for the X-cubed code, break the multipliers into independent operations and then recombine them

**Optimizing Logic by adding Parallelism**

**Assume we are squaring an 8-bit number**

An 8-bit number X can be represented by nibbles A and B: X = {A, B}, where A is the upper nibble and B is the lower nibble.
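To see why splitting into nibbles helps, write the 8-bit number as X = 16·A + B, so that X² = 256·A² + 32·A·B + B². The three 4-bit products are independent and can be computed in parallel before being recombined. The sketch below checks this identity exhaustively; the function name `square_by_nibbles` is an assumption for illustration.

```python
def square_by_nibbles(x):
    """Square an 8-bit value from three independent 4-bit products."""
    a, b = x >> 4, x & 0xF        # upper and lower nibble
    aa, ab, bb = a * a, a * b, b * b   # independent, parallelizable products
    # (16a + b)^2 = 256*a^2 + 32*a*b + b^2
    return (aa << 8) + ((2 * ab) << 4) + bb


all_match = all(square_by_nibbles(x) == x * x for x in range(256))
print(all_match)
```

In hardware, each small multiplier has a shorter delay than one 8x8 multiplier, which is how the critical path is reduced at the cost of the recombining adders.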

**Technique 3: Register Balancing**

<Distribute long logic paths evenly across register layers>

• Keep a balance in the critical path
• Redistribute logic evenly between registers to minimize the worst-case delay between any two registers WITHOUT ADDING EXTRA REGISTERS

**Technique 4: Flatten Logic Structures <Removing redundant logic>**

• Break up logic structures that are coded in a serial fashion
• Avoid priority structures if they are not required

**Example: control signals coming from an address decoder that are used to write four 1-bit registers**

    module regwrite(
        output reg [3:0] rout,
        input clk, in,
        input [3:0] ctrl);

        // Priority structure: each branch depends on all earlier conditions
        always @(posedge clk)
            if (ctrl[0]) rout[0] <= in;
            else if (ctrl[1]) rout[1] <= in;
            else if (ctrl[2]) rout[2] <= in;
            else if (ctrl[3]) rout[3] <= in;
    endmodule

**If the control lines are strobes from an address decoder in another module**

• Each strobe is mutually exclusive to the others, as they all represent a unique address
• Is there any need for a priority structure?

    module regwrite(
        output reg [3:0] rout,
        input clk, in,
        input [3:0] ctrl);

        // Flattened structure: each register is written independently
        always @(posedge clk) begin
            if (ctrl[0]) rout[0] <= in;
            if (ctrl[1]) rout[1] <= in;
            if (ctrl[2]) rout[2] <= in;
            if (ctrl[3]) rout[3] <= in;
        end
    endmodule
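A quick behavioural check shows when the priority version and the flattened version agree. This is a Python model of one clock edge, with assumed function names `write_priority` and `write_flat`. The two behave identically for one-hot (or all-zero) strobes and differ as soon as two strobes are active together.

```python
from itertools import product


def write_priority(rout, ctrl, d):
    """if / else if chain: only the first active strobe writes."""
    rout = list(rout)
    for i in range(4):
        if ctrl[i]:
            rout[i] = d
            break
    return rout


def write_flat(rout, ctrl, d):
    """Independent ifs: every active strobe writes its own register."""
    return [d if ctrl[i] else rout[i] for i in range(4)]


one_hot_or_idle = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
same_for_one_hot = all(
    write_priority(r, c, d) == write_flat(r, c, d)
    for r in product([0, 1], repeat=4)
    for c in one_hot_or_idle
    for d in (0, 1)
)
print(same_for_one_hot)
print(write_priority([0, 0, 0, 0], (1, 1, 0, 0), 1), write_flat([0, 0, 0, 0], (1, 1, 0, 0), 1))
```

The flattening is therefore only safe when mutual exclusivity of the strobes is guaranteed by the surrounding design.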

Tip


**Technique 5: Reordering Paths <Shortening Critical Paths>**

• Mostly done by the synthesizer!
• Reorder the paths in the dataflow to minimize the critical path
• When to use: where multiple paths combine with the critical path. The combined path can be reordered such that the critical path is moved closer to the destination register

Events not mutually exclusive:

    module randomlogic(
        output reg [7:0] Out,
        input [7:0] A, B, C,
        input clk,
        input Cond1, Cond2);

        always @(posedge clk)
            if (Cond1)
                Out <= A;
            else if (Cond2 && (C < 8))
                Out <= B;
            else
                Out <= C;
    endmodule

Reordered version:

    module randomlogic(
        output reg [7:0] Out,
        input [7:0] A, B, C,
        input clk,
        input Cond1, Cond2);

        wire CondB = (Cond2 & !Cond1);

        always @(posedge clk)
            if (CondB && (C < 8))
                Out <= B;
            else if (Cond1)
                Out <= A;
            else
                Out <= C;
    endmodule
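Reordering is only legal if the function is unchanged. The exhaustive check below models both versions of `randomlogic` in Python and compares them for every combination of Cond1, Cond2 and C. The function names and the sentinel values used for A and B are assumptions for illustration.

```python
from itertools import product


def original(cond1, cond2, a, b, c):
    if cond1:
        return a
    elif cond2 and c < 8:
        return b
    return c


def reordered(cond1, cond2, a, b, c):
    cond_b = cond2 and not cond1   # folds the priority of Cond1 into the B condition
    if cond_b and c < 8:
        return b
    elif cond1:
        return a
    return c


A, B = 100, 200   # arbitrary sentinel values for inputs A and B
mismatches = [
    (c1, c2, c)
    for c1, c2, c in product([False, True], [False, True], range(256))
    if original(c1, c2, A, B, c) != reordered(c1, c2, A, B, c)
]
print(len(mismatches))
```

The comparison (C < 8) is the slow term; in the reordered version it is evaluated first and no longer waits behind the Cond1 test.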

**Summary: Architecting Speed**

High Throughput
• Pipelining

Low Latency
• Parallelism
• Pipeline Removal

Timing
• Parallelism
• Pipelining
• Flattening Logic Structure
• Register Balancing
• Path Reordering

In your digital design, make your specification your goal and apply the techniques.

Recap

• A high-throughput architecture is one that maximizes the number of bits per second that can be processed by a design.
• Unrolling an iterative loop increases throughput.
• **The penalty for unrolling an iterative loop is a proportional increase in area.**
• A low-latency architecture is one that minimizes the delay from the input of a module to the output.
• Latency can be reduced by removing pipeline registers.
• The penalty for removing pipeline registers is an increase in combinatorial delay between registers.
• Timing refers to the clock speed of a design. A design meets timing when the maximum delay between any two sequential elements is smaller than the minimum clock period.
• Adding register layers improves timing by dividing the critical path into two paths of smaller delay.
• Separating a logic function into a number of smaller functions that can be evaluated in parallel reduces the path delay to the longest of the substructures.
• By removing priority encodings where they are not needed, the logic structure is flattened, and the path delay is reduced.
• Register balancing improves timing by moving combinatorial logic from the critical path to an adjacent path.
• Timing can be improved by reordering paths that are combined with the critical path in such a way that some of the critical path logic is placed closer to the destination register.

Reading

Chapter 3 of Parhi, VLSI Digital Signal Processing Systems


Optimizing circuit Output Sampling Rate

Reading: **Parhi, VLSI Digital Signal Processing Systems**, Chapter 3 (shall be updated)

The story…

• We know what throughput is (samples/sec)
• The output sample rate depends upon throughput
• Throughput depends upon the clock period
• The clock period depends upon the critical path
• Critical path: the longest combinational path between any two registers in your design
• So if we improve the timing, we improve the throughput

**Techniques for improving Output Sampling Rate**

Register Balancing
• Has no/less penalty to area. Does not add extra delays. Improves timing
• **Delay Transfer Theorem (using DFGs)**

Pipelining
• Has an area penalty. Adds extra delay to the system. Improves timing
• Feed-forward cutset pipelining for DFGs (Data Flow Graphs)
• Register insertion & Delay Transfer Theorem
• Fine-grained pipelining

Parallelism
• Has an area penalty. Increases the number of samples per clock cycle
• May or may not improve timing, but improves the output sampling rate
• Used where the above two techniques fail

Combining Parallelism & Pipelining

**Data Flow Graphs (DFGs)**

(Feed Forward/Non Recursive)

**Direct Form FIR Filters**

[Figure: M-tap FIR filter in direct form. x(n) passes through a chain of Z-1 delays; the taps h0, h1, h2, ..., hM-1 are multiplied and summed to give y(n).]

$$y(n) = \sum_{i=0}^{M-1} h(i)\, x(n-i) = h(n) * x(n)$$

**M-tap FIR filter in direct form**

• TA = delay through adder, TM = delay through multiplier
• Critical path delay: 1 TM + (M-1) TA
• Area: M-1 registers, M multipliers, M-1 adders

**Arithmetic complexity of M-tap filter modeled as:**

M multiplications/sample + M-1 adds/sample

[PAR]
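The critical-path expression 1 TM + (M-1) TA grows with the number of taps. The sketch below compares it with the transposed (data-broadcast) form, whose critical path is given later in these slides as 1 TM + 1 TA. The function names are assumptions, and the unit delays TM = 10 and TA = 2 are borrowed from the later pipelining example purely for illustration.

```python
def direct_form_critical_path(m_taps, t_mult, t_add):
    """Direct form: one multiplier followed by a chain of M-1 adders."""
    return t_mult + (m_taps - 1) * t_add


def transposed_form_critical_path(t_mult, t_add):
    """Transposed form: one multiplier and one adder between registers."""
    return t_mult + t_add


T_M, T_A = 10, 2   # assumed unit delays
for m in (3, 4, 8):
    print(m, direct_form_critical_path(m, T_M, T_A), transposed_form_critical_path(T_M, T_A))
```

The direct form's clock period degrades linearly with filter length, which is the motivation for the pipelining and restructuring techniques that follow.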

**Representations of DSP algorithms and architectures: Block Diagram**

[Figure: Block diagram of a 3-tap FIR filter] [PAR]

**Representations of DSP algorithms and architectures: Data Flow Graph Representation**

• Nodes represent computations/tasks, e.g. addition, multiplication
• The computational time for a node can be specified with the node
• Edges have a non-negative number of delays associated with them
• A node shall only compute once all its input data is ready
• Non-recursive DFG systems have no loops in the DFG

[Figure: Data flow graph of a 3-tap FIR filter] [PAR]

**Understanding T from DFG: Computing Critical Path**

[PAR]

Register Balancing

**Delay Transfer Theorem**

• The delay transfer theorem helps in the systematic shifting of registers across computational nodes.
• **This shifting does not change the transfer function of the original DFG.**
• The theorem states that, without affecting the transfer function of the system, N registers can be transferred from each incoming edge of a node of a DFG to all outgoing edges of the same node, or vice versa.

[SHO]

Convolution Example

[PAR]

Pipelining

• Pipelining using the Delay Transfer Theorem
• Pipelining using FEEDFORWARD CUTSETS

**Pipelining using the Delay Transfer Theorem**

Feedforward only (Example 1)

A convenient way to implement pipelining is to add the desired number of registers to all input edges and then, by repeated application of the node transfer theorem, systematically move the registers to break the delay of the critical path. Functionality is not changed if a register is transferred from all incoming edges of a node (e.g. FA0) to all outgoing edges, and vice versa.

Article 7.2.7 [SHO]

**Pipelining using the Delay Transfer Theorem**

Feedforward only (Example 2)

[SHO]

**Important: Input Nodes**

• For a multiple-input multiple-output system, add a Source Node (generating all inputs) and a Destination Node (receiving/sinking all outputs).
• **This helps avoid confusion in your design about the order of placement of the various nodes.**
• And of course you can't CUT NODES.

[Figure: DFG with a Source Node and a Destination Node added]

**Pipelining using FEEDFORWARD CUTSETS**

CUTSET
• A set of edges that, if removed or cut, results in two DISJOINT graphs (parts): Part 1 & Part 2

**FEED-FORWARD CUTSET**
• A CUTSET is called a FEED-FORWARD CUTSET if the data direction is the same on all paths connecting the two parts (i.e. either from Part 1 to Part 2 or from Part 2 to Part 1)

**FEED-FORWARD CUTSET Pipelining**
• Put a register on every path that is intersected by a valid CUTSET
• In the figure, the red lines show two possible VALID cutsets

[PAR]
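The direction test for a feed-forward cutset is easy to express in code. The sketch below assumes the cut edges are already known to separate the graph into two parts and only checks that every cut edge crosses in the same direction; the function name, the node names and the edge lists are hypothetical.

```python
def is_feedforward_cutset(cut_edges, part1):
    """cut_edges: (src, dst) pairs removed by the cut.
    part1: set of nodes on one side of the cut.
    Assumes the cut edges really do disconnect the graph."""
    directions = set()
    for src, dst in cut_edges:
        if (src in part1) == (dst in part1):
            return False            # this edge does not cross the cut at all
        directions.add(src in part1)
    return len(directions) == 1     # all edges leave the same part


part1 = {"A", "B"}
valid = is_feedforward_cutset([("A", "C"), ("B", "D")], part1)
invalid = is_feedforward_cutset([("A", "C"), ("D", "B")], part1)
print(valid, invalid)
```

When the test passes, placing one register on every cut edge delays all data crossing the cut by the same amount, so only the latency changes and not the function.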

**Technique: DFG-based Pipelining. Example (1/4)**

[PAR]

**Pipelining using FEEDFORWARD CUTSETS**

[Figure: DFG with feed-forward cutset pipelining; figure annotation "=18"]

• In the figure, D denotes registers; the red D's are registers inserted by feed-forward cutset pipelining
• The numbers at each node show the computational time required by that node in the path
• The T-critical after pipelining is now reduced to 4 + 6 = 10
• Note: you can add as many feed-forward cutset pipeline stages as you want

[PAR]

**Pipelining using FEEDFORWARD CUTSETS: An example of an invalid FEED-FORWARD CUTSET**

• The purple line shows an invalid CUTSET. We cannot put pipeline registers along the purple line.
• **Three edges go from Part 1 to Part 2 and one edge goes from Part 2 to Part 1**

[PAR]

**Technique: DFG-based Pipelining. Example (2/4)**

[PAR]

**Technique: DFG-based Pipelining. Example (3/4)**

[Figure: M-tap FIR filter (x(n), Z-1 delay chain, taps h0, h1, h2, ..., hM-1) before and after pipelining. Put a delay (Z-1) on all cuts.]

[PAR]

**Technique: DFG-based Pipelining. Example (4/4)**

**Let Tm = 10 units, Ta = 2 units, desired clock = 6 units. Let the initial design be:**

[Figure: FIR filter with x(n) broadcast to taps hM-1, hM-2, hM-3, ..., h0, with Z-1 delays between the adders and output y(n)]

**Fine-Grained Pipelining**

[Figure: the same filter with additional Z-1 registers; annotation: insert registers here]

[PAR]

Another Example

Note that the critical path is still not ideally balanced.

[SHO]

Fine Grained

[SHO]

**Technique: DFG-based Parallel Processing**

• What if we can't optimize our system any more using pipelining?
• Convert a SISO system to a MIMO system using parallel logic!
• The effective sampling speed is increased by the level of parallelism, L
• Multiple outputs are computed in parallel in a clock period
• A parallel processing system is also called a block processing system, and the number of inputs processed in a clock cycle is referred to as the block size, L

[PAR]

**Technique: DFG-based Parallel Processing. An Example**

Suppose we have multiple inputs available on every processing clock!

[PAR]

**SISO to MIMO Conversion**

[PAR]

**Technique: DFG-based Parallel Processing. Example: FIR Filtering**

**Consider a single-input single-output (SISO) FIR filter:**

y(n) = a x(n) + b x(n-1) + c x(n-2)

Convert the SISO system into a MIMO (multiple-input multiple-output) system in order to obtain a parallel processing structure.

To get a parallel system with L = 2 inputs per clock cycle, we rewrite the equation for even and odd output samples:

y(2k) = a x(2k) + b x(2k-1) + c x(2k-2)
y(2k+1) = a x(2k+1) + b x(2k) + c x(2k-1)

i.e. at the kth cycle two outputs are processed: y(2k) & y(2k+1)

[PAR]
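Substituting n = 2k and n = 2k+1 into y(n) = a x(n) + b x(n-1) + c x(n-2) gives the two equations evaluated in each cycle of the L = 2 system: y(2k) = a x(2k) + b x(2k-1) + c x(2k-2) and y(2k+1) = a x(2k+1) + b x(2k) + c x(2k-1). The sketch below checks that this block form produces the same output stream as the sample-by-sample filter. Function names, coefficients and input data are illustrative assumptions, and samples before n = 0 are taken as zero.

```python
def x_at(x, i):
    """Input sample x(i), with zero assumed before the start of the stream."""
    return x[i] if i >= 0 else 0


def siso(a, b, c, x):
    """One output per clock: y(n) = a x(n) + b x(n-1) + c x(n-2)."""
    return [a * x_at(x, n) + b * x_at(x, n - 1) + c * x_at(x, n - 2)
            for n in range(len(x))]


def mimo2(a, b, c, x):
    """Two outputs per clock (block size L = 2): y(2k) and y(2k+1)."""
    y = []
    for k in range(len(x) // 2):
        y.append(a * x_at(x, 2 * k) + b * x_at(x, 2 * k - 1) + c * x_at(x, 2 * k - 2))
        y.append(a * x_at(x, 2 * k + 1) + b * x_at(x, 2 * k) + c * x_at(x, 2 * k - 1))
    return y


x = [3, 1, 4, 1, 5, 9]
print(siso(2, 3, 5, x) == mimo2(2, 3, 5, x))
```

The block version needs twice the multipliers and adders per clock, but each clock now consumes two inputs and produces two outputs.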

**Important: Delays in a MIMO system**

[PAR]

**2-Parallel 3-Tap Filter**

[PAR]

[PAR]

**Combining Parallelism & Pipelining**

By combining parallel processing (block size: L) and pipelining (pipelining stage: M), the sample period can be reduced to:

[PAR]

Technique: Parallel Processing + Pipelining. Example: FIR Filtering

[PAR]

Questions….

ADSD

Legacy Material

**Representations of DSP algorithms and architectures: Signal Flow Graph Representation**

• A collection of nodes & directed edges
• **Nodes represent computations or tasks, e.g. addition**
• A directed edge (j,k) denotes a branch originating at node j & terminating at node k
• Edge (j,k) denotes a linear transformation from the signal at node j to the signal at node k; it can specify a gain
• Source node: no input edges; sink node: no originating edges

[Figure: Signal flow graph of a 3-tap FIR filter]

**Technique: Signal Flow Graph. From Direct Form to Transpose Form**

Reversing the direction of an SFG and interchanging the input and output ports preserves the functionality of the system.

**Also called the data broadcast structure**

[Figure: transposed FIR filter with x(n) broadcast to taps hM-1, hM-2, hM-3, ..., h0, with Z-1 delays between the adders and output y(n)]

• Critical path delay: 1 TM + 1 TA
• Area: M-1 registers + M multipliers + M-1 adders
• Disadvantages: larger register sizes, depending on the quantization scheme used, since registers are now placed after multiplication; the fanout of x(n) can become prohibitive
