Chapter 4 Retiming

ECE734 VLSI Arrays for Digital Signal Processing 1


Definitions

• Retiming
Retiming is a mapping from a given DFG, G, to a retimed DFG, Gr, such that the corresponding transfer functions of G and Gr differ only by a pure delay z^(-L).
• Purposes
– To facilitate pipelining and thereby reduce the clock cycle time
– To reduce the number of registers needed

(C) 2004-2006 by Yu Hen Hu ECE734 VLSI Arrays for Digital Signal Processing 2
Cut-set Retiming

• Feed-forward cut-set
• Feed-back cut-set
• Delay transfer theorem
– Adding an arbitrary non-negative number of delays to each edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.
– Transferring the same number of delays from the edges of one direction across a feed-back cut-set of a DFG to all edges of the opposite direction across the same cut-set will not alter the output, only its timing.

Feed-forward Cut-Set Retiming
• Consider the FIR digital filter and its DFG:
y(n) = b0·x(n) + b1·x(n-1)
[Figure: DFG of the FIR filter. x(n) and its delayed copy x(n-1) feed the multipliers b0 and b1, whose outputs are added to give y(n).]
• Critical path length = TM + TA
• Select a cut set
• Insert one delay on each edge in the cut set.
• Retiming:
ynew(n) = b0·x(n-1) + b1·x(n-2)
ynew(n) = y(n-1)
[Figure: retimed DFG with one delay D on each edge of the cut set.]
• Critical path = max(TM, TA)
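The effect of the feed-forward cut-set retiming can be checked with a small simulation. This is a minimal sketch, assuming the cut set consists of the two multiplier-to-adder edges and that all registers start at zero; the names `fir` and `fir_retimed` are illustrative and do not come from the slides.

```python
def fir(x, b0, b1):
    """Original filter: y(n) = b0*x(n) + b1*x(n-1), with x(-1) = 0."""
    out, x_prev = [], 0
    for xn in x:
        out.append(b0 * xn + b1 * x_prev)
        x_prev = xn
    return out


def fir_retimed(x, b0, b1):
    """Retimed filter: one register on each multiplier-to-adder edge (assumed cut set)."""
    out, x_prev = [], 0
    reg0 = reg1 = 0  # cut-set registers, assumed to start at zero
    for xn in x:
        out.append(reg0 + reg1)            # the adder only reads the registers
        reg0, reg1 = b0 * xn, b1 * x_prev  # the multipliers only write them
        x_prev = xn
    return out


x = [1, 2, 3, 4, 5]
y = fir(x, 2, 3)
y_new = fir_retimed(x, 2, 3)
print(y)      # [2, 7, 12, 17, 22]
print(y_new)  # [0, 2, 7, 12, 17]
```

The retimed output is the original output delayed by one sample, ynew(n) = y(n-1), while the adder and the multipliers no longer sit on the same register-to-register path.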

Feed-back Cut Set Retiming

• Consider an IIR digital filter
y(n) = a·y(n-2) + x(n)
[Figure: feed-back loop from the adder through 2D and the multiplier a back to the adder.]
loop bound = (TM+TA)/2
clock cycle = TM+TA
• Shift 1 delay to the other edge across a feed-back cut set
[Figure: the same loop with one delay D on each of the two loop edges.]
• Filter remains unchanged.
loop bound = (TM+TA)/2
clock cycle = max(TM, TA)
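That the filter remains unchanged can be illustrated by simulating both register placements. This is a sketch under assumptions: the original loop is taken to hold both delays between the adder and the multiplier, the retimed loop holds one delay on each side of the multiplier, and all registers start at zero. The function names are hypothetical.

```python
def iir_original(x, a):
    """y(n) = a*y(n-2) + x(n), with both delays between adder and multiplier."""
    d1 = d2 = 0  # d1 holds y(n-1), d2 holds y(n-2)
    out = []
    for xn in x:
        y = a * d2 + xn      # multiply and add in the same clock cycle
        out.append(y)
        d2, d1 = d1, y
    return out


def iir_retimed(x, a):
    """Same loop after moving one delay to the multiplier-to-adder edge."""
    d_in = 0   # register before the multiplier, holds y(n-1)
    d_out = 0  # register after the multiplier, holds a*y(n-2)
    out = []
    for xn in x:
        y = d_out + xn                 # the adder works alone in this cycle
        out.append(y)
        d_out, d_in = a * d_in, y      # the multiplier works alone as well
    return out


x = [1, 2, 3, 4, 5]
print(iir_original(x, 2))  # [1, 2, 5, 8, 15]
print(iir_retimed(x, 2))   # [1, 2, 5, 8, 15]
```

Both versions produce the same samples at the same time; only the longest register-to-register path changes, from a multiply followed by an add to a single operation.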

Timing Diagram

• Assume tM = tA = 1 t.u.
• Before retiming
Input:  x(1)  x(2)  x(3)  x(4)
MAC:    1     2     3     4
Output: y(1)  y(2)  y(3)  y(4)
• After retiming
Input:  x(1)  x(2)  x(3)  x(4)  x(5)  x(6)  x(7)  x(8)
Add:    1     2     3     4     5     6     7     8
Output: y(1)  y(2)  y(3)  y(4)  y(5)  y(6)  y(7)  y(8)
Mul:    0     1     2     3     4     5     6     7     8
[Figure label on the multiplier row: a·y(1)]

Feed-back Cut Set Retiming

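• Consider an IIR digital filter
y(n) = a·y(n-1) + x(n)
[Figure: loop through the adder, one delay D and the multiplier a, with input x(n) and output y(n).]
loop bound = (TM+TA)
throughput = 1/(TM+TA)
• Slowed-down version with input x(m), output y(m), and D replaced by 2D, where the slowed-down input is
x(2k-1) = x(k)
x(2k) = 0
[Figure: the same loop with 2D in place of D.]
Clock period = (TM+TA)
Throughput = 1/[2(TM+TA)]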

Slowdown + Retiming
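Start with y(n) = a·y(n-1) + x(n)
[Figure: slowed-down loop with input x(m), output y(m) and one delay D on each loop edge.]
clock cycle = max(TM, TA)
Throughput = 1/[2·max(TM, TA)]

Start with y(n) = a·y(n-2) + x(n)
[Figure: loop with input x(n), output y(n) and one delay D on each loop edge.]
loop bound = (TM+TA)/2
clock cycle = max(TM, TA)
throughput = 1/max(TM, TA)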

Example 3.2.1
• Node delay = 1 t.u.
• Before retiming:
– Critical path: a3 → a4 → a5 → a6
– Clock cycle time = 4
– 2 delay units
• After cut-set retiming
– Critical path: a3 → a5, a4 → a6
– Clock cycle time = 2
– 6 delay units
• After additional retiming
– Critical path: none
– Clock cycle time = 1
– 11 delay units
[Figures: DFG with nodes a1 to a6 before retiming, after cut-set retiming, and after additional retiming.]

Slow Down for Cut-Set Retiming

Node Retiming
• Transfer delay through a node in DFG:
[Figure: delays on the edges of a node v before and after retiming with r(v) = 2.]
• r(v) = # of delays transferred from out-going edges to incoming edges of node v
• w(e) = # of delays on edge e
• wr(e) = # of delays on edge e after retiming
• Retiming equation: for an edge e from node u to node v,
$$w_r(e) = w(e) + r(v) - r(u)$$
subject to wr(e) ≥ 0.
• Let p be a path from v0 to vk through the edges e0, e1, …, e(k-1):
v0 → v1 → … → vk
then
$$w_r(p) = \sum_{i=0}^{k-1} w_r(e_i) = \sum_{i=0}^{k-1} \bigl( w(e_i) + r(v_{i+1}) - r(v_i) \bigr) = w(p) + r(v_k) - r(v_0)$$
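The retiming equation and the path property can be tried out on a small graph. The following is a minimal sketch; the three-node loop, the node names and the helper functions `retime` and `path_delays` are hypothetical and only serve as an illustration.

```python
def retime(w, r):
    """Apply w_r(e) = w(e) + r(v) - r(u) to every edge e = (u, v)."""
    w_r = {(u, v): d + r[v] - r[u] for (u, v), d in w.items()}
    negative = [e for e, d in w_r.items() if d < 0]
    if negative:
        raise ValueError(f"infeasible retiming, negative delay on {negative}")
    return w_r


def path_delays(w, path):
    """Total number of delays along a path given as a list of nodes."""
    return sum(w[(path[i], path[i + 1])] for i in range(len(path) - 1))


# Hypothetical three-node loop u -> v -> x -> u (not taken from the slides).
w = {("u", "v"): 1, ("v", "x"): 0, ("x", "u"): 2}
r = {"u": 0, "v": -1, "x": 0}
w_r = retime(w, r)
print(w_r)  # {('u', 'v'): 0, ('v', 'x'): 1, ('x', 'u'): 2}

# Path property: w_r(p) = w(p) + r(vk) - r(v0) for p = u -> v -> x.
p = ["u", "v", "x"]
print(path_delays(w_r, p), path_delays(w, p) + r["x"] - r["u"])  # 1 1
```

Choosing r(v) = 1 instead would drive the edge from v to x to a negative delay count, which the feasibility check rejects.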

Invariant Properties

1. Retiming does NOT change the total number of delays in any cycle.
2. Retiming does not change the loop bounds or the iteration bound of the DFG.
3. If the same constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.
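Properties 1 and 3 can be checked numerically. The sketch below borrows the edge weights of the four-node retiming example used later in these slides (w(e21) = 1, w(e13) = 1, w(e14) = 2, w(e32) = w(e42) = 0) and the retiming r(2) = 1; the helper names and the shift constant 5 are arbitrary choices made for illustration.

```python
EDGES = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
LOOPS = [(1, 3, 2), (1, 4, 2)]  # loops 1->3->2->1 and 1->4->2->1


def retime(w, r):
    """w_r(e) = w(e) + r(v) - r(u) for every edge e = (u, v)."""
    return {(u, v): d + r[v] - r[u] for (u, v), d in w.items()}


def loop_delays(w, loop):
    """Total number of delays around a loop given as a tuple of nodes."""
    return sum(w[(loop[i], loop[(i + 1) % len(loop)])] for i in range(len(loop)))


r = {1: 0, 2: 1, 3: 0, 4: 0}
retimed = retime(EDGES, r)
shifted = retime(EDGES, {v: value + 5 for v, value in r.items()})

for loop in LOOPS:
    print(loop, loop_delays(EDGES, loop), loop_delays(retimed, loop))
# (1, 3, 2) 2 2
# (1, 4, 2) 3 3
print(retimed == shifted)  # True
```

The r terms cancel around any cycle, which is why the delay count per loop, and therefore every loop bound, is preserved.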

Node Retiming Examples

Original DFG:
y(n) = x(n) + w(n-1)
w(n) = a·y(n-1) + b·y(n-2)

Retimed DFG with r(2) = 1:
y(n) = x(n) + w1(n-1) + w2(n-1)
w1(n) = a·y(n-1)
w2(n) = b·y(n-2)

DFG Illustration of the Example

T = max. {(1+2+1)/2, (1+2+1)/3} = 2 T = max. {(1+2+1)/2, (1+2+1)/3} = 2


Cr. Path delay = 2+1 = 3 t.u Cr. Path Delay = max{2,2,1+1} = 2 t.u
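Both numbers can be recomputed from the graph data. This is a sketch under assumptions: the node computing times (1 t.u. for nodes 1 and 2, 2 t.u. for nodes 3 and 4) and the edge weights are read off the retiming example in these slides, the two loops are listed by hand rather than discovered, and the function names are illustrative.

```python
from fractions import Fraction

T_NODE = {1: 1, 2: 1, 3: 2, 4: 2}  # node computing times in t.u.
LOOPS = [(1, 3, 2), (1, 4, 2)]     # loops 1->3->2->1 and 1->4->2->1
W_BEFORE = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
W_AFTER = {(2, 1): 0, (1, 3): 1, (1, 4): 2, (3, 2): 1, (4, 2): 1}  # r(2) = 1


def iteration_bound(t, w, loops):
    """Maximum over the listed loops of (loop computing time) / (loop delays)."""
    bounds = []
    for loop in loops:
        time = sum(t[v] for v in loop)
        delays = sum(w[(loop[i], loop[(i + 1) % len(loop)])] for i in range(len(loop)))
        bounds.append(Fraction(time, delays))
    return max(bounds)


def critical_path(t, w):
    """Longest computing time over paths made of zero-delay edges only.

    Assumes every loop carries at least one delay, so the zero-delay
    subgraph is acyclic and the recursion terminates.
    """
    zero = {}
    for (u, v), d in w.items():
        if d == 0:
            zero.setdefault(u, []).append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            memo[u] = t[u] + max((longest_from(v) for v in zero.get(u, [])), default=0)
        return memo[u]

    return max(longest_from(u) for u in t)


print(iteration_bound(T_NODE, W_BEFORE, LOOPS), critical_path(T_NODE, W_BEFORE))  # 2 3
print(iteration_bound(T_NODE, W_AFTER, LOOPS), critical_path(T_NODE, W_AFTER))    # 2 2
```

The iteration bound is identical for both graphs, while the critical path drops from 3 t.u. to 2 t.u. and so reaches the bound.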

Retiming for Minimizing Clock Period

• Note that retiming will NOT alter the iteration bound T∞.
• The iteration bound is the theoretical minimum clock period to execute the algorithm.
• Let edge e connect node u to node v. If the node computing time t(u) + t(v) > T∞, then the clock period T > T∞ whenever e carries no delay. For such an edge, we require that
wr(e) ≥ 1
• To generalize, for any path p from v0 to vk, we have
$$w_r(p) = w(p) + r(v_k) - r(v_0)$$
If
$$t(p) = \sum_{i=0}^{k} t(v_i) > T_\infty,$$
then we require wr(p) ≥ 1.
• In other words, for any possible critical path in the DFG whose computing time is larger than T∞, we require wr(p) ≥ 1.

Retiming Example Revisited

T∞ = 2

wr(e21) ≥ 0, since t(2)+t(1) = 2 = T∞.
wr(e13) ≥ 1, since t(1)+t(3) = 3 > T∞.
wr(e14) ≥ 1, since t(1)+t(4) = 3 > T∞.
wr(e32) ≥ 1, since t(3)+t(2) = 3 > T∞.
wr(e42) ≥ 1, since t(4)+t(2) = 3 > T∞.

Use eq. wr(euv) = w(euv) + r(v) – r(u):
w(e21) + r(1) – r(2) = 1 + r(1) – r(2) ≥ 0
w(e13) + r(3) – r(1) = 1 + r(3) – r(1) ≥ 1
w(e14) + r(4) – r(1) = 2 + r(4) – r(1) ≥ 1
w(e32) + r(2) – r(3) = 0 + r(2) – r(3) ≥ 1
w(e42) + r(2) – r(4) = 0 + r(2) – r(4) ≥ 1

Solution continues

• Since the retimed graph Gr remains the same if the same constant is added to all node retiming values, we can set r(1) = 0.
• The inequalities become
1 – r(2) ≥ 0 or r(2) ≤ 1
1 + r(3) ≥ 1 or r(3) ≥ 0
2 + r(4) ≥ 1 or r(4) ≥ –1
r(2) – r(3) ≥ 1 or r(3) ≤ r(2) – 1
r(2) – r(4) ≥ 1 or r(2) ≥ r(4) + 1
• Since 1 ≥ r(2) ≥ r(3) + 1 ≥ 0 + 1 = 1, one must have r(2) = 1.
• This implies r(3) ≤ 0. But we also have r(3) ≥ 0. Hence r(3) = 0.
• These leave –1 ≤ r(4) ≤ 0.
• Hence the two sets of solutions are: r(1) = r(3) = 0, r(2) = 1, and r(4) = 0 or –1.
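The hand derivation can be cross-checked by enumeration. This is a small sketch that fixes r(1) = 0 and tries every integer triple (r(2), r(3), r(4)) in an assumed search window of –3 to 3; the window and the function name `feasible` are choices made here, not part of the slides.

```python
from itertools import product


def feasible(r2, r3, r4, r1=0):
    """The five retiming inequalities of the example, with r(1) = 0 by default."""
    return (
        1 + r1 - r2 >= 0       # w_r(e21) >= 0
        and 1 + r3 - r1 >= 1   # w_r(e13) >= 1
        and 2 + r4 - r1 >= 1   # w_r(e14) >= 1
        and r2 - r3 >= 1       # w_r(e32) >= 1
        and r2 - r4 >= 1       # w_r(e42) >= 1
    )


solutions = [s for s in product(range(-3, 4), repeat=3) if feasible(*s)]
print(solutions)  # [(1, 0, -1), (1, 0, 0)]
```

Only the two triples with r(2) = 1, r(3) = 0 and r(4) = –1 or 0 survive, in agreement with the derivation.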

Systematic Solutions

Given a system of inequalities:
r(i) – r(j) ≤ k; 1 ≤ i, j ≤ N
Construct a constraint graph:
1. Map each r(i) to node i. Add a node N+1.
2. For each inequality r(i) – r(j) ≤ k, draw an edge eji from node j to node i such that w(eji) = k.
3. Draw N edges eN+1,i from node N+1 to each node i, with w(eN+1,i) = 0.

a) The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.
b) If a solution exists, one solution is where r(i) is the minimum length path from the node N+1 to the node i.

Shortest path algorithms (Appendix A):
Bellman-Ford algorithm
Floyd-Warshall algorithm
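The construction above can be sketched in a few lines. This is an illustrative implementation, not the course code: it uses an edge-list form of Bellman-Ford, and the function name `solve_difference_constraints` and the tuple layout (i, j, k) for r(i) – r(j) ≤ k are assumptions made here.

```python
def solve_difference_constraints(n, constraints):
    """Solve r(i) - r(j) <= k for nodes 1..n.

    constraints: list of (i, j, k) tuples.
    Returns [r(1), ..., r(n)], or None if the constraint graph has a negative cycle.
    """
    # Steps 1 and 2: one edge j -> i of weight k per inequality.
    edges = [(j, i, k) for (i, j, k) in constraints]
    # Step 3: extra node n+1 with a zero-weight edge to every node.
    edges += [(n + 1, i, 0) for i in range(1, n + 1)]

    inf = float("inf")
    dist = [inf] * (n + 2)  # index 0 is unused
    dist[n + 1] = 0
    for _ in range(n):      # n+1 nodes, so at most n relaxation passes
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:   # any further improvement reveals a negative cycle
        if dist[u] + w < dist[v]:
            return None
    return dist[1:n + 1]


# The five inequalities of the retiming example in these slides.
example = [(2, 1, 1), (1, 3, 0), (1, 4, 1), (3, 2, -1), (4, 2, -1)]
print(solve_difference_constraints(4, example))  # [-1, 0, -1, -1]

# r(1) - r(2) <= 0 and r(2) - r(1) <= -1 contradict each other.
print(solve_difference_constraints(2, [(1, 2, 0), (2, 1, -1)]))  # None
```

Adding 1 to every entry of the first result gives r = (0, 1, 0, 0), which by the constant-shift invariance describes the same retimed graph.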

Bellman-Ford Algorithm
Find the shortest path from an arbitrarily chosen origin node U to each node in a directed graph if no negative cycle exists.

Given a directed graph:
w(m,n): weight on edge from node m to node n, = ∞ if there is no edge from m to n
r(i,j): the shortest path length from node U to node i using at most j edges

r(i,1) = w(U,i),
r(i,j+1) = min_k {r(k,j) + w(k,i)}, j = 1, 2, …, N-1
if max(r(:,N-1) - r(:,N)) > 0, then there is a negative cycle. Else, r(i,N-1) gives the shortest path length from U to i.

Example (origin U = node 2):
[Figure: directed graph with nodes 1 to 4 and edges 1→2 (-3), 2→3 (1), 2→4 (1), 3→4 (2), 4→1 (1).]

$$W = \begin{bmatrix} 0 & -3 & \infty & \infty \\ \infty & 0 & 1 & 1 \\ \infty & \infty & 0 & 2 \\ 1 & \infty & \infty & 0 \end{bmatrix}, \qquad r = \begin{bmatrix} \infty & 2 & 2 & 2 \\ 0 & 0 & -1 & -1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$

Note that max(r(:,3) - r(:,4)) = 1 > 0, hence there is at least one negative cycle.

spbf.m
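The column-by-column recursion can be reproduced directly. The sketch below is not the course file spbf.m; it is a hypothetical Python rendering of the recursion, using the weight matrix and origin (node 2, index 1 here because Python counts from 0) as reconstructed from the example.

```python
INF = float("inf")


def bellman_ford_table(W, origin):
    """Return (r, has_negative_cycle) for the recursion r(i,j+1) = min_k r(k,j) + w(k,i).

    W[m][n] is the weight of edge m -> n (INF if absent, 0 on the diagonal).
    r[i][j-1] is the shortest path length from the origin to node i using at most j edges.
    """
    n = len(W)
    cols = [[W[origin][i] for i in range(n)]]  # r(i,1) = w(U,i)
    for _ in range(n - 1):                     # j = 1, ..., N-1
        prev = cols[-1]
        cols.append([min(prev[k] + W[k][i] for k in range(n)) for i in range(n)])
    # A strict improvement between the last two columns signals a negative cycle.
    has_negative_cycle = any(a > b for a, b in zip(cols[-2], cols[-1]))
    r = [[col[i] for col in cols] for i in range(n)]  # one row per node
    return r, has_negative_cycle


W = [
    [0, -3, INF, INF],
    [INF, 0, 1, 1],
    [INF, INF, 0, 2],
    [1, INF, INF, 0],
]
r, negative = bellman_ford_table(W, origin=1)
for row in r:
    print(row)
# [inf, 2, 2, 2]
# [0, 0, -1, -1]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
print(negative)  # True
```

The negative cycle found here is 1 → 2 → 4 → 1 with total weight -3 + 1 + 1 = -1.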
Floyd-Warshall Algorithm
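Find the shortest path between all possible pairs of nodes in the graph, provided no negative cycle exists.

Algorithm:
Initialization: R(1) = W;
For k = 1 to N
R(k+1)(u,v) = min{R(k)(u,:) + R(k)(:,v)}
If R(k)(u,u) < 0 for any k, u, then a negative cycle exists. Else, R(N+1)(u,v) is the shortest path from u to v.

Example:
[Figure: directed graph with nodes 1 to 4 and edges 1→2 (-3), 2→3 (1), 2→4 (2), 3→4 (2), 4→1 (1).]

$$W = \begin{bmatrix} 0 & -3 & \infty & \infty \\ \infty & 0 & 1 & 2 \\ \infty & \infty & 0 & 2 \\ 1 & \infty & \infty & 0 \end{bmatrix}, \qquad R^{(2)} = \begin{bmatrix} 0 & -3 & -2 & -1 \\ 3 & 0 & 1 & 2 \\ 3 & \infty & 0 & 2 \\ 1 & -2 & \infty & 0 \end{bmatrix}$$

$$R^{(3)} = R^{(4)} = R^{(5)} = \begin{bmatrix} 0 & -3 & -2 & -1 \\ 3 & 0 & 1 & 2 \\ 3 & 0 & 0 & 2 \\ 1 & -2 & -1 & 0 \end{bmatrix}$$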

Retiming Example

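• For retiming example:
– r(2) – r(1) ≤ 1
– r(1) – r(3) ≤ 0
– r(1) – r(4) ≤ 1
– r(3) – r(2) ≤ –1
– r(4) – r(2) ≤ –1
[Figure: constraint graph with nodes 1 to 5 and edges 1→2 (1), 3→1 (0), 4→1 (1), 2→3 (-1), 2→4 (-1), plus edges of weight 0 from node 5 to nodes 1 to 4.]
• Bellman-Ford Algorithm for Shortest Path (origin: node 5)

$$W = \begin{bmatrix} 0 & 1 & \infty & \infty & \infty \\ \infty & 0 & -1 & -1 & \infty \\ 0 & \infty & 0 & \infty & \infty \\ 1 & \infty & \infty & 0 & \infty \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad R = \begin{bmatrix} 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & -1 & -1 & -1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

The last column gives r(1), …, r(4) = (–1, 0, –1, –1); adding the constant 1 to every value yields r = (0, 1, 0, 0), one of the two solutions found earlier.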

Retiming Example

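• Floyd-Warshall algorithm

$$W = R^{(1)} = \begin{bmatrix} 0 & 1 & \infty & \infty & \infty \\ \infty & 0 & -1 & -1 & \infty \\ 0 & \infty & 0 & \infty & \infty \\ 1 & \infty & \infty & 0 & \infty \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad R^{(2)} = \begin{bmatrix} 0 & 1 & 0 & 0 & \infty \\ -1 & 0 & -1 & -1 & \infty \\ 0 & 1 & 0 & \infty & \infty \\ 1 & 2 & \infty & 0 & \infty \\ 0 & 0 & -1 & -1 & 0 \end{bmatrix}$$

$$R^{(3)} = R^{(4)} = R^{(5)} = R^{(6)} = \begin{bmatrix} 0 & 1 & 0 & 0 & \infty \\ -1 & 0 & -1 & -1 & \infty \\ 0 & 1 & 0 & 0 & \infty \\ 1 & 2 & 1 & 0 & \infty \\ -1 & 0 & -1 & -1 & 0 \end{bmatrix}$$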

Retiming to Reduce Registers
[Figure: delay reduction example.]
• Register Sharing
When a node has multiple fan-out branches with different numbers of delays, the registers can be shared so that only the branch with the max. # of delays is needed.
• Register reduction through node delay transfer from multiple input edges to output edges (e.g. r(v) < 0, since r(v) counts delays moved from out-going to incoming edges)
• Should be done only when the clock cycle constraint (if any) is not violated.

Time Scaling (Slow Down)
• Transforming each delay element (register) D to ND and reducing the sample frequency N-fold will slow down the computation N times.
• During slow down, the processor clock cycle time remains unchanged. Only the sampling cycle time is increased.
• Provides opportunity for retiming and interleaving.
[Figure: original loop with one delay D, input … x(3) x(2) x(1) and output … y(3) y(2) y(1); slowed-down loop with 2D, input … -- x(3) -- x(2) -- x(1) and output … y(3) -- y(2) -- y(1).]
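Time scaling can be illustrated on the first-order loop y(n) = a·y(n-1) + x(n). This is a minimal sketch, assuming zero initial register contents and null samples (zeros) in the idle input slots; `slow_down` and `first_order_loop` are hypothetical helper names.

```python
def slow_down(x, n_slow):
    """Interleave each input sample with n_slow - 1 null samples (zeros)."""
    out = []
    for value in x:
        out.append(value)
        out.extend([0] * (n_slow - 1))
    return out


def first_order_loop(x, a, delay=1):
    """y(m) = a*y(m - delay) + x(m), with the loop registers starting at zero."""
    regs = [0] * delay  # regs[-1] holds the oldest value, y(m - delay)
    out = []
    for xm in x:
        y = a * regs[-1] + xm
        out.append(y)
        regs = [y] + regs[:-1]
    return out


x = [1, 2, 3]
y = first_order_loop(x, 2)                                 # original: one delay D
y_slow = first_order_loop(slow_down(x, 2), 2, delay=2)     # 2-slow: D replaced by 2D
print(y)       # [1, 4, 11]
print(y_slow)  # [1, 0, 4, 0, 11, 0]
```

Every second output of the slowed-down loop reproduces the original sequence, and the extra register per loop is what later retiming can move to shorten the critical path.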
