
VLSI programming

Systolic Design
Book: Parhi, Chapter 7

Rudolf Mak
r.h.mak@tue.nl

18-May-16 Rudolf Mak TU/e Computer Science Systolic 1


Agenda
• Systolic arrays (what, where)
• Regular Iterative Algorithms (RIAs)
• Dependence graphs (regular, reduced)
• Systolic design techniques
– Binding (computations to PEs)
– Scheduling (computations to time slots)
• Examples
– FIR filters, matrix multipliers



FSM reminder

Moore machine vs. Mealy machine

(Figure: each machine consists of combinational logic (CL) and a state register; a Moore machine's outputs depend only on the state, a Mealy machine's outputs also depend directly on the current inputs.)

Chaining Mealy machines may lead to overly long critical paths!


Systolic system (Leiserson)
A systolic system is a set of interconnected Moore
machines that operate synchronously and satisfy
certain smallness (boundedness) conditions:
1. # states is bounded
2. # input ports is bounded
3. # output ports is bounded
4. # neighbor machines is bounded

“#” stands for “number of”



Systolic = Uniform Pipelined SDF
• Uniform:
– Each PE (Moore machine) computes the same
set of combinational functions.
• Regular:
– All PEs are connected to a small finite number of
neighboring PEs via one or more D-elements
according to a regular topology. All connections
are point-to-point connections.
• Synchronous operation:
– All PEs operate in lock step (fire concurrently);
data is pumped through the system, much like
the heart pumps blood through the body (hence
the name systolic).



Relaxations
• To obtain better systems, small relaxations of
the systolic model are allowed:

1. Not all PEs are identical; small deviations are
allowed, especially for PEs at the border of the
system.
2. (A limited form of) broadcasting is allowed. This
means that the PEs become Mealy machines.
1. Such systems are called semi-systolic by Leiserson.
2. Parhi does not make the distinction; instead he uses the
notion fully pipelined for the Moore-machine variant.
3. Connections need not be to nearest neighbors, but
locality must be maintained.



Systolic system

(Figure: a host (a Turing-equivalent machine, such as a PowerPC on an FPGA) connected to a systolic array of PEs (Moore machines), such as a dedicated computing engine on an FPGA.)


Application areas
• Computationally intensive, regular applications:
– Basic linear algebra operations
– Signal processing
– Image processing
– Order statistics, sorting
– Dynamic programming
– High performance computing
• e.g., many particle simulations (in chemistry,
physics or astronomy)



FIR filter (N-tap)

Spec:
  y(n) = Σ_{k=0}^{N−1} h(k)·x(n−k),   n ≥ 0

RIA (first attempt):
  y(n, k) = y(n, k−1) + h(k)·x(n−k),   y(n, −1) = 0,   y(n) = y(n, N−1)

The term x(n−k) does not work!!! It is not an indexed
variable with a fixed index displacement. Introduce a
propagated copy of the input instead:
  x(i, j) = x(i−1, j−1)   or   x(i, j−1)
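The specification can be turned into an executable reference model; a minimal sketch (plain Python; the function name is ours):

```python
def fir_direct(h, x, n):
    """Direct evaluation of the spec: y(n) = sum_{k=0}^{N-1} h(k)*x(n-k),
    with x(m) taken as 0 for m < 0."""
    return sum(h[k] * (x[n - k] if n - k >= 0 else 0) for k in range(len(h)))

h = [1, 2, 3]              # h(0), h(1), h(2): a 3-tap filter
x = [1, 1, 1, 1, 1]        # constant input stream
print([fir_direct(h, x, n) for n in range(5)])   # [1, 3, 6, 6, 6]
```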
Regular Iterative Algorithm

A RIA is a triple consisting of:
1. An index space, e.g. { (i, j) | 0 ≤ i, 0 ≤ j < N }
2. A finite set of variables, e.g. x, h, y
3. A set of direct dependencies among indexed
variables (given as equalities)
• with associated index displacement vectors
• also called fundamental edges by Parhi

Canonical forms:
1. Standard input
2. Standard output



FIR-filter: RIA description

Standard output canonical form:
  h(i, j) = h(i−1, j),   h(0, j) = h(j)
  x(i, j) = x(i, j−1),   x(i, 0) = x(i)
  y(i, j) = y(i−1, j+1) + h(i, j)·x(i, j),   y(i, N) = 0
  output:  y(n) = y(n, 0)

Index displacement vectors (LHS = RHS + IDV):

  h→h      x→x      y→h      y→x      y→y
  (1, 0)   (0, 1)   (0, 0)   (0, 0)   (1, −1)
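These recurrences can be evaluated exactly as written; a minimal sketch (plain Python; names are ours) that checks them against the direct specification:

```python
def fir_ria(h, x, n):
    """Evaluate the standard-output RIA for an N-tap FIR filter:
       h(i,j) = h(i-1,j), h(0,j) = h(j)      -> h(i,j) = h(j)
       x(i,j) = x(i,j-1), x(i,0) = x(i)      -> x(i,j) = x(i)
       y(i,j) = y(i-1,j+1) + h(i,j)*x(i,j),  with y = 0 for j >= N or i < 0
       output: y(n) = y(n,0)."""
    N = len(h)
    def Y(i, j):
        if i < 0 or j >= N:      # boundary of the index space
            return 0
        return Y(i - 1, j + 1) + h[j] * x[i]
    return Y(n, 0)

h, x = [1, 2, 3], [1, 1, 1, 1, 1]
ria = [fir_ria(h, x, n) for n in range(5)]
direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
          for n in range(5)]
assert ria == direct  # both give [1, 3, 6, 6, 6]
```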



Computational node

(Figure: a node g with inputs a, b, c and outputs o1, o2, o3, computing o1 = f1(a, b, c), o2 = f2(a, b, c), o3 = f3(a, b, c).)



Computational node from RIA

(Figure: the node g at index (i, j) receives h(i−1, j), x(i, j−1), and y(i−1, j+1), and produces h(i, j), x(i, j), and y(i, j) = y(i−1, j+1) + h(i, j)·x(i, j).)

I(g) is the index vector, i.e., the sequence of coordinates of g in index-space.
Dependence graphs

1. The nodes of a dependence graph represent (small)
computations. There is a separate node for each
computation.

2. The edges of a dependence graph represent causal
dependencies between computations, i.e., an edge
from node x to node y indicates that the result of
the computation performed by x is used in the
computation performed by y.

3. There is no notion of time in a dependence graph. It is
an (index-)space representation.



FIR: Dependence graph

y(n) = h(0)·x(n) + h(1)·x(n−1) + h(2)·x(n−2)

(Figure: the dependence graph for a 3-tap filter; inputs x(0)…x(4) enter at the top, coefficients h(2), h(1), h(0) propagate along the rows, and outputs y(0)…y(4) leave at the bottom.)


FIR: Dependence graph

(Figure: the same dependence graph with the boundary made explicit: the partial sums entering at the border are initialized to 0.)


Regular dependence graphs

A dependence graph G is regular when:
1. There is an injective mapping I from the
nodes of G to a grid of points in the n-
dimensional index space.
2. There exists a finite set F of vectors, called
fundamental edges, such that every pair (x, y)
of neighboring nodes is mapped to a pair of
grid locations that differ by a fundamental
edge e ∈ F, i.e., I(y) = I(x) + e.



FIR: DG in space representation

(Figure: the dependence graph drawn on the index grid; x(0)…x(4) at the top, h(2), h(1), h(0) along the rows, y(0)…y(4) at the bottom, with edge directions (1, 0) for h, (0, 1) for x, and (1, −1) for y.)

fundamental edges  E = ( e_h | e_x | e_y ) = ( 1  0   1 )
                                             ( 0  1  −1 )
Systolic array design

The design of a systolic array for a computation
given in the form of a regular dependence graph
involves:

1. Choosing a processor space, i.e., a set of dimensions
and a number of PEs per dimension (the array).
2. Mapping each computational node of the graph to a
PE of the array (similar to folding).
3. For each PE, scheduling the computations of the
nodes mapped onto it, i.e., assigning each individual
computation to a distinct time slot.



Design parameters

An (n−1)-dimensional systolic design for an
n-dimensional regular dependence graph is
characterized by:
1. An (n−1) × n processor space matrix P:
   P·I(x) is the processor that executes node x
2. An n-dimensional scheduling vector s:
   s^T·I(x) is the time slot at which node x is executed
3. A projection (iteration) vector d:
   I(x) − I(y) ∝ d implies P·I(x) = P·I(y)


Design constraints

• Computations whose grid locations differ by
a multiple of the projection vector execute on
the same PE
– I(x) − I(y) ∝ d implies P·I(x) = P·I(y)
– hence P·d = 0

• Computations that execute on the same PE
must be scheduled in different time slots
– s^T·I(x) is the time slot at which node x is
executed
– hence s^T·d ≠ 0
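Both constraints are easy to check for a candidate design; a minimal sketch (plain Python; names are ours):

```python
def check_design(P, s, d):
    """Check the two design constraints for a candidate (P, s, d):
       1. P d = 0       (nodes that differ by d map to the same PE)
       2. s^T d != 0    (such nodes get different time slots)"""
    Pd = [sum(row[k] * d[k] for k in range(len(d))) for row in P]
    sTd = sum(s[k] * d[k] for k in range(len(d)))
    return all(v == 0 for v in Pd) and sTd != 0

# Design B1 of the FIR filter: p^T = (0, 1), s^T = (1, 0), d = (1, 0)
print(check_design([(0, 1)], (1, 0), (1, 0)))  # prints True
```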



Processor allocation:  p^T = (0, 1),   d^T = (1, 0)

(Figure: the FIR dependence graph; projecting along d = (1, 0) maps each row of nodes to one processor, so the processor axis runs along j.)
Scheduling:  s^T = (1, 0),   d^T = (1, 0)

(Figure: the FIR dependence graph annotated with time slots; all nodes in column i execute in time slot t = s^T·(i, j)^T = i, so the time axis runs along i.)


Hardware Utilization Efficiency (HUE)

Let x and y be computations with index vectors I(x), I(y)
that are executed on the same PE.
• Then I(x) − I(y) = k·d for some integer k.

Let t_x be the time at which x is scheduled and t_y
be the time at which y is scheduled.
• Then |t_x − t_y| = |s^T·(I(x) − I(y))| = |k|·|s^T·d| ≥ |s^T·d|.

Hence, any PE executes at most 1 computation
per |s^T·d| time slots. So

HUE = 1 / |s^T·d|

Question: What do we call s^T·d?
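The argument above amounts to a one-liner; a minimal sketch (plain Python; the function name is ours). The quantity |s^T·d| is the slowness that shows up later as "2-slow" and "3-slow":

```python
def hue(s, d):
    """Hardware utilization efficiency: HUE = 1 / |s^T d|.
    |s^T d| is the slowness of the design."""
    return 1 / abs(sum(a * b for a, b in zip(s, d)))

print(hue((1, 0), (1, 0)))      # 1.0  (design B1, fully utilized)
print(hue((1, -1), (1, -1)))    # 0.5  (design R1, 2-slow)
```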
From DG to systolic array

Map a DG onto a systolic array as follows:
• Nodes:
– map node x to processing element P·I(x)
• Edges:
– map edge x → y to connection P·I(x) → P·I(y)
– insert s^T·e D-elements in this edge, where
e = I(y) − I(x) is a fundamental edge

Note that there are only finitely many fundamental
edges (independent of the size of the DG), and recall
that each edge is a translation of a fundamental edge.
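The node and edge mapping can be sketched in a few lines (plain Python; names are ours), here applied to design B1 of the FIR filter:

```python
def map_edge(P, s, e):
    """Map a fundamental edge e of the DG to a systolic-array connection:
    direction P e between PEs, carrying s^T e D-elements (delays)."""
    conn = tuple(sum(row[k] * e[k] for k in range(len(e))) for row in P)
    delays = sum(s[k] * e[k] for k in range(len(e)))
    return conn, delays

# Design B1: p^T = (0, 1), s^T = (1, 0); FIR fundamental edges
P, s = [(0, 1)], (1, 0)
for name, e in [("e_h", (1, 0)), ("e_x", (0, 1)), ("e_y", (1, -1))]:
    print(name, map_edge(P, s, e))
# e_h ((0,), 1)  -> weight stays in its PE, behind one delay
# e_x ((1,), 0)  -> input crosses one PE with zero delays (broadcast)
# e_y ((-1,), 1) -> result moves one PE the other way, with one delay
```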
B1: H-stay, X-broadcast, Y-move

d^T = (1, 0),   p^T = (0, 1),   s^T = (1, 0)

  e               p^T·e   s^T·e
  e_h = (1, 0)      0       1
  e_x = (0, 1)      1       0
  e_y = (1, −1)    −1       1

(Figure: linear array of three PEs.)



B1: H-stay, X-broadcast, Y-move

(Figure: three PEs holding h0, h1, h2; the input x(i) is broadcast to all PEs, and the partial sums u(i), v(i) move toward the output y(i) through one D-element per stage.)

y(i) = h0·x(i) + v(i−1),  v(i) = h1·x(i) + u(i−1),  u(i) = h2·x(i) + 0

y(i) = h0·x(i) + h1·x(i−1) + h2·x(i−2)

HUE = 1 / | s^T·d | = 1
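The three PE equations above can be simulated directly; a minimal cycle-accurate sketch (plain Python; names are ours):

```python
def fir_b1(h, xs):
    """Cycle-accurate sketch of the B1 array for a 3-tap FIR:
       y(i) = h0*x(i) + v(i-1),  v(i) = h1*x(i) + u(i-1),  u(i) = h2*x(i).
    u and v are the D-element contents, initially 0; one loop iteration
    is one clock tick, all PEs firing in lock step."""
    h0, h1, h2 = h
    u = v = 0
    ys = []
    for x in xs:                 # x(i) is broadcast to all three PEs
        ys.append(h0 * x + v)    # y(i) = h0*x(i) + v(i-1)
        u, v = h2 * x, h1 * x + u   # the new v uses the old u, i.e. u(i-1)
    return ys

# Impulse response recovers the coefficients:
print(fir_b1((1, 2, 3), [1, 0, 0, 0]))   # [1, 2, 3, 0]
```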
Determining P, s, and d

• Trial-and-error approach
– Pick a combination and check whether the design
constraints are fulfilled.
• Constructive approach
1. Determine a schedule s.
2. Determine a projection
vector d such that s^T·d ≠ 0.
3. Let Q = I − d·d^T / (d^T·d). Then Q is a matrix of rank
n−1 such that Q·d = 0. By sweeping, a zero
row can be created in Q. Drop this row to
obtain an (n−1) × n matrix P.
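Step 3 can be carried out mechanically; a small sketch with exact rational arithmetic (plain Python; the function name is ours):

```python
from fractions import Fraction

def processor_matrix(d):
    """Constructive step 3 (a sketch): Q = I - d d^T / (d^T d) has rank n-1
    and Q d = 0.  Sweeping (row reduction) turns one row into zeros; the
    remaining n-1 rows form an (n-1) x n matrix P with P d = 0."""
    n = len(d)
    dd = sum(v * v for v in d)
    Q = [[Fraction(int(i == j)) - Fraction(d[i] * d[j], dd) for j in range(n)]
         for i in range(n)]
    pivots = []                        # list of (reduced row, pivot column)
    for row in Q:
        row = row[:]
        for prow, lead in pivots:      # eliminate earlier pivot columns
            if row[lead] != 0:
                f = row[lead] / prow[lead]
                row = [a - f * b for a, b in zip(row, prow)]
        for col, v in enumerate(row):  # find this row's pivot, if any
            if v != 0:
                pivots.append((row, col))
                break                  # all-zero rows are dropped
    return [row for row, _ in pivots]

# Kung-Leiserson projection d = (1,1,1): the two returned rows are each
# orthogonal to d and span the same space as P = ((1,0,-1), (0,1,-1)).
P = processor_matrix((1, 1, 1))
```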



FIR-designs (Parhi)

  design  s^T      d^T      p^T     p^T·(e_h|e_x|e_y)  s^T·(e_h|e_x|e_y)  reverse-direction fundamental edge
  B1      (1, 0)   (1, 0)   (0, 1)  (0, 1, −1)         (1, 0, 1)
  F       (1, 1)   (1, 0)   (0, 1)  (0, 1, −1)         (1, 1, 0)
  W1      (2, 1)   (1, 0)   (0, 1)  (0, 1, −1)         (2, 1, 1)
  W2      (1, 2)   (1, 0)   (0, 1)  (0, 1, −1)         (1, 2, −1)         e_y := −e_y
  DW2     (1, −1)  (1, 0)   (0, 1)  (0, 1, −1)         (1, −1, 2)         e_x := −e_x
  B2      (1, 0)   (1, −1)  (1, 1)  (1, 1, 0)          (1, 0, 1)
  R1      (1, −1)  (1, −1)  (1, 1)  (1, 1, 0)          (1, −1, 2)         e_x := −e_x
  R2      (2, 1)   (1, −1)  (1, 1)  (1, 1, 0)          (2, 1, 1)
  DR2     (1, 2)   (1, −1)  (1, 1)  (1, 1, 0)          (1, 2, −1)         e_y := −e_y

(A negative entry in s^T·(e_h|e_x|e_y) means that the corresponding fundamental edge must be reversed.)
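The table can be checked mechanically; a small sketch (plain Python; names are ours) that recomputes p^T·e and s^T·e for every design:

```python
# Fundamental edges of the FIR dependence graph
E = {"e_h": (1, 0), "e_x": (0, 1), "e_y": (1, -1)}

designs = {  # name: (s, d, p), transcribed from the table above
    "B1":  ((1, 0),  (1, 0),  (0, 1)),
    "F":   ((1, 1),  (1, 0),  (0, 1)),
    "W1":  ((2, 1),  (1, 0),  (0, 1)),
    "W2":  ((1, 2),  (1, 0),  (0, 1)),
    "DW2": ((1, -1), (1, 0),  (0, 1)),
    "B2":  ((1, 0),  (1, -1), (1, 1)),
    "R1":  ((1, -1), (1, -1), (1, 1)),
    "R2":  ((2, 1),  (1, -1), (1, 1)),
    "DR2": ((1, 2),  (1, -1), (1, 1)),
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

for name, (s, d, p) in designs.items():
    pE = tuple(dot(p, e) for e in E.values())
    sE = tuple(dot(s, e) for e in E.values())
    assert dot(p, d) == 0 and dot(s, d) != 0   # design constraints hold
    # A negative delay count s^T e signals that the corresponding
    # fundamental edge must be reversed (last column of the table).
    print(f"{name}: p^T E = {pE}, s^T E = {sE}, slowness = {abs(dot(s, d))}")
```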
Design R1: dependence graph

(Figure: the FIR dependence graph redrawn for design R1; x(0)…x(4) enter at the top, h(2), h(1), h(0) propagate along the rows, y(0)…y(4) leave at the bottom; edge directions are (1, 0) for h, (0, −1) for x (reversed), and (1, −1) for y.)

fundamental edges  E = ( e_h | −e_x | e_y ) = ( 1   0   1 )
                                              ( 0  −1  −1 )
Space-time diagram R1

d^T = (1, −1),   p^T = (1, 1),   s^T = (1, −1)

p = p^T·I(x)
t = s^T·I(x)

(Figure: the computations plotted in the (t, p)-plane; on each processor only every other time slot is used.)


Processor allocation R1:  d^T = (1, −1)

(Figure: the R1 dependence graph; projecting along d = (1, −1) maps each anti-diagonal of nodes to one processor.)

p = p^T·(i, j)^T = i + j



Scheduling R1:  s^T = (1, −1),   d^T = (1, −1)

(Figure: the R1 dependence graph annotated with time slots; nodes that map to the same PE are two time slots apart.)

t = s^T·(i, j)^T = i − j



R1: H-move, X-move, Y-stay

d^T = (1, −1),   p^T = (1, 1),   s^T = (1, −1)

  e                p^T·e   s^T·e
  e_h = (1, 0)       1       1
  −e_x = (0, −1)    −1       1
  e_y = (1, −1)      0       2

(Figure: linear array of three PEs.)



R1: H-move, X-move, Y-stay

(Figure: snapshot of the R1 array at time 0; the h-values and the x-values travel through the array in opposite directions, each through one D-element per stage, while the y partial sums stay in the PEs.)

HUE = 1 / | s^T·d | = 1 / 2   (2-slow)


R1: H-move, X-move, Y-stay

(Figure: a later snapshot of the same array.)


(Table: cycle-by-cycle contents of the R1 array for time slots 0 through 13.)



Matrix multiplication (N × N): RIA

Spec:  c(i, j) = Σ_{k=0}^{N−1} a(i, k)·b(k, j)

Auxiliary variables:
  C(i, j, k) = Σ_{l=0}^{k−1} a(i, l)·b(l, j)
  A(i, j, k) = a(i, k−1)
  B(i, j, k) = b(k−1, j)

RIA:
  C(i, j, 0) = 0
  C(i, j, k) = C(i, j, k−1) + A(i, j, k)·B(i, j, k)
  A(i, j, k) = A(i, j−1, k)
  B(i, j, k) = B(i−1, j, k)

output:  c(i, j) = C(i, j, N)
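The RIA above can be executed as written and compared with the direct product; a minimal sketch (plain Python; the function name is ours):

```python
def matmul_ria(a, b):
    """Evaluate the matrix-product RIA directly:
       C(i,j,0) = 0
       C(i,j,k) = C(i,j,k-1) + A(i,j,k)*B(i,j,k)
       A(i,j,k) = a(i,k-1)   (propagates along j)
       B(i,j,k) = b(k-1,j)   (propagates along i)
       output:  c(i,j) = C(i,j,N)"""
    N = len(a)
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            C = 0                              # C(i,j,0)
            for k in range(1, N + 1):
                C = C + a[i][k - 1] * b[k - 1][j]   # C(i,j,k)
            c[i][j] = C                        # c(i,j) = C(i,j,N)
    return c

print(matmul_ria([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```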



Dependence graph for N = 3 (finite!)

(Figure: the 3 × 3 × 3 index space with axes i, j, k; A propagates along j, B along i, and C accumulates along k.)



Kung-Leiserson design

• Scheduling vector:  s^T = (1, 1, 1)
• Projection vector:  d^T = (1, 1, 1)
• Processor space matrix:
    P = ( 1  0  −1 )
        ( 0  1  −1 )
  so P·d = 0 and node (i, j, k) executes on PE (i − k, j − k)

  e                 P·e        s^T·e
  e_A = (0, 1, 0)   (0, 1)       1
  e_B = (1, 0, 0)   (1, 0)       1
  e_C = (0, 0, 1)   (−1, −1)     1

• HUE = 1 / | s^T·d | = 1 / 3
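These parameters can be sanity-checked in a few lines (a sketch; the function name is ours):

```python
def kl_check():
    """Sanity-check the Kung-Leiserson parameters: s^T = d^T = (1,1,1),
    P = ((1,0,-1),(0,1,-1)), so node (i,j,k) runs on PE (i-k, j-k)
    and every DG edge carries exactly one delay."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    s = d = (1, 1, 1)
    P = [(1, 0, -1), (0, 1, -1)]
    assert all(dot(row, d) == 0 for row in P)        # P d = 0
    assert dot(s, d) == 3                            # slowness 3 -> HUE = 1/3
    for e in [(0, 1, 0), (1, 0, 0), (0, 0, 1)]:      # e_A, e_B, e_C
        assert dot(s, e) == 1                        # one D-element per edge
    assert tuple(dot(row, (2, 1, 1)) for row in P) == (1, 0)  # (2,1,1) -> PE (1,0)
    return True

print(kl_check())   # True
```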
Kung-Leiserson (3×3)-matrix multiplication systolic array

(Figure: hexagonal array of PEs at processor coordinates x = i − k and y = j − k; delay-elements not drawn: one on each edge!)
KL-array

(Figure: the processor allocation (binding); the workload is unbalanced across the PEs.)


Dependence graph for N = 3

(Figure: the 3 × 3 × 3 dependence graph again, with axes i, j, k.)



KL-array

(Figure: the 3-slow schedule.)

HUE = 1/3



KL-array details

In addition to the previous slides, the following
issues must be addressed:
• For both A and B there are 5 input streams.
– How are the matrix values distributed over them?
• For C there are 5 output streams.
– How are the resulting values distributed over them?
• How are results that become available at an internal PE
propagated to the border?
• How is this array operated for multiple multiplications?
– Flushing old values can be combined with getting internal
results out.



Summary
1. Systolic architectures are attractive for
implementation media like VLSI circuits
and FPGAs.
2. Starting point for systolic design is a RIA
(or a dependence graph).
3. RIAs can be mapped to systolic arrays in a
systematic fashion.
4. Mapping uses simple linear algebra
techniques.
5. A large variety of designs for a single
problem can be obtained.
Exercise (systolic design)

1. An OCL system is a system that counts (#), for each window of
size M on its input stream, the number of times the last received
value occurs in that window, i.e., for n ≥ M − 1:

     y(n) = #{ k : 0 ≤ k < M : x(n − k) = x(n) },

   where x is the input stream and y the output stream.
   a) Derive a RIA (in standard output form) for this system that satisfies
      the equations

        y(n) = Y(n, M − 1)
        Y(n, j) = #{ k : 0 ≤ k ≤ j : x(n − k) = x(n) }
        X(n, j) = x(n − j)

      Note that X(n, j) = X(n − 1, j − 1)‼
   b) Draw the dependence graph of this RIA for M = 4 (you need to
      draw only the part with 0 ≤ n ≤ 6).



Exercise (systolic design)

2. Consider the scheduling, projection and processor vectors

     s = ( 2 )    d = ( 1 )    p = ( 0 )
         ( 1 )        ( 2 )        ( 1 )

   a) Construct the systolic array that corresponds to these vectors. You
      may assume the existence of a comparator operator = that takes
      two input streams and produces an output stream of ones and
      zeros, for equal and unequal input pairs respectively.
   b) Determine the slowness of your design.

3. Assume that the times to perform comparison and addition are
   given by Tcmp = 1 ns and Tadd = 3 ns, respectively. Give the
   maximum throughput and the latency of your design (taking
   slowness into account). Give the latency both in number of delays
   and in real time.



Exercise (systolic design)

4. Next, replace the scheduling vector by s^T = (1, 0). Compare the
   throughput and latency of the resulting systolic array with those of
   the one with s^T = (2, 1).

5. Consider the design of 4.
   a) Eliminate redundant operators, and optimize the throughput by
      pipelining. Give the resulting throughput and latency.
   b) Next, retime the result of a), keeping throughput and latency fixed, to
      obtain the minimum number of delays.

