
Architectural optimisation techniques

Camille Diou diou@univ-metz.fr

DIOU Camille

Master EAII Sp. RSEE

Microprocessor basics

[Figure: a CONTROLLER (state machine) drives a DATAPATH made of a register file (t1, t2, t3, A, B, C) and an Arithmetic and Logic Unit (ALU), connected to a BUS through tristate components (inputs/outputs).]

Computation example : S = A.x^2 + B.y + C

The CONTROLLER issues one register transfer per cycle; the DATAPATH (registers t1, t2, t3, A, B, C and the ALU) executes it.

#CYCLE   Register transfer   ALU operation
1        t1 <- x             (load)
2        t2 <- y             (load)
3        t3 <- A.t1          X
4        t3 <- t3.t1         X
5        t2 <- B.t2          X
6        t3 <- t2+t3         +
7        out <- t3+C         +

#CYCLES: 7
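The controller sequence can be replayed in software to check what the datapath computes. The sketch below is only an illustration: the function name `run_datapath` and the dictionary used as a register file are assumptions, not part of the slides. It shows that the two successive multiplications by t1 produce the term A.x^2.

```python
def run_datapath(x, y, A, B, C):
    """Replay the controller's register transfers, one per cycle."""
    # Hypothetical register file: three temporaries plus the constants.
    regs = {"t1": 0, "t2": 0, "t3": 0, "A": A, "B": B, "C": C}
    # One controller state per entry: (destination, value computed by the datapath).
    program = [
        ("t1", lambda r: x),
        ("t2", lambda r: y),
        ("t3", lambda r: r["A"] * r["t1"]),
        ("t3", lambda r: r["t3"] * r["t1"]),
        ("t2", lambda r: r["B"] * r["t2"]),
        ("t3", lambda r: r["t2"] + r["t3"]),
        ("out", lambda r: r["t3"] + r["C"]),
    ]
    cycles = 0
    for dest, operation in program:
        regs[dest] = operation(regs)
        cycles += 1
    return regs["out"], cycles


result, cycles = run_datapath(x=3, y=4, A=2, B=5, C=7)
print(result, cycles)
```

With a single ALU each transfer costs one cycle, so the cycle count equals the number of controller states.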


Execution principle

START -> Fetch Next Instruction (Fetch Cycle) -> Execute Instruction (Execute Cycle) -> HALT
The fetch and execute cycles repeat for each instruction until HALT.

A Single accumulator machine

MAR : Memory Address Register
IR : Instruction Register
PC : Program Counter register

[Figure: single accumulator machine. Data flow and control signals between the FSM, ACC, ALU (inputs A and B, output S, function controls), Memory (16 bits wide, 16M words), MAR (LD), IR (Opcode, Address operand) and PC (incr, Branch); the store path, load path and instruction path are marked.]


Single Address Instruction: one of the registers is fixed (= accumulator); AC is an implicit operand
AC := AC <operation> Memory(Address)

Instruction:
bits 15-14 : Opcode
bits 13-0 : Address

Opcode: 00: Load 01: Store 10: Add 11: Branch
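The instruction format can be illustrated with a small decoder. This is a minimal sketch: the names `OPCODES` and `decode` are chosen here for illustration, and the 16-bit word is simply split into its 2-bit opcode (bits 15-14) and 14-bit address (bits 13-0).

```python
# Opcode table taken from the instruction format: 2 bits, 4 operations.
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(word):
    """Split a 16-bit instruction word into (operation name, 14-bit address)."""
    opcode = (word >> 14) & 0b11   # bits 15-14
    address = word & 0x3FFF        # bits 13-0
    return OPCODES[opcode], address


operation, address = decode(0b1000110100110011)
print(operation, format(address, "014b"))
```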

[Figure: the same machine annotated with bus widths: 2-bit Opcode from IR to the FSM, 14-bit Address operand, 14-bit PC and MAR paths, 16-bit data and instruction paths.]

1. Instruction fetch:
 - PC is moved into MAR
 - Read from memory
 - Load instruction into IR
2. Instruction decode:
 - Opcode bits to FSM (ADD)
 - The remaining bits are the operand address

[Figure: instruction fetch and decode. MAR = PC = 10110100110011; Memory returns 1000110100110011, which is loaded into IR; the 2-bit Opcode (10 = Add) goes to the FSM.]

3. Operand fetch:
 - IR<address> -> MAR
 - Read data from memory
4. Instruction execute:
 - Memory to ALU B
 - AC to ALU A
 - ALU Add
 - S to AC

[Figure: operand fetch and execute. Address operand 00110100110011 -> MAR; Memory returns 0101010101110001 on ALU input B; ACC = 0011001101110110 on ALU input A; S = 1000100011100111 is written back to ACC.]

5. Housekeeping:
 - Increment PC
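The five steps can be traced with the bit patterns shown on the slides. The sketch below is an assumption-laden model: the function `step_add` and the dictionary used as memory are hypothetical, and only the Add opcode is handled.

```python
def step_add(memory, pc, acc):
    """Execute one Add instruction on the single accumulator machine."""
    # 1. Instruction fetch: PC -> MAR, read memory, load IR.
    mar = pc
    ir = memory[mar]
    # 2. Instruction decode: 2 opcode bits to the FSM, 14 bits of operand address.
    opcode, operand_address = ir >> 14, ir & 0x3FFF
    if opcode != 0b10:
        raise ValueError("this sketch only models the Add instruction")
    # 3. Operand fetch: IR<address> -> MAR, read data from memory.
    mar = operand_address
    b = memory[mar]
    # 4. Execute: ALU adds A (accumulator) and B (memory), S goes back to AC.
    acc = (acc + b) & 0xFFFF
    # 5. Housekeeping: increment the 14-bit PC.
    pc = (pc + 1) & 0x3FFF
    return pc, acc


memory = {
    0b10110100110011: 0b1000110100110011,  # the Add instruction
    0b00110100110011: 0b0101010101110001,  # its operand
}
pc, acc = step_add(memory, pc=0b10110100110011, acc=0b0011001101110110)
print(format(pc, "014b"), format(acc, "016b"))
```

The values are the ones visible in the figures: the accumulator ends at 1000100011100111 and the PC at 10110100110100.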

[Figure: housekeeping. PC is incremented from 10110100110011 to 10110100110100.]

A simple microprocessor : Architecture


[Figure: architecture of the simple microprocessor: 16x16 registers, signals to the controller (FSM), data to/from memory, address to memory.]

A simple microprocessor : Instruction format

[Figures: instruction format tables, with columns Instruction, Instruction format and Action.]

A simple microprocessor : test program


What will it do ?

0000 7C0A ; LOAD RC, #A
0001 8C00 ; ...
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...

Compiler dependency detection for ILP

Detect data dependency at compile time:

examples:
c[i]=a[i]+b[i]; d[i]=a[i]+c[j];    potential dependency: c[i] might be c[j]
c[1]=a[i]+b[i]; d[i]=a[i]+c[2];    no dependency: c[1] is never c[2]
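The two examples can be expressed as a conservative test on array indices. This is a deliberately simplified sketch: the function `may_depend` is hypothetical, constant indices are modelled as integers and symbolic indices as strings, and real compilers use much finer analyses.

```python
def may_depend(write_index, read_index):
    """Return True when array[write_index] and array[read_index] may be the same element."""
    both_constant = isinstance(write_index, int) and isinstance(read_index, int)
    if both_constant:
        # Constant indices can be compared at compile time.
        return write_index == read_index
    # At least one index is symbolic: its value is unknown at compile time,
    # so the compiler has to assume a potential dependency.
    return True


print(may_depend("i", "j"))  # c[i] written, c[j] read
print(may_depend(1, 2))      # c[1] written, c[2] read
```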


Systolic ring

Reconfigurable computing : Instruction level parallelism (ILP)

Superscalar processors must find dataflow graph at run time


Reconfigurable architectures construct the data flow graph at compile time:
No FU limitations
No control logic overhead
No window size limitations

General Purpose Computer:
add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2

RC scheme:
[Figure: the same computation drawn as a data flow graph, with inputs r1, r2, r3, intermediate values r4, r5, r6 and outputs r1, r2.]

Question: what is the advantage of RC over superscalar?
Answer: the dataflow graph is constructed at compile time, thus no run-time overhead.
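Building the data flow graph amounts to finding, for each instruction, which earlier instructions produce its operands. The sketch below assumes the three-operand listing is written as `op source1, source2, destination` (an interpretation of the slide, not something it states) and uses hypothetical names `PROGRAM` and `schedule_levels`. Only true (read-after-write) dependencies are tracked.

```python
# The general purpose instruction sequence, assumed to be (op, src1, src2, dest).
PROGRAM = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]


def schedule_levels(program):
    """Return the earliest step at which each instruction can execute."""
    last_writer = {}  # register name -> index of the instruction that last wrote it
    levels = []
    for index, (_, src1, src2, dest) in enumerate(program):
        producers = [last_writer[s] for s in (src1, src2) if s in last_writer]
        # An instruction runs one step after its latest producer, or at step 1.
        level = 1 + max((levels[p] for p in producers), default=0)
        levels.append(level)
        last_writer[dest] = index
    return levels


print(schedule_levels(PROGRAM))
```

Instructions that share a level have no dependency between them and can be mapped onto parallel functional units.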


Reconfigurable computing : Why now ?


Increasing number of transistors
Complexity and cost of chip design increase fast
Current computing demands are RC friendly: desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption, filters (dataflow-oriented applications)


RA versus microprocessors
RA less flexible (like a VLIW with fixed instructions)

but
RA provides more (customized) computation elements
RA can decrease memory traffic
RA can be tailored for specific algorithms and data types

RA will not replace µP, but complement them


Systolic computing : definition


"A set of simple processing elements with regular and local connections which takes external inputs and processes them in a predetermined manner in a determined fashion" (H.T. Kung)


Systolic computing : characteristics of best RC design


Simple PE
Regular and local interconnect
Pipeline between PEs
I/O at boundary


Coarse grain RA model

In abstract : instructions configure both PE and interconnect every cycle
In reality : instruction bandwidth / memory too high, so COMPROMISE

Communications

Relationship of communication among processors:
Shared clock (Pipelined)
Shared registers (VLIW)
Shared memory (SMM)
Shared network


Reconfigurable computing

[Figure: a Program mapped onto the actual available hardware: some instructions are currently in hardware, the other instructions are paged out.]

Finite Impulse response filter (FIR)

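$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$

[Figure: direct-form FIR structure. x_n feeds a chain of Z^-1 delays; each tap is multiplied by one coefficient and the products are summed into y_n.]

3 coefficients filter

$y(n) = a_0 \, x(n-1) + a_1 \, x(n-2) + a_2 \, x(n-3)$

[Figure: the same structure with three taps a2, a1, a0 and three Z^-1 delays.]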
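As a reference for the implementations that follow, the FIR definition can be evaluated directly. This is a plain software sketch, assuming that samples before the start of the sequence are zero; the function name `fir` is chosen here for illustration.

```python
def fir(samples, coeffs):
    """Direct evaluation of y(n) = sum_i coeffs[i] * x(n - i - 1)."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1
            if k >= 0:  # samples before x(0) are assumed to be zero
                acc += a * samples[k]
        outputs.append(acc)
    return outputs


# 3 coefficients filter with a0 = 1, a1 = 2, a2 = 3
print(fir([1, 2, 3, 4], [1, 2, 3]))
```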

Systolic FIR implementation

[Figure sequence: construction of the systolic FIR array from MAC units. The successive steps include:]
Optimize outer loop, preload repeated value
Optimize outer loop, broadcast common value
Optimize outer loop, retime to eliminate broadcast

The Systolic Ring
Coarse grain architecture
Multi-mode dynamical reconfiguration
Scalable, bidimensional array
VHDL design
Designed for SoC integration

Constitution

Dnode : word-level processing unit
Optimized datapath (16 bits)
Register file (4x16 bits)
Hardwired ALU and multiplier

Features
Complex computations in local mode (FIR, IIR, WT)
Low silicon area (0.07 mm², 0.18 µm CMOS process)
Single-cycle operations (ex: MAC + register load)

[Figure: Dnode datapath: instruction input (inst.), Reg FILE, ALU + MULT.]

Constitution

Local controller : dynamical reconfiguration at the Dnode level
8 configuration registers
3 different run modes
1 programming mode

[Figure: Dnode with its local controller. Configuration registers reg0 to reg7 feed the instruction input (inst.) through a Controller and Decoder (signals ck, mode, wait, enex, inhib); multiplexers select the operands among In(1,2), fifo(1,2), bus and Rp(i,j) (i=1~4, j=1~2); the Reg FILE and ALU + MULT produce the output (out).]

Programming mode
[Figure sequence: on successive clk cycles, Instruction 0, Instruction 1, Instruction 2 and Instruction 3 are loaded into the configuration registers of the local controller.]

Run-mode 1 : Fixed
[Figure sequence: Instruction 0 is applied to the Dnode on every clk cycle.]

Run-mode 2 : Dynamic (one-time or loop)
[Figure sequence: Instruction 1, Instruction 2 and Instruction 3 are applied on successive clk cycles.]
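The programming, fixed and dynamic modes of the Dnode local controller can be mimicked with a small software model. Everything in this sketch is hypothetical: the class `DnodeController`, its method names and the way a dynamic sequence is delimited are assumptions made for illustration; only the 8 configuration registers and the mode names come from the slides.

```python
class DnodeController:
    """Toy model of a Dnode local controller with 8 configuration registers."""

    def __init__(self):
        self.config = [None] * 8

    def program(self, instructions):
        """Programming mode: store one instruction per clock cycle."""
        if len(instructions) > len(self.config):
            raise ValueError("at most 8 instructions fit in the configuration registers")
        for slot, instruction in enumerate(instructions):
            self.config[slot] = instruction
        return len(instructions)  # number of clock cycles spent programming

    def run_fixed(self, cycles, slot=0):
        """Fixed mode: the same instruction is issued on every cycle."""
        return [self.config[slot]] * cycles

    def run_dynamic(self, cycles, first, last, loop=False):
        """Dynamic mode: issue registers first..last in order, once or in a loop."""
        sequence = self.config[first:last + 1]
        issued = []
        for cycle in range(cycles):
            if cycle >= len(sequence) and not loop:
                break  # one-time execution stops after the last instruction
            issued.append(sequence[cycle % len(sequence)])
        return issued


controller = DnodeController()
controller.program(["inst0", "inst1", "inst2", "inst3"])
print(controller.run_fixed(4))
print(controller.run_dynamic(5, first=1, last=3, loop=True))
```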
Array structure

Scalable
Unidirectional communications between neighbours
Hard to implement a datapath with greater pipeline depth than the array
Hard to implement recursive operations

Use of a Ring structure
Use of a bi-dataflow structure (forward dataflow and reverse dataflow)

[Figure: array, then ring, of configurable processing blocks linked by unidirectional switches; main dataflow (unidirectional) from INPUTS to OUTPUTS; BUS : shared resource.]
Systolic Ring architecture

Forward dataflow
No complex data routing problems (crossbars)
Unidirectional data transfers between adjacent layers (pipeline)
Linear performance increase with the number of Dnodes
Peak power : 3200 MIPS@200MHz (16 Dnodes version)

Switch components:
Direct FIFO connection for data injection
BUS connection for RISC communication
Full connectivity between 2 Dnode layers

Local mode : stand-alone
Global mode : FPGA like

[Figure: ring of Dnode layers (Layer n-1, Layer n, Layer n+1) separated by Switches with I/O ports, around a central Config. controller; the forward dataflow goes around the ring.]
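The peak figure quoted for the 16-Dnode version is consistent with each Dnode completing one operation per clock cycle. The helper below makes that arithmetic explicit; `peak_mips` and its `ops_per_cycle` parameter are names introduced here, and one operation per Dnode per cycle is an assumption inferred from the quoted figure.

```python
def peak_mips(n_dnodes, clock_mhz, ops_per_cycle=1):
    """Peak throughput in MIPS, assuming every Dnode issues ops_per_cycle each cycle."""
    return n_dnodes * clock_mhz * ops_per_cycle


print(peak_mips(16, 200))  # 16 Dnodes at 200 MHz
print(peak_mips(8, 200))   # linear scaling: half the Dnodes, half the peak
```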
Reverse dataflow
Each switch writes computed data in its own feedback pipeline
Each switch has read ports on the other switches' pipelines
Easy implementation of various recursive algorithms (IIR, WT)

[Figure: ring of D-Nodes and Switches with the feedback pipelines.]

2-level dynamically reconfigurable architecture:

Global mode (first level)
The program which manages the configuration runs on the RISC processor (1)
The configuration of an entire cluster can be modified at each clock cycle (2)
The operating layer computes the data coming from the host processor (3)

Local mode (second level)
Each Dnode runs its own program of up to 8 instructions

[Figure: Host µP sending DATA and CONFIG; Config Controller with RAM running the MANAGEMENT CODE (1); CONFIGURATION layer (2); OPERATING layer of Dnodes (Reg FILE, ALU + MULT) (3).]
8 Dnodes version
ST* CMOS process 0.25 µm & 0.18 µm

Features :
Parametrizable core (number of Dnodes)
Good performance / cost tradeoff (Ring-8@200MHz Systolic Ring system): 1600 MIPS (PII@450MHz : 400 MIPS), 3 Gb/s bandwidth
Low Dnode area
Possible to realize 128 Dnodes versions
Suited as an IP core for SoC

                   Area        Frequency
Ring-8 0.25 µm     0.9 mm2     150 MHz
Ring-8 0.18 µm     0.7 mm2     200 MHz
Dnode 0.18 µm      0.04 mm2    200 MHz

*: ST: STMicroelectronics

Assembly-level programming

Each line holds a RISC instruction (r:), a layer selection (M1:, M2:) and the Dnodes instructions (N1:, N2:).

0000 r:ldl(0,8) M1: N1:clr N2:clr
0001 r:ldl(1,2) M2: N1:clr N2:clr
0002 r:dec(0,0) M1: N1:add(fifo1,fifo1) N2:sub(fifo1,fifo1)
0003 r:jnz(1) M2: N1:mac(in1) N2:mac(in2)
0004 r: halt

[Figure: tool flow. The Assembler output (File1.bin) is loaded into the RAM of the FPGA Prototype; the Simulator Testbench (File2.m) drives a RAM and the Ring-8.]
FIR filter : edge detection

Convolution mask : [ -1 1 0 ]
$y_n = x_n - x_{n-1}$

Assembly code
0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0) M1: N2:sub(fifo,fifo)

[Figure: Assembler, Simulator Testbench (File2.m), RAM and Ring-8; timing diagrams; input image and output image.]

Polynomial calculus
P(x) = a.x + b.x^2 + c.x^3

Operation        Destination   Comment
x                reg0          /* load reg0,x */
x.x              reg1          /* load reg1,x^2 */
reg0.reg1        reg2          /* load reg2,x^3 */
a.reg0           ACC           /* load ACC,a.x */
b.reg1 + ACC     ACC           /* load ACC,a.x+b.x^2 */
c.reg2 + ACC     ACC           /* load ACC,a.x+b.x^2+c.x^3 */

[Figure sequence: the Dnode (reg0, reg1, reg2, ACC, ALU + MULT) executing the successive steps; the coefficients a, b, c are applied in turn and ACC holds a.x, then a.x+b.x^2, then a.x+b.x^2+c.x^3.]
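The register-level sequence of the polynomial example can be replayed as ordinary assignments. The sketch keeps the register names of the slides (reg0, reg1, reg2, ACC); the function name `polynomial` is an illustrative choice.

```python
def polynomial(a, b, c, x):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the Dnode register sequence."""
    reg0 = x                # load reg0, x
    reg1 = x * x            # load reg1, x^2
    reg2 = reg0 * reg1      # load reg2, x^3
    acc = a * reg0          # load ACC, a.x
    acc = b * reg1 + acc    # MAC: ACC = a.x + b.x^2
    acc = c * reg2 + acc    # MAC: ACC = a.x + b.x^2 + c.x^3
    return acc


print(polynomial(2, 3, 4, 5))
```

Each line after the loads corresponds to one multiply or multiply-accumulate, the kind of operation a Dnode performs in a single cycle.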

Finite Impulse response filter (FIR)

$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$

[Figure: x_n feeding a chain of Z^-1 delays with one coefficient per tap, summed into y_n.]

3 coefficients filter

$y(n) = a_0 \, x(n-1) + a_1 \, x(n-2) + a_2 \, x(n-3)$

FIR implementation

3 Dnodes / layer architecture use
Pipelined implementation
Samples are injected through dedicated lines
Coefficients loaded during first cycle
Partial sums are passed from one Dnode to the next (feedback)

Cycle 1 : x0 injected on the three lines; coefficients a2, a1, a0 loaded
Cycle 2 : x1 injected; Dnode 1 (MAC) holds a2.x0
Cycle 3 : x2 injected; Dnode 1 holds a2.x1; Dnode 2 (MAC) holds a2.x0+a1.x1
Cycle 4 : x3 injected; Dnode 1 holds a2.x2; Dnode 2 holds a2.x1+a1.x2; output a2.x0+a1.x1+a0.x2
Cycle 5 : x4 injected; Dnode 1 holds a2.x3; Dnode 2 holds a2.x2+a1.x3; output a2.x1+a1.x2+a0.x3
The same pattern continues on the following cycles.

OPTIMAL IMPLEMENTATION : 1 SAMPLE / CYCLE
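The pipelined schedule can be checked with a cycle-by-cycle model of the three MAC Dnodes. This is a sketch under assumptions: the function `systolic_fir` is hypothetical, each Dnode is modelled as one registered multiply-accumulate that adds its product to the partial sum its neighbour produced on the previous cycle, and the coefficients are listed in chain order (a2, a1, a0).

```python
def systolic_fir(samples, coeffs):
    """Simulate a chain of MAC stages; coeffs are given in chain order (a2, a1, a0)."""
    stage_out = [0] * len(coeffs)  # registered output of each Dnode
    results = []
    for x in samples:              # one sample is broadcast to all stages per cycle
        new_out = []
        for k, a in enumerate(coeffs):
            # Partial sum produced by the previous stage on the previous cycle.
            partial_in = stage_out[k - 1] if k > 0 else 0
            new_out.append(partial_in + a * x)
        stage_out = new_out
        results.append(stage_out[-1])  # the last stage delivers one result per cycle
    return results


# a2 = 3, a1 = 2, a0 = 1; after the pipeline fills, one output per injected sample
print(systolic_fir([1, 2, 3, 4], [3, 2, 1]))
```

Once three samples have been injected, the last stage holds a2.x0+a1.x1+a0.x2, and every further sample yields one more filter output.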

6 coefficients filter

$y(n) = a_0 x(n-1) + a_1 x(n-2) + a_2 x(n-3) + a_3 x(n-4) + a_4 x(n-5) + a_5 x(n-6)$

[Figure: two layers of MAC Dnodes (coefficients a2, a1, a0 and a5, a4, a3) chained through an inter-layer feedback; x_n is injected in both layers and the result is y_n.]

Discrete Cosine Transform

Usually a bidimensional 8x8 points DCT
Very demanding algorithm

Compression : Original image -> DCT -> DCT coeff. -> Quantization -> Quantized coeff. -> Coding -> Compressed image
Decompression : Compressed image -> Decoding -> inverse Quantization -> iDCT -> Decompressed image

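DCT algorithm

Direct transform
$z_k = \sqrt{\frac{2}{N}} \, \alpha(k) \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N}$,  k = 0,1,...,N-1

Inverse transform
$x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k) \, z_k \cos\frac{(2n+1)k\pi}{2N}$,  n = 0,1,...,N-1

with
$\alpha(k) = 1/\sqrt{2}$ for k = 0, 1 else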
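The direct and inverse transforms can be written almost literally in code, which also gives a quick way to check that they invert each other. The sketch assumes the orthonormal form of the DCT (scale factor sqrt(2/N), and a weight of 1/sqrt(2) for k = 0, called `alpha` here); the function names are illustrative.

```python
import math


def alpha(k):
    """Weight of coefficient k: 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Direct 1D transform of a list of N samples."""
    n_points = len(x)
    scale = math.sqrt(2 / n_points)
    return [
        scale * alpha(k) * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


def idct(z):
    """Inverse 1D transform of a list of N coefficients."""
    n_points = len(z)
    scale = math.sqrt(2 / n_points)
    return [
        scale * sum(
            alpha(k) * z[k] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for k in range(n_points)
        )
        for n in range(n_points)
    ]


samples = [1.0, 5.0, 2.0, 8.0, 3.0, 0.0, 7.0, 4.0]
restored = idct(dct(samples))
print([round(v, 6) for v in restored])
```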


Image

64x64 points
8x8 pixels blocks
16 bits coded image
64 blocks of 8x8

[Figure: initial 64x64 image divided into 8x8 blocks of pixels x_{0,0} ... x_{7,7}.]

Implementation

Matrix implementation
Even / odd frequency decomposition of the DCT algorithm

(z0, z2, z4, z6)^T = [4x4 coefficient matrix] . (x0+x7, x1+x6, x2+x5, x3+x4)^T
(z1, z3, z5, z7)^T = [4x4 coefficient matrix] . (x0-x7, x1-x6, x2-x5, x3-x4)^T

The matrix entries are built from seven constants: cos(π/4), cos(π/8), sin(π/8), cos(π/16), cos(3π/16), sin(3π/16), sin(π/16).

$z = \sqrt{\frac{2}{N}} \, T(N) \, x$
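The even / odd decomposition relies on the symmetry cos((2(7-n)+1)kπ/16) = (-1)^k cos((2n+1)kπ/16): even coefficients only need the sums x_n + x_{7-n}, odd coefficients only the differences x_n - x_{7-n}. The sketch below checks this numerically against the direct formula; function names are illustrative and the orthonormal scaling (sqrt(2/N), weight 1/sqrt(2) for k = 0) is assumed.

```python
import math

N_POINTS = 8


def weight(k):
    """1/sqrt(2) for k = 0, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct8_direct(x):
    """8-point DCT computed with the full 8-term sum."""
    scale = math.sqrt(2 / N_POINTS)
    return [
        scale * weight(k) * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)
        )
        for k in range(8)
    ]


def dct8_even_odd(x):
    """8-point DCT computed from 4 sums (even k) and 4 differences (odd k)."""
    sums = [x[n] + x[7 - n] for n in range(4)]    # first layer: ADD
    diffs = [x[n] - x[7 - n] for n in range(4)]   # first layer: SUB
    scale = math.sqrt(2 / N_POINTS)
    z = []
    for k in range(8):
        half = sums if k % 2 == 0 else diffs
        # Second layer: 4 multiply-accumulate operations per coefficient.
        z.append(scale * weight(k) * sum(
            half[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(4)
        ))
    return z


samples = [1.0, 5.0, 2.0, 8.0, 3.0, 0.0, 7.0, 4.0]
print(max(abs(a - b) for a, b in zip(dct8_direct(samples), dct8_even_odd(samples))))
```

The decomposition halves the number of multiplications per coefficient, from 8 to 4.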

Coefficients coding

Fixed point

$z_k = \sqrt{\frac{2}{N}} \, \alpha(k) \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N}$

Example : n=6

Coefficient    Value               Negated value
cos(π/4)       0000000000010110    1111111111101010
cos(π/8)       0000000000011101    1111111111100011
sin(π/8)       0000000000001100    1111111111110100
cos(π/16)      0000000000011111    1111111111100001
cos(3π/16)     0000000000011010    1111111111100110
sin(3π/16)     0000000000010001    1111111111101111
sin(π/16)      0000000000000110    1111111111111010
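The 16-bit patterns of the coefficient table can be regenerated from the cosine and sine values. The sketch assumes 5 fractional bits (a scale of 32) with truncation and two's complement for negative values; this scale is inferred from the listed bit patterns rather than stated on the slide, and the names `FRACTION_BITS`, `COEFFICIENTS` and `to_fixed16` are illustrative.

```python
import math

FRACTION_BITS = 5  # assumption: inferred from the bit patterns in the table

COEFFICIENTS = {
    "cos(pi/4)": math.cos(math.pi / 4),
    "cos(pi/8)": math.cos(math.pi / 8),
    "sin(pi/8)": math.sin(math.pi / 8),
    "cos(pi/16)": math.cos(math.pi / 16),
    "cos(3pi/16)": math.cos(3 * math.pi / 16),
    "sin(3pi/16)": math.sin(3 * math.pi / 16),
    "sin(pi/16)": math.sin(math.pi / 16),
}


def to_fixed16(value):
    """Encode a real coefficient as a 16-bit two's complement bit string."""
    scaled = int(value * (1 << FRACTION_BITS))  # int() truncates toward zero
    return format(scaled & 0xFFFF, "016b")


for name, value in COEFFICIENTS.items():
    print(name, to_fixed16(value), to_fixed16(-value))
```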

Implementation :
ADD and SUB on the first Dnode layer
Multiply-accumulate operations (MAC) on the second Dnode layer

[Figure: first layer: Dnode1 computes x_n + x_{(N-1)-n}, Dnode2 computes x_n - x_{(N-1)-n}. Second layer: Dnode1 (MAC) produces z0, z2, z4, z6 and Dnode2 (MAC) produces z1, z3, z5, z7. Coefficients and Config are supplied to each layer.]
Computing

Even part : (z0, z2, z4, z6) from (x0+x7, x1+x6, x2+x5, x3+x4)
Odd part : (z1, z3, z5, z7) from (x0-x7, x1-x6, x2-x5, x3-x4)

t=0 (n=0) : first layer receives x0, x7. Config M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
t=1 (n=1) : first layer receives x1, x6; second layer MAC on x0+x7 and x0-x7. Config M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
t=2 (n=2) : first layer receives x2, x5; second layer MAC on x1+x6 and x1-x6
t=3 (n=3) : first layer receives x3, x4; second layer MAC on x2+x5 and x2-x5
t=4 (n=4) : second layer MAC on x3+x4 and x3-x4
t=5 (n=0) : second layer outputs z0 and z1, then clear. Config M1: N1:clear N2:clear

Results
2 transforms issued every 5 machine cycles
Clear performed during addition
20 cycles for 8 samples
Achievable parallelism on an 8-Dnode structure : Ring-8

[Figure: Ring-8 with four layers M0, M1, M2, M3, each made of Dnode 1, Dnode 2, a Switch and a Config input; one part of the ring computes the 1D DCT of the 4 first lines, the other part the 1D DCT of the 4 last lines.]

Overall performances

[Figure: 8x8 block of coefficients z'_{0,0} ... z'_{7,7}, filled progressively.]

5 cycles : 2 partial transforms
20 cycles : 1 line, 8 partial transforms
80 cycles (M0, M1) : 4 lines, 32 partial transforms
80 cycles (M2, M3) : 8 columns, 64 transforms

2D DCT on 8 points : 160 CYCLES
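The cycle counts quoted for the DCT follow from simple arithmetic, sketched below. The constants come from the slides (2 coefficients every 5 cycles, 8 coefficients per line); the assumption that the 2D transform consists of two phases of equal cost (lines, then columns), with 4 lines processed sequentially per phase on each half of the ring, is an interpretation and not something the slides state explicitly. All names are illustrative.

```python
CYCLES_PER_PASS = 5   # from the slides: 2 coefficients issued every 5 cycles
COEFFS_PER_PASS = 2


def cycles_1d(points=8):
    """Cycles needed for the 8 coefficients of one line."""
    passes = points // COEFFS_PER_PASS
    return passes * CYCLES_PER_PASS


def cycles_phase(points=8, parallel_groups=2):
    """Cycles for one phase, assuming the lines are split over parallel groups of layers."""
    lines_per_group = points // parallel_groups
    return lines_per_group * cycles_1d(points)


def cycles_2d(points=8):
    """Assumed model: a line phase followed by a column phase of the same cost."""
    return 2 * cycles_phase(points)


print(cycles_1d(), cycles_phase(), cycles_2d())
```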

Comparisons : execution time (cycles)

VLIW : CPU64, TM1000, TI 320C60
Superscalar : Pentium I, Pentium II, NEC V830

[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830, grouped as VLIW and Superscalar.]