Techniques D'optimisation Architecturale

Camille Diou diou@univ-metz.fr
Master EAII Sp. RSEE
Microprocessor basics
[Figure: a CONTROLLER (state machine) drives a DATAPATH made of a register file (t1, t2, t3, A, B, C) and an Arithmetic and Logic Unit (ALU); tristate components (inputs/outputs) connect them to a shared BUS.]
Computation example: $S = Ax^2 + By + C$
The CONTROLLER issues one register transfer per cycle to the DATAPATH (register file t1, t2, t3, A, B, C and ALU):
#CYCLES: 1   t1 <- x
#CYCLES: 2   t2 <- y
#CYCLES: 3   t3 <- A.t1     (ALU: X)
#CYCLES: 4   t3 <- t3.t1    (ALU: X)
#CYCLES: 5   t2 <- B.t2     (ALU: X)
#CYCLES: 6   t3 <- t2+t3    (ALU: +)
#CYCLES: 7   out <- t3+C    (ALU: +)
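The register-transfer sequence of the computation example can be replayed in software. The following is only a minimal sketch: the function name `run_datapath` and the input names `in_x` and `in_y` are hypothetical, and the model assumes that each transfer takes exactly one cycle, as the cycle counter of the slides suggests.

```python
def run_datapath(x, y, A, B, C):
    """Replay the seven register transfers of the example, one per cycle."""
    regs = {"t1": 0, "t2": 0, "t3": 0, "A": A, "B": B, "C": C}
    inputs = {"in_x": x, "in_y": y}  # external inputs (hypothetical names)
    # Each entry: (destination, operation, source operands).
    program = [
        ("t1", "load", ("in_x",)),
        ("t2", "load", ("in_y",)),
        ("t3", "mul", ("A", "t1")),
        ("t3", "mul", ("t3", "t1")),
        ("t2", "mul", ("B", "t2")),
        ("t3", "add", ("t2", "t3")),
        ("out", "add", ("t3", "C")),
    ]
    cycles = 0
    for dest, op, sources in program:
        values = [inputs[s] if s in inputs else regs[s] for s in sources]
        if op == "load":
            result = values[0]
        elif op == "mul":
            result = values[0] * values[1]
        else:  # "add"
            result = values[0] + values[1]
        regs[dest] = result
        cycles += 1
    return regs["out"], cycles


result, cycles = run_datapath(x=2, y=3, A=4, B=5, C=6)
print(result, cycles)
```

Because t3 is multiplied by t1 twice, the sequence evaluates A.x.x + B.y + C in seven cycles.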
Execution principle
START -> Fetch Cycle (fetch next instruction) -> Execute Cycle (execute instruction) -> back to the Fetch Cycle, until HALT
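A single accumulator machine
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register
[Figure: data flow and control signals of the machine. The memory is 16 bits wide, 16K words (14-bit address). The load path feeds memory data to ALU input B, the store path writes the accumulator (ACC, ALU input A) back to memory, and the instruction path loads IR. The IR opcode goes to the FSM, which generates the ALU function controls and the control signals (LD, incr, Branch); the IR address operand and the PC (with its incrementer) drive MAR.]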
Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand.
AC := AC <operation> Memory(Address)
Instruction format:
bits 15-14: Opcode
bits 13-0: Address
Opcode:
00: Load
01: Store
10: Add
11: Branch
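The instruction format above can be illustrated with a small decoder. This is a sketch only; the names `OPCODES` and `decode` are hypothetical, and the word decoded in the usage line is the instruction word shown in the slides.

```python
# Opcode table taken from the instruction format (bits 15-14).
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(instruction):
    """Split a 16-bit instruction word into (opcode name, 14-bit address)."""
    opcode = (instruction >> 14) & 0b11   # bits 15-14
    address = instruction & 0x3FFF        # bits 13-0
    return OPCODES[opcode], address


name, address = decode(0b1000110100110011)
print(name, format(address, "014b"))
```

The two most significant bits select the operation and the remaining 14 bits address one of the 2^14 memory words.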
1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR
2. Instruction decode:
- Opcode bits to FSM (ADD)
- Rest of the bits is the operand address
[Figure: PC = MAR = 10110100110011; the memory returns the instruction 1000110100110011, which is loaded into IR.]
3. Operand fetch:
- IR<address> -> MAR
- Read data from memory
4. Instruction execute:
- Memory to ALU B
- AC to ALU A
- ALU Add
- S to AC
[Figure: MAR = 00110100110011 (address field of IR); the memory returns 0101010101110001 on ALU input B; ACC = 0011001101110110 on ALU input A; the sum S = 1000100011100111 is written to AC.]
5. Housekeeping:
- Increment PC
[Figure: PC goes from 10110100110011 to 10110100110100.]
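The five steps can be put together in a small simulator. This is a sketch under stated assumptions: the class name `AccumulatorMachine` is hypothetical, Branch is modelled as an unconditional jump (the slides do not detail it), and the memory is a plain dictionary. The values in the usage example are the ones shown in the slides.

```python
MASK16 = 0xFFFF  # 16-bit data words
MASK14 = 0x3FFF  # 14-bit addresses


class AccumulatorMachine:
    def __init__(self, memory, pc=0, acc=0):
        self.memory = memory  # dict: address -> 16-bit word
        self.pc = pc
        self.acc = acc
        self.mar = 0
        self.ir = 0

    def step(self):
        # 1. Instruction fetch: PC -> MAR, read memory, load IR.
        self.mar = self.pc
        self.ir = self.memory[self.mar]
        # 2. Instruction decode: opcode to the FSM, rest is the address.
        opcode = (self.ir >> 14) & 0b11
        address = self.ir & MASK14
        # 3. Operand fetch: IR<address> -> MAR.
        self.mar = address
        # 4. Instruction execute.
        if opcode == 0b00:      # Load
            self.acc = self.memory[self.mar]
        elif opcode == 0b01:    # Store
            self.memory[self.mar] = self.acc
        elif opcode == 0b10:    # Add
            self.acc = (self.acc + self.memory[self.mar]) & MASK16
        else:                   # Branch (assumed unconditional)
            self.pc = address
            return
        # 5. Housekeeping: increment PC.
        self.pc = (self.pc + 1) & MASK14


memory = {
    0b10110100110011: 0b1000110100110011,  # Add instruction
    0b00110100110011: 0b0101010101110001,  # operand
}
machine = AccumulatorMachine(memory, pc=0b10110100110011,
                             acc=0b0011001101110110)
machine.step()
print(format(machine.acc, "016b"), format(machine.pc, "014b"))
```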
A simple microprocessor: Architecture
[Figure: datapath with a 16x16 register file, connections to the controller (FSM), a data bus to/from memory and an address bus to memory.]
A simple microprocessor: Instruction format
[Table: Instruction | Instruction format | Action]
A simple microprocessor: test program
What will it do?
0000 7C0A ; LOAD RC, #A
0001 8C00 ; ...
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...
Compiler dependency detection for ILP
Detect data dependencies at compile time.
Examples:
c[i]=a[i]+b[i]; d[i]=a[i]+c[j];    potential dependency: c[i] might be c[j]
c[1]=a[i]+b[i]; d[i]=a[i]+c[2];    no dependency: c[1] is never c[2]
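The two examples follow a simple rule that a compiler can apply conservatively. The sketch below is an illustration only, not a real dependence analysis: the function `may_depend` is hypothetical, and an index is modelled either as an integer constant or as a symbolic name given as a string.

```python
def may_depend(write_index, read_index):
    """Return True if a write to array[write_index] may be read back
    through array[read_index].

    Indices are int constants or str symbolic names (hypothetical model).
    """
    if isinstance(write_index, int) and isinstance(read_index, int):
        # Two constants: the dependency is known exactly at compile time.
        return write_index == read_index
    # At least one symbolic index: the compiler must assume a dependency.
    return True


print(may_depend("i", "j"))  # c[i] might be c[j]
print(may_depend(1, 2))      # c[1] is never c[2]
```

Only the constant case lets the compiler prove independence; any symbolic index forces the conservative answer.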
Systolic ring
Reconfigurable computing: Instruction level parallelism (ILP)
Superscalar processors must find the dataflow graph at run time.
Reconfigurable architectures construct the dataflow graph at compile time:
- No FU limitations
- No control logic overhead
- No window size limitations
Reconfigurable computing: Instruction level parallelism (ILP)
General purpose computer:
add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2
[Figure: RC scheme. The same operations are laid out as a dataflow graph: r1, r2 and r3 feed the first three operators, whose results r4, r5 and r6 feed the two final adders producing r1 and r2.]
Question: what is the advantage of RC over superscalar?
Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.
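The dataflow graph of the five-instruction example can be extracted with a few lines of code. This is a sketch under assumptions: the operand order is taken to be "source, source, destination", only true (read-after-write) dependencies are tracked, and the function name `dataflow_levels` is hypothetical.

```python
def dataflow_levels(program):
    """Return, for each instruction, the earliest level at which it can run.

    program: list of (op, src1, src2, dst) tuples.
    Instructions on the same level have no dependency between them.
    """
    last_writer = {}  # register -> index of the instruction that wrote it last
    levels = []
    for index, (op, src1, src2, dst) in enumerate(program):
        level = 0
        for src in (src1, src2):
            if src in last_writer:
                # Must wait for the producer of this operand.
                level = max(level, levels[last_writer[src]] + 1)
        levels.append(level)
        last_writer[dst] = index
    return levels


program = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]
print(dataflow_levels(program))
```

Under these assumptions the first three instructions can run in parallel and the last two form a second stage, which is the structure a reconfigurable fabric would be configured with at compile time.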
Reconfigurable computing: Why now?
- Increasing number of transistors
- Complexity and cost of chip design increase fast
- Current computing demands are RC friendly: desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption and filters (dataflow-oriented applications)
RA versus microprocessors
RA is less flexible (like a VLIW with fixed instructions)
but
- RA provides more (customized) computation elements
- RA can decrease memory traffic
- RA can be tailored for specific algorithms and data types
RA will not replace µP, but complement them
Systolic computing: definition
"A set of simple processing elements with regular and local connections which takes external inputs and processes them in a predetermined manner in a determined fashion" (H.T. Kung)
Systolic computing: characteristics of best RC design
- Simple PE
- Regular and local interconnect
- Pipeline between PEs
- I/O at boundary
Coarse grain RA model
- In abstract: instructions configure both PE and interconnect every cycle
- In reality: instruction bandwidth / memory too high, so COMPROMISE
Communications
Relationship of communication among processors:
- Shared clock (Pipelined)
- Shared registers (VLIW)
- Shared memory (SMM)
- Shared network
Reconfigurable computing
[Figure: the program RAM holds the whole program; the actual available hardware holds the instructions currently in hardware, while the other instructions are paged out.]
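Finite impulse response (FIR) filter
$$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$$
[Figure: filter structure. The input $x_n$ goes through a chain of unit delays ($Z^{-1}$); the taps are weighted by $a_N, a_{N-1}, a_{N-2}, \dots, a_1, a_0$ and summed to give $y_n$.]
3-coefficient filter:
$$y(n) = a_0\,x(n-1) + a_1\,x(n-2) + a_2\,x(n-3)$$
[Figure: same structure with the taps $a_2$, $a_1$, $a_0$ and three unit delays.]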
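As a reference point for the hardware mappings, the FIR equation can be written as a direct sequential loop. This is a minimal sketch: the function name `fir` is hypothetical and samples before the start of the sequence are assumed to be zero.

```python
def fir(samples, coeffs):
    """Direct form of y(n) = sum_i coeffs[i] * x(n - i - 1).

    Samples with a negative index are taken as zero (assumption).
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            k = n - i - 1
            if k >= 0:
                acc += a * samples[k]  # one multiply-accumulate (MAC)
        outputs.append(acc)
    return outputs


# 3-coefficient filter with a0 = 1, a1 = 10, a2 = 100.
print(fir([1, 2, 3, 4], [1, 10, 100]))
```

Each output costs N multiply-accumulate operations on a sequential machine, which is the work a systolic implementation spreads over its processing elements.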
Systolic FIR implementation
(MAC unit)
Optimize outer loop, preload-repeated value
Optimize outer loop, broadcast common value
Optimize outer loop, retime to eliminate broadcast
The Systolic Ring
- Coarse grain architecture
- Multi-mode dynamical reconfiguration
- Scalable, bidimensional array
- VHDL design
- Designed for SoC integration
Constitution
Dnode: word-level processing unit
Optimized datapath (16 bits):
- Register file (4x16 bits)
- Hardwired ALU and multiplier
Features:
- Complex computations in local mode (FIR, IIR, WT)
- Low silicon area (0.07 mm², 0.18 µm CMOS process)
- Single-cycle operations (e.g. MAC + register load)
[Figure: Dnode with its instruction input (inst.), register file (Reg FILE) and ALU + MULT.]
Local controller: dynamical reconfiguration at the Dnode level
- 8 configuration registers
- 3 different run modes
- 1 programming mode
[Figure: local controller with the configuration registers reg0 to reg7, a decoder and the ck, mode, wait, enex and inhib signals; multiplexers select the Dnode inputs among In(1,2), fifo(1,2), bus and Rp(i,j) (i=1~4, j=1~2); the selected instruction (inst.) drives the register file and ALU + MULT, which produce out.]
Programming mode
[Figure sequence: on successive clock cycles, Instruction 0, Instruction 1, Instruction 2 and Instruction 3 are written into the configuration registers (reg0, reg1, reg2, reg3, ...) of the local controller.]
Run-mode 1: Fixed
[Figure sequence: the same instruction (Instruction 0) is applied to the Dnode on every clock cycle.]
Run-mode 2: Dynamic (one-time or loop)
[Figure sequence: a different instruction (Instruction 1, Instruction 2, Instruction 3, ...) is applied to the Dnode on each clock cycle.]
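The programming mode and the two run modes can be mimicked by a very small behavioural model. This is a sketch only: the class `LocalController` and its methods are hypothetical, and the dynamic mode is assumed to step through all loaded instructions in order, starting from the first one.

```python
class LocalController:
    """Behavioural sketch of a Dnode local controller (hypothetical model)."""

    MAX_INSTRUCTIONS = 8  # 8 configuration registers

    def __init__(self):
        self.registers = []

    def program(self, instructions):
        """Programming mode: one instruction is loaded per clock cycle."""
        if len(instructions) > self.MAX_INSTRUCTIONS:
            raise ValueError("at most 8 instructions fit in the registers")
        self.registers = list(instructions)

    def run_fixed(self, cycles):
        """Run-mode 1: the first instruction is issued on every cycle."""
        return [self.registers[0]] * cycles

    def run_dynamic(self, cycles, loop=True):
        """Run-mode 2: a different instruction on each cycle,
        either one-time or looping over the loaded program."""
        issued = []
        for cycle in range(cycles):
            if loop:
                issued.append(self.registers[cycle % len(self.registers)])
            elif cycle < len(self.registers):
                issued.append(self.registers[cycle])
        return issued


controller = LocalController()
controller.program(["i0", "i1", "i2", "i3"])
print(controller.run_fixed(3))
print(controller.run_dynamic(6))
print(controller.run_dynamic(6, loop=False))
```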
Scalable
Array structure:
- Unidirectional communications between neighbours
- Hard to implement a datapath with a greater pipeline depth than the array
- Hard to implement recursive operations
[Figure: array of configurable processing blocks and switches, with a main unidirectional dataflow from the INPUTS to the OUTPUTS and a BUS as shared resource.]
Use of a ring structure
[Figure: ring structure with INPUTS and OUTPUTS.]
Use of a bi-dataflow structure
[Figure: ring structure with a forward dataflow and a reverse dataflow between INPUTS and OUTPUTS.]
Forward dataflow: Systolic Ring architecture
- No complex data routing problems (crossbars)
- Unidirectional data transfers between adjacent layers (pipeline)
- Performance increases linearly with the number of Dnodes
- Peak power: 3200 MIPS @ 200 MHz for a 16-Dnode realization
Switch components:
- Direct FIFO connection for data injection
- BUS connection for RISC communication
- Full connectivity between 2 Dnode layers
Dnode modes:
- Local mode: stand-alone
- Global mode: FPGA-like
[Figure: ring of Dnode layers (layer n-1, layer n, layer n+1) separated by switches with I/O ports, traversed by the forward dataflow, with the configuration controller.]
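The quoted peak figure is consistent with each Dnode completing one operation per clock cycle. The helper below is a sketch of that arithmetic; the function name `peak_mips` and the one-operation-per-cycle default are assumptions, not figures stated on the slide.

```python
def peak_mips(dnodes, clock_mhz, ops_per_cycle=1):
    """Peak throughput in MIPS, assuming every Dnode issues
    `ops_per_cycle` operations on each clock cycle."""
    return dnodes * clock_mhz * ops_per_cycle


print(peak_mips(16, 200))  # 16-Dnode version at 200 MHz
```

With this model the throughput grows linearly with the number of Dnodes, as the slide states.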
Reverse dataflow
- Each switch writes computed data into its own feedback pipeline
- Each switch has read ports on the other switches' pipelines
- Easy implementation of various recursive algorithms (IIR, WT)
[Figure: ring of Dnodes and switches with the feedback pipelines.]
Two-level dynamically reconfigurable architecture:
Global mode (first level):
- The program which manages the configuration runs on the RISC processor (1)
- The configuration of an entire cluster can be modified at each clock cycle (2)
- The operating layer computes the data coming from the host processor (3)
Local mode (second level):
- Each Dnode runs its own program of up to 8 instructions
[Figure: OPERATING layer (Dnodes with register file and ALU + MULT) receiving DATA from the host µP; CONFIGURATION layer (RAM) receiving CONFIG; configuration controller running the MANAGEMENT CODE.]
ST* CMOS process, 0.25 µm and 0.18 µm
8-Dnode version
Features:
- Parametrizable core (number of Dnodes)
- Good performance / cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system): 1600 MIPS (PII @ 450 MHz: 400 MIPS), 3 Gb/s bandwidth
- Low Dnode area
- Possible to realize 128-Dnode versions
- Suited as an IP core for SoC

                  Area        Frequency
Ring-8 0.25 µm    0.9 mm2     150 MHz
Ring-8 0.18 µm    0.7 mm2     200 MHz
Dnode 0.18 µm     0.04 mm2    200 MHz

*: ST: STMicroelectronics
Assembly-level programming
Each line holds a RISC instruction (r:), a layer selection (M1:, M2:) and the Dnode instructions (N1:, N2:):
0000 r:ldl(0,8) M1: N1:clr N2:clr
0001 r:ldl(1,2) M2: N1:clr N2:clr
0002 r:dec(0,0) M1: N1:add(fifo1,fifo1) N2:sub(fifo1,fifo1)
0003 r:jnz(1) M2: N1:mac(in1) N2:mac(in2)
0004 r:halt
[Figure: tool flow. The assembler produces File1.bin for the prototype (RAM, FPGA) and File2.m for the simulator testbench (RAM, Ring-8).]
FIR filter: edge detection
Convolution mask: [ -1 1 0 ]
$y_n = x_n - x_{n-1}$
Assembly code:
0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0) M1: N2:sub(fifo,fifo)
[Figure: timing diagrams and tool flow (assembler, File2.m, simulator testbench with RAM and Ring-8), with the input image and the output image.]
Polynomial calculus
$$P(x) = a\,x + b\,x^2 + c\,x^3$$
Step 1: reg0 <- x              /* load reg0,x */
        reg1 <- x.x            /* load reg1,x^2 */
Step 2: reg2 <- reg0.reg1      /* load reg2,x^3 */
Step 3: ACC <- a.reg0          /* load ACC,a.x */
Step 4: ACC <- b.reg1 + ACC    /* load ACC,a.x+b.x^2 */
Step 5: ACC <- c.reg2 + ACC    /* load ACC,a.x+b.x^2+c.x^3 */
[Figure sequence: the ALU + MULT successively receives (a, x), (b, x^2) and (c, x^3); ACC holds a.x, then a.x+b.x^2, then a.x+b.x^2+c.x^3.]
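The five steps of the polynomial evaluation map directly onto register assignments. The sketch below mirrors them one by one; the function name `polynomial_on_dnode` is hypothetical, and plain Python integers stand in for the 16-bit registers, so overflow is not modelled.

```python
def polynomial_on_dnode(x, a, b, c):
    """Evaluate P(x) = a.x + b.x^2 + c.x^3 with the register sequence
    of the slides (reg0, reg1, reg2 and an accumulator)."""
    reg0 = x              # step 1: load reg0, x
    reg1 = x * x          # step 1: load reg1, x^2
    reg2 = reg0 * reg1    # step 2: load reg2, x^3
    acc = a * reg0        # step 3: ACC = a.x
    acc = b * reg1 + acc  # step 4: ACC = a.x + b.x^2 (MAC)
    acc = c * reg2 + acc  # step 5: ACC = a.x + b.x^2 + c.x^3 (MAC)
    return acc


print(polynomial_on_dnode(x=2, a=1, b=2, c=3))
```

Steps 4 and 5 are multiply-accumulate operations, which is why a Dnode with a hardwired multiplier and ALU can execute each of them in a single cycle.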
$$y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i-1)$$
[Figure: FIR structure with the input $x_n$, unit delays ($Z^{-1}$), the coefficients $a_0, a_1, a_2, \dots, a_{N-1}, a_N$ and the output $y_n$.]
3-coefficient case:
$$y(n) = a_0\,x(n-1) + a_1\,x(n-2) + a_2\,x(n-3)$$
FIR implementation
- 3 Dnodes / layer architecture
- Pipelined implementation
- Samples are injected through dedicated lines
- Coefficients loaded during first cycle

Contents of the pipeline (Dnode outputs, with feedback), cycle by cycle:
Cycle   Samples in    Dnode 1 out   Dnode 2 out (MAC)   Dnode 3 out (MAC)
1       x0, x0, x0    (coefficients a2, a1, a0 loaded)
2       x1, x1, x1    a2.x0
3       x2, x2, x2    a2.x1         a2.x0+a1.x1
4       x3, x3, x3    a2.x2         a2.x1+a1.x2         a2.x0+a1.x1+a0.x2
5       x4, x4, x4    a2.x3         a2.x2+a1.x3         a2.x1+a1.x2+a0.x3
The same pattern continues on the following cycles: one new sample in, one new output out.
OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE
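The cycle-by-cycle behaviour of the three-Dnode pipeline can be reproduced with three registers updated in lock step. This is a behavioural sketch under assumptions: the function name `systolic_fir3` is hypothetical, the same sample is broadcast to the three Dnodes on each cycle, and each Dnode adds its product to the value its predecessor held on the previous cycle.

```python
def systolic_fir3(samples, a0, a1, a2):
    """Simulate a 3-stage pipelined FIR, one sample per cycle.

    Returns the values leaving the third Dnode once the pipeline is full.
    """
    r1 = r2 = r3 = None  # outputs of Dnode 1, 2, 3 (None = not valid yet)
    outputs = []
    for x in samples:
        # All updates use the register values of the previous cycle.
        new_r3 = None if r2 is None else r2 + a0 * x
        new_r2 = None if r1 is None else r1 + a1 * x
        new_r1 = a2 * x
        r1, r2, r3 = new_r1, new_r2, new_r3
        if r3 is not None:
            outputs.append(r3)
    return outputs


# a0 = 1, a1 = 10, a2 = 100 makes each term easy to read in the result.
print(systolic_fir3([1, 2, 3, 4], a0=1, a1=10, a2=100))
```

After a latency of three samples the pipeline delivers one result per injected sample, for example a2.x0 + a1.x1 + a0.x2 first and a2.x1 + a1.x2 + a0.x3 next.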
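$$y(n) = a_0\,x(n-1) + a_1\,x(n-2) + a_2\,x(n-3) + a_3\,x(n-4) + a_4\,x(n-5) + a_5\,x(n-6)$$
[Figure: two layers of MAC Dnodes. The first layer holds $a_2$, $a_1$, $a_0$ and the second layer $a_5$, $a_4$, $a_3$; the samples $x_n$ feed both layers and the inter-layer feedback links them to produce $y_n$.]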
Discrete Cosine Transform
- Usually bidimensional, 8x8 points DCT
- Very demanding algorithm
Compression: Original image -> DCT -> DCT coeff. -> Quantization -> Quantized coeff. -> Coding -> Compressed image
Decompression: Compressed image -> Decoding -> Inverse quantization -> iDCT -> Decompressed image
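DCT algorithm
Direct transform:
$$z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N}, \qquad k = 0, 1, \dots, N-1$$
Inverse transform:
$$x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\frac{(2n+1)k\pi}{2N}, \qquad n = 0, 1, \dots, N-1$$
with
$$\alpha(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{else} \end{cases}$$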
Image
- 64x64 points
- 8x8 pixel blocks
- 16-bit coded image
[Figure: initial 64x64 image divided into 64 blocks of 8x8 pixels, $x_{0,0} \dots x_{7,7}$.]
Implementation
- Matrix implementation
- Even / odd frequency decomposition of the DCT algorithm
Even part: (z0, z2, z4, z6) = [4x4 coefficient matrix] . (x0+x7, x1+x6, x2+x5, x3+x4)
Odd part: (z1, z3, z5, z7) = [4x4 coefficient matrix] . (x0-x7, x1-x6, x2-x5, x3-x4)
The matrix entries are built from the constants cos(π/4), cos(π/8), sin(π/8), cos(π/16), cos(3π/16), sin(3π/16) and sin(π/16).
In matrix form: $z = \sqrt{2/N}\; T(N)\, x$
Coefficients coding
Fixed point
$$z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N}$$
Example: n=6
cos(π/4)   = 0000000000010110    -cos(π/4)   = 1111111111101010
cos(π/8)   = 0000000000011101    -cos(π/8)   = 1111111111100011
sin(π/8)   = 0000000000001100    -sin(π/8)   = 1111111111110100
cos(π/16)  = 0000000000011111    -cos(π/16)  = 1111111111100001
cos(3π/16) = 0000000000011010    -cos(3π/16) = 1111111111100101
sin(3π/16) = 0000000000010001    -sin(3π/16) = 1111111111101111
sin(π/16)  = 0000000000000110    -sin(π/16)  = 1111111111111010
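The listed positive codes match the constants scaled by 2^5 and truncated, stored as 16-bit two's complement words. The sketch below reproduces that encoding; the function name `to_fixed`, the 5 fractional bits and the truncation toward zero are assumptions inferred from the listed values, not statements from the slide.

```python
import math


def to_fixed(value, frac_bits=5, width=16):
    """Encode `value` as a two's complement bit string of `width` bits
    with `frac_bits` fractional bits (truncation toward zero)."""
    scaled = int(value * (1 << frac_bits))  # int() truncates toward zero
    return format(scaled & ((1 << width) - 1), "0{}b".format(width))


print(to_fixed(math.cos(math.pi / 4)))
print(to_fixed(-math.cos(math.pi / 4)))
```

Masking with 2^16 - 1 is what turns a negative Python integer into its 16-bit two's complement pattern.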
Implementation:
- ADD and SUB on the first Dnode layer
- Multiply-accumulate operations (MAC) on the second Dnode layer
[Figure: in the first layer, Dnode1 computes $x_n + x_{(N-1)-n}$ and Dnode2 computes $x_n - x_{(N-1)-n}$; in the second layer, Dnode1 (MAC) produces $z_0, z_2, z_4, z_6$ and Dnode2 (MAC) produces $z_1, z_3, z_5, z_7$; the coefficients are supplied with the configuration (Config).]
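The two-layer scheme (add/sub first, MAC second) can be checked numerically against the direct DCT formula. This is a sketch under assumptions: the normalization is taken to be sqrt(2/N) with a factor of 1/sqrt(2) for k = 0, the names `alpha`, `dct_direct` and `dct_even_odd` are hypothetical, and floating point stands in for the fixed-point hardware arithmetic.

```python
import math


def alpha(k):
    # Normalization factor (assumed orthonormal form).
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct_direct(x):
    """Direct 1D DCT: N multiply-accumulates per coefficient."""
    N = len(x)
    return [
        math.sqrt(2 / N) * alpha(k)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
              for n in range(N))
        for k in range(N)
    ]


def dct_even_odd(x):
    """Even/odd decomposition: N/2 multiply-accumulates per coefficient.

    Requires an even number of samples.
    """
    N = len(x)
    half = N // 2
    # First layer: ADD and SUB of mirrored samples.
    sums = [x[n] + x[N - 1 - n] for n in range(half)]
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]
    z = []
    for k in range(N):
        # Second layer: even k use the sums, odd k use the differences.
        inputs = sums if k % 2 == 0 else diffs
        acc = 0.0
        for n in range(half):
            acc += inputs[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        z.append(math.sqrt(2 / N) * alpha(k) * acc)
    return z


block = [1, 2, 3, 4, 5, 6, 7, 8]
print(max(abs(p - q) for p, q in zip(dct_direct(block), dct_even_odd(block))))
```

The decomposition works because the cosine weight of sample N-1-n equals (-1)^k times the weight of sample n, so each coefficient needs only four MAC operations for an 8-point transform.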
Computing
t=0 (n=0): the first layer receives x0 and x7. Config: M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
t=1 (n=1): the first layer receives x1 and x6; the second layer accumulates x0+x7 and x0-x7. Config: M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
t=2 (n=2): the first layer receives x2 and x5; the second layer accumulates x1+x6 and x1-x6
t=3 (n=3): the first layer receives x3 and x4; the second layer accumulates x2+x5 and x2-x5
t=4: the second layer accumulates x3+x4 and x3-x4
t=5 (n=0): the second layer outputs z0 and z1 and is cleared. Config: M1: N1:clear N2:clear
Results
- 2 transforms issued every 5 machine cycles
- Clear performed during addition
- 20 cycles for 8 samples
Achievable parallelism on an 8-Dnode structure: Ring-8
[Figure: four layers M0, M1, M2 and M3, each with Dnode 1, Dnode 2, a switch and its Config; one part of the ring computes the 1D DCT of the 4 first lines and the other part the 1D DCT of the 4 last lines.]
Overall performances (8x8 block $z'_{0,0} \dots z'_{7,7}$)
- 5 cycles: 2 partial transforms
- 20 cycles: 1 line, 8 partial transforms
- 80 cycles: 4 lines, 32 partial transforms (M0, M1)
- 80 cycles: 8 columns, 64 transforms (M2, M3)
2D DCT on 8x8 points: 160 CYCLES
Comparisons: execution time (cycles)
- VLIW: CPU64, TM1000, TI 320C60
- Superscalar: Pentium I, Pentium II, NEC V830
[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830.]
