Professional Documents
Culture Documents
DFPGA - Ch03 - Pipelining and Parallel Processing
DFPGA - Ch03 - Pipelining and Parallel Processing
BMĐT
GV: Nguyễn Lý Thiên Trường
Chapter 3
Pipelining and Parallel Processing
(Tạo đường ống và xử lý song song)
TLTK:
1. Các slide từ sách của Prof. Parhi
2. Slide của Prof. Viktor Öwall
3. Slide của Thầy Hồ Trung Mỹ
1
Outline
Introduction
Pipelining of FIR Digital Filters
Parallel Processing
Pipelining and Parallel Processing for Low Power
Pipelining for Lower Power
Parallel Processing for Lower Power
Combining Pipelining and Parallel Processing for Lower
Power
2
Basic Ideas
Assuming you’re got:
One washer (takes 30 minutes)
One drier (take 40 minutes)
One folder (take 20 minutes)
P2 b1 b2 b3 P2 a2 b2 c2
P3 c1 c2 c3 P3 a3 b3 c3
time time
Less inter-processor communication More inter-processor communication
Complicated processor hardware Simpler processor hardware
P1 P1
P2 P2
P3 P3
time time
M-stage (or M-level) pipelining
Block size L: the number of
needs (M-1) more delay elements
inputs processed in a clock cycle.
in any path from input to output.
7
Pipelining
3.1 Introduction
Comes from the idea of a water pipe: continue sending water
without waiting the water in the pipe to be out
Continuous sending
Tsampling = TCLK
Critical path
The critical path (or the minimum time required for processing a new
sample) is limited by 1 multiply and 2 add times. Thus, the “sampling
period” (or the “sampling frequency” ) is given by:
1
Tsampling Tclock TM 2TA or f sampling f clock
TM 2TA
where TM be the delay for multiplier and TA be the delay of adder
If some real-time application requires a faster input rate (sampling
rate), then this direct-form structure cannot be used! In this case, the
critical path can be reduced by either pipelining or parallel 10
processing.
Pipelining processing
Introduce pipelining latches along the data-path
11
Parallel processing
Duplicate the hardware
12
Example 1: A 3-tap FIR Filter
Consider a 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)
Critical path
b.x(n- c.x(n-2) 3
1)
a.x(n) a.x(n) + b.x(n-1)
1 2
Clock Input Node 1 Node 2 Node 3 Output
Tcritical = TM + 2TA
3.2 Pipelining of FIR Digital Filters
The pipelined implementation: By introducing 2 additional
latches in Example 1, Tcritical: TM+2TA TM+TA.
Critical path
x(n-1) x(n-3)
b.x(n-1) c.x(n-3)
a.x(n)
Node 1 a.x(n-1)+b.x(n-2)
15
Example of Feed-forward cutsets
max{ , }
16
Pipelining: How to ?
17
Pipelining properties
M-level pipelining needs M-1 more delay elements in any path
from input to output.
Increase in speed with the following penalty
Increase in system latency
Increase in the number of latches
Pipelining latches can only be placed across any feed-forward
cutset of the graph
Clock period limitation: critical path may be between
An input and a latch x(n) y(n)
+ D +
A latch and an output b
Two Latches X
An input and an output D
a z(n)
Tclock Tcritical X +
18
Transposition Theorem
The critical path of the original 3-tap FIR filter can be reduced
without introducing pipelining latches by transposing the structure
Reversing the direction of all edges in a given SFG (Signal Flow
Graph) and interchanging the input and output ports preserve the
functionality of the system.
data-
broadcast
Tcritical = TM + 2TA
Tcritical = TM + TA 19
Transposed Form 4-tap FIR filter
21
Fine-grain pipelining
Further breakdown the functional units by pipelining
to increase performance.
E.g. breakdown each multiplier into 2 small units
Let TM = 10 u.t. and TA = 2 u.t.. If the multiplier is
broken into 2 smaller units with processing time of 6
u.t. and 4 u.t., respectively, then the desired clock
period can be achieved as (4 + 2) = 6 u.t.
22
3.3 Parallel Processing
23
Parallel Processing (cont’d)
24
Parallel Processing (cont’d)
= x(2(k-1))
= x(10(k-1))
25
Example of Parallel 3-Tap FIR Filter with block size L=2
26
Parallel Processing of 3-Tap FIR Filter
27
Parallel processing architecture for a 3-tap FIR filter with
block size L=3
28
Parallel Processing (cont’d)
29
Complete parallel processing system with
block size L=4
Clock Period=T/4
x(n) Serial-to-Parallel
Sampling Converter
Period=T/4
x(4k+3)
x(4k+2)
x(4k+1)
x(4k)
y(4k+3)
y(4k+2) Parallel-to-Serial
Clock Period =T MIMO y(n)
y(4k+1) Converter
System
y(4k)
30
Complete parallel processing system with
block size L=4
31
Why Parallel Processing?
Parallel leads to duplicating many copies of hardware, and the
cost increases! Why use?
There is fundamental limit to pipelining imposed by the
input/output (I/O) bottlenecks, referred to as Communication bounded.
Communication bounded
Communication time (input/output
pad delay + wire-delay) is larger
than that of computation delay.
Pipelining can only be used to Chip1 Chip2
reduce the critical path computation
delay. For communication-bound Tcommunication
o/p i/p
system, this cannot help. pad Parallel pad
So only parallel processing can be Transmission
used to improve the performance. Tcomputation
Further improvement can be
achieved by combining pipelining
and parallel processing. 32
Why Parallel Processing?
33
Example: Combined fine-grain pipelining and
parallel processing for 3-tap FIR filter
34
3.4 Pipelining and Parallel Processing for Low Power
37
Pipelining for Low Power
The technique of reducing power consumption is
to first use pipelining to reduce critical period by
pipelining and then restore it to the original level
by reducing the supply voltage.
Power dissipation in original sequential circuit is
given by
Pseq=CtotalV02 f
where f = and Tseq is the clock period of the
original sequential filter.
38
Pipelining for Low Power
Consider an M level pipelined system (i.e. having
M-1 latches). Assuming we divide the critical path
into equal segments.
Tseq
40
Pipelining for Low Power
We will choose a proper reduction factor β so that
the supply voltage β.Vo produces the same Tcritical
after pipelining.
Original delay:
Cch arg eVo
Tseq 2
k (Vo Vt )
After M stage pipelining:
Cch arg e
Vo
Tpipe M
2
k ( Vo Vt )
41
Pipelining for Low Power
We choose a proper reduction factor β so that the
supply voltage β.Vo produces the same Tcritical
after pipelining, it means
The new power consumption with pipelining:
Ppip=Ctotalb2V02 f=b2Pseq
C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)
Assume
Capacitance of multiplier CM is 5 times of that of an adder CA
Fine-grain pipelining is used, and Cm1= 3CA and Cm2 = 2CA
Vo = 5V and Vt = 0.6V
Find the power consumption reduction factor β2?
43
Example: Reduce Power by pipelining processing
x(n)
x(n) (6) (6) (6)
m1 m1 m1
C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)
Tseq 3Tseq
3Tseq
Sequential (critical path)
46
Parallel Processing for Low Power
47
Example: Reduce Power by parallel processing
Consider the following 4-tap FIR filters
Assumption:
CM = 8CA
TM = 8TA
both architectures
operate at the
sampling period of 9TA
Supply voltage = 3.3V
and Vt = 0.45V
Find the power
consumption
reduction factor β2?
48
Example: Reduce Power by parallel processing
Solution
(not feasible)
49
Example: Reduce Power by parallel processing
50
Example: Reduce Power by parallel processing
Solution
51
Combining Pipelining and Parallel Processing
for Low Power
52
Combining Pipelining and Parallel Processing
for Low Power