DFPGA - Ch03 - Pipelining and Parallel Processing

ĐHBK Tp HCM
BMĐT
GV: Nguyễn Lý Thiên Trường
Chapter 3
Pipelining and Parallel Processing
(Tạo đường ống và xử lý song song)
TLTK:
1. Các slide từ sách của Prof. Parhi
2. Slide của Prof. Viktor Öwall
3. Slide của Thầy Hồ Trung Mỹ
1
Outline
 Introduction
 Pipelining of FIR Digital Filters
 Parallel Processing
 Pipelining and Parallel Processing for Low Power
 Pipelining for Lower Power
 Parallel Processing for Lower Power
 Combining Pipelining and Parallel Processing for Lower
Power
2
Basic Ideas
 Assuming you’re got:
 One washer (takes 30 minutes)
 One drier (take 40 minutes)
 One folder (take 20 minutes)
 It takes 90 minutes to wash, dry, and fold 1 load of

laundry.
 How long does 3 loads take?
 Serial processing
9:00 10:30 12:00 1:30
PM PM AM AM
30 min 40 min 20 min 30 min 40 min 20 min 30 min 40 min 20 min
If each load is done sequentially, it takes 270 minutes.

 Parallel processing
9:00 10:30
PM PM
30 min 40 min 20 min
If loads are done in parallel, it takes 90 minutes.

 Serial processing
9:00 1:30
PM AM
30 min 40 min 20 min 30 min 40 min 20 min 30 min 40 min 20 min
If each load is done sequentially, it takes 270 minutes.

 Pipelined processing
9:00 11:50
PM PM
30 min 40 min 40 min 40 min 20 min
Start each load as

soon as possible.
It takes 170 minutes.
Basic Ideas
Serial processing
a1 a2 a3 a1 a2 a3 a1 a2 a3 ⋯
 Less inter-processor communication
 Simpler processor hardware
 Parallel processing  Pipelined processing

P1 a1 a2 a3 P1 a1 b1 c1
P2 b1 b2 b3 P2 a2 b2 c2
P3 c1 c2 c3 P3 a3 b3 c3
time time
 Less inter-processor communication  More inter-processor communication
 Complicated processor hardware  Simpler processor hardware
Colors: different types of operations performed

a, b, c: different data streams processed 6
Data Dependence
 Parallel processing  Pipelined processing will
requires NO data involve inter-processor
dependence between communication
processors
Stage 1 Stage 2 Stage 3 Stage …
(Level 1) (Level 2) (Level 3) (Level …)
P1 P1
P2 P2
P3 P3
time time
 M-stage (or M-level) pipelining
 Block size L: the number of
needs (M-1) more delay elements
inputs processed in a clock cycle.
in any path from input to output.
7
 Pipelining
3.1 Introduction
 Comes from the idea of a water pipe: continue sending water
without waiting the water in the pipe to be out
Continuous sending
Tsampling = TCLK
 Using pipeline latches to reduce the critical path delay

 Can exploit to increase the clock speed of sampling speed or
reduceProcessing
 Parallel power consumption at the same speed.
 Multiple output are computed in parallel in a clock period with
parallel hardware
 Effective sampling speed is increased with the level of parallelism
 Can be used for the reduction of power consumption
Tsampling
Continuous sending
TCLK
(Tsampling = ; L = Block size) 8
3.1 Introduction
 Pipelining
 Reduce the critical path
 Increase the clock speed or sampling speed
(fsampling= fclk)
 Reduce power consumption
 Parallel Processing
 Not reduce the critical path
 Not increase clock speed, but increase
sampling speed (fsampling = L.fclk)
 Reduce power consumption
9
Example 1: A 3-tap FIR Filter
Consider a 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)
Critical path
 The critical path (or the minimum time required for processing a new
sample) is limited by 1 multiply and 2 add times. Thus, the “sampling
period” (or the “sampling frequency” ) is given by:
1
Tsampling  Tclock  TM  2TA or f sampling  f clock 
TM  2TA
where TM be the delay for multiplier and TA be the delay of adder
 If some real-time application requires a faster input rate (sampling
rate), then this direct-form structure cannot be used! In this case, the
critical path can be reduced by either pipelining or parallel 10
processing.
Pipelining processing
 Introduce pipelining latches along the data-path
11
Parallel processing
 Duplicate the hardware
12
Example 1: A 3-tap FIR Filter
Consider a 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)
Critical path
b.x(n- c.x(n-2) 3
1)
a.x(n) a.x(n) + b.x(n-1)
1 2
Clock Input Node 1 Node 2 Node 3 Output
Tcritical = TM + 2TA
3.2 Pipelining of FIR Digital Filters
 The pipelined implementation: By introducing 2 additional
latches in Example 1, Tcritical: TM+2TA  TM+TA.
Critical path
x(n-1) x(n-3)
b.x(n-1) c.x(n-3)
a.x(n)
Node 1 a.x(n-1)+b.x(n-2)
Clock Input Node 1 Node 2 Node 3 Output

Pipelining: Definitions
15
Example of Feed-forward cutsets
max{ , }
16
Pipelining: How to ?
17
Pipelining properties
 M-level pipelining needs M-1 more delay elements in any path
from input to output.
 Increase in speed with the following penalty
 Increase in system latency
 Increase in the number of latches
 Pipelining latches can only be placed across any feed-forward
cutset of the graph
 Clock period limitation: critical path may be between
 An input and a latch x(n) y(n)
+ D +
 A latch and an output b
 Two Latches X
 An input and an output D
a z(n)
Tclock Tcritical X +
18
Transposition Theorem
 The critical path of the original 3-tap FIR filter can be reduced
without introducing pipelining latches by transposing the structure
 Reversing the direction of all edges in a given SFG (Signal Flow
Graph) and interchanging the input and output ports preserve the
functionality of the system.
 Transposed SFG and Data-broadcast structure of FIR filters:
data-
broadcast
Tcritical = TM + 2TA
Tcritical = TM + TA 19
Transposed Form 4-tap FIR filter
 Critical path: the path with the longest computation time

among all paths without delay elements
20
Fine-grain pipelining
 Further breakdown the functional units by pipelining
to increase performance.
 E.g. breakdown each multiplier into 2 small units
21
Fine-grain pipelining
 Further breakdown the functional units by pipelining
to increase performance.
 E.g. breakdown each multiplier into 2 small units
 Let TM = 10 u.t. and TA = 2 u.t.. If the multiplier is
broken into 2 smaller units with processing time of 6
u.t. and 4 u.t., respectively, then the desired clock
period can be achieved as (4 + 2) = 6 u.t.
22
3.3 Parallel Processing
23
Parallel Processing (cont’d)
24
= x(2(k-1))
= x(10(k-1))
25
Example of Parallel 3-Tap FIR Filter with block size L=2
26
Parallel Processing of 3-Tap FIR Filter
27
Parallel processing architecture for a 3-tap FIR filter with
block size L=3
28
29
Complete parallel processing system with
block size L=4
Clock Period=T/4
x(n) Serial-to-Parallel
Sampling Converter
Period=T/4
x(4k+3)
x(4k+2)
x(4k+1)
x(4k)
y(4k+3)
y(4k+2) Parallel-to-Serial
Clock Period =T MIMO y(n)
y(4k+1) Converter
System
y(4k)
30
Complete parallel processing system with
block size L=4
31
Why Parallel Processing?
 Parallel leads to duplicating many copies of hardware, and the
cost increases! Why use?
 There is fundamental limit to pipelining imposed by the
input/output (I/O) bottlenecks, referred to as Communication bounded.
 Communication bounded
 Communication time (input/output
pad delay + wire-delay) is larger
than that of computation delay.
 Pipelining can only be used to Chip1 Chip2
reduce the critical path computation
delay. For communication-bound Tcommunication
o/p i/p
system, this cannot help. pad Parallel pad
 So only parallel processing can be Transmission
used to improve the performance. Tcomputation
 Further improvement can be
achieved by combining pipelining
and parallel processing. 32
Why Parallel Processing?
33
Example: Combined fine-grain pipelining and
parallel processing for 3-tap FIR filter
34
3.4 Pipelining and Parallel Processing for Low Power
Vo: supply voltage;

Vt: threshold voltage
𝑊
𝑘=𝜇 𝐶 𝑜𝑥 k: is dependent on technology
𝐿 parameters , Cox and W/L
35
3.4 Pipelining and Parallel Processing for Low Power
(original clock period)

36
Recall
 In electronics, digital circuits and digital electronics, the
propagation delay, or gate delay, is the length of time
which starts when the input to a logic gate becomes
stable and valid to change, to the time that the output of
that logic gate is stable and valid to change (source: wiki).
37
Pipelining for Low Power
 The technique of reducing power consumption is
to first use pipelining to reduce critical period by
pipelining and then restore it to the original level
by reducing the supply voltage.
 Power dissipation in original sequential circuit is
given by
Pseq=CtotalV02 f
where f = and Tseq is the clock period of the
original sequential filter.
38
 Consider an M level pipelined system (i.e. having
M-1 latches). Assuming we divide the critical path
into equal segments.
Tseq
Sequential (critical path)
Tpip Tpip Tpip
Pipelined: (critical path when M=3)
 Critical Path of the pipelined system: Tpip =

 Capacitance to be charged disclosed in a single
clock cycle =
39
 It is important to note that Ctotal remains same only
Ccharge changes.
 To save power, we reduce the supply voltage Vo
by a factor 0 < β <1 increase propagation delay.
Cch arg eVo

t pd 
k (Vo  Vt ) 2
40
 We will choose a proper reduction factor β so that
the supply voltage β.Vo produces the same Tcritical
after pipelining.
 Original delay:
Cch arg eVo
Tseq  2
k (Vo  Vt )
 After M stage pipelining:
Cch arg e
 Vo
Tpipe  M
2
k ( Vo  Vt )
41
 We choose a proper reduction factor β so that the
supply voltage β.Vo produces the same Tcritical
after pipelining, it means
 The new power consumption with pipelining:
Ppip=Ctotalb2V02 f=b2Pseq
 We can find the required level of pipelining M or

the power consumption reduction factor β2
by solving the following equation:
2 2
M ( Vo  Vt )   (Vo  Vt )
 Note that: 42
Example: Reduce Power by pipelining processing
x(n)
x(n) (6) (6) (6)
m1 m1 m1
C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)
 Assume
 Capacitance of multiplier CM is 5 times of that of an adder CA
 Fine-grain pipelining is used, and Cm1= 3CA and Cm2 = 2CA
 Vo = 5V and Vt = 0.6V
 Find the power consumption reduction factor β2?
43
Example: Reduce Power by pipelining processing
x(n)
x(n) (6) (6) (6)
m1 m1 m1
C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)
 For original filter: Ccharge= CM+CA= 5CA + CA = 6CA

 For pipelined filter: Ccharge= Cm1 = Cm2+CA = 3CA
 Now M = 2, we have 2x(βx5-0.6)2=βx(5-0.6)2, solving
this equation, we have β=0.6033 (note that β=0.039 is
not feasible)
 The voltage of the pipelined filter: Vpip= β.Vo =3V
 Power consumption ratio is β2 = 36.4% 44
Parallel Processing for Low Power
 In an L-parallel architecture, we can assume the
charge capacitance (i.e. Ccharge) remain the same,
but the total capacitance (i.e. Ctotal) is increased L
times more because L copies of hardware are
used.
 Since maintaining the same sampling rate, clock
period of the L-parallel architecture (i.e. Tpar) is
increased to L.Tseq 3T seq
Tseq 3Tseq
3Tseq
Sequential (critical path)
Parallel: critical path when L=3

45
 Supply voltage can be reduced to β.Vo since
more time is allowed to charge or discharge the
same capacitance Ccharge.
Cch arg eVo

t pd 
k (Vo  Vt ) 2
46

47
Example: Reduce Power by parallel processing
 Consider the following 4-tap FIR filters
Assumption:
 CM = 8CA
 TM = 8TA
 both architectures
operate at the
sampling period of 9TA
 Supply voltage = 3.3V
and Vt = 0.45V
Find the power
consumption
reduction factor β2?
48
Solution
(not feasible)
49
50
Solution
51
Combining Pipelining and Parallel Processing
for Low Power
52
Combining Pipelining and Parallel Processing
for Low Power
Example: Consider a system with L = M = 2, Vo = 5V, Vt

=0.6V. Find new operating voltage and reduction in power if
sampling rate is kept same.
Answer:
¿ 𝑽 𝒑𝒑 =𝟐 𝐕
53

DFPGA - Ch03 - Pipelining and Parallel Processing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DFPGA - Ch03 - Pipelining and Parallel Processing

Uploaded by

Copyright:

Available Formats

ĐHBK Tp HCM

 It takes 90 minutes to wash, dry, and fold 1 load of

If each load is done sequentially, it takes 270 minutes.

If loads are done in parallel, it takes 90 minutes.

If each load is done sequentially, it takes 270 minutes.

Start each load as

 Parallel processing  Pipelined processing

Colors: different types of operations performed

 Using pipeline latches to reduce the critical path delay

Clock Input Node 1 Node 2 Node 3 Output

 Transposed SFG and Data-broadcast structure of FIR filters:

 Critical path: the path with the longest computation time

Vo: supply voltage;

(original clock period)

Sequential (critical path)

Tpip Tpip Tpip

Pipelined: (critical path when M=3)

 Critical Path of the pipelined system: Tpip =

Cch arg eVo

 We can find the required level of pipelining M or

 For original filter: Ccharge= CM+CA= 5CA + CA = 6CA

Parallel: critical path when L=3

Cch arg eVo

Example: Consider a system with L = M = 2, Vo = 5V, Vt

You might also like