You are on page 1of 53

ĐHBK Tp HCM

BMĐT
GV: Nguyễn Lý Thiên Trường

Chapter 3
Pipelining and Parallel Processing
(Tạo đường ống và xử lý song song)

TLTK:
1. Các slide từ sách của Prof. Parhi
2. Slide của Prof. Viktor Öwall
3. Slide của Thầy Hồ Trung Mỹ
1
Outline
 Introduction
 Pipelining of FIR Digital Filters
 Parallel Processing
 Pipelining and Parallel Processing for Low Power
 Pipelining for Lower Power
 Parallel Processing for Lower Power
 Combining Pipelining and Parallel Processing for Lower
Power

2
Basic Ideas
 Assuming you’re got:
 One washer (takes 30 minutes)
 One drier (take 40 minutes)
 One folder (take 20 minutes)

 It takes 90 minutes to wash, dry, and fold 1 load of


laundry.
 How long does 3 loads take?
 Serial processing
9:00 10:30 12:00 1:30
PM PM AM AM
30 min 40 min 20 min 30 min 40 min 20 min 30 min 40 min 20 min

If each load is done sequentially, it takes 270 minutes.


 Parallel processing
9:00 10:30
PM PM
30 min 40 min 20 min

If loads are done in parallel, it takes 90 minutes.


 Serial processing
9:00 1:30
PM AM
30 min 40 min 20 min 30 min 40 min 20 min 30 min 40 min 20 min

If each load is done sequentially, it takes 270 minutes.


 Pipelined processing
9:00 11:50
PM PM
30 min 40 min 40 min 40 min 20 min

Start each load as


soon as possible.
It takes 170 minutes.
Basic Ideas
Serial processing
a1 a2 a3 a1 a2 a3 a1 a2 a3 ⋯
 Less inter-processor communication
 Simpler processor hardware

 Parallel processing  Pipelined processing


P1 a1 a2 a3 P1 a1 b1 c1

P2 b1 b2 b3 P2 a2 b2 c2

P3 c1 c2 c3 P3 a3 b3 c3
time time
 Less inter-processor communication  More inter-processor communication
 Complicated processor hardware  Simpler processor hardware

Colors: different types of operations performed


a, b, c: different data streams processed 6
Data Dependence
 Parallel processing  Pipelined processing will
requires NO data involve inter-processor
dependence between communication
processors
Stage 1 Stage 2 Stage 3 Stage …
(Level 1) (Level 2) (Level 3) (Level …)

P1 P1

P2 P2

P3 P3
time time
 M-stage (or M-level) pipelining
 Block size L: the number of
needs (M-1) more delay elements
inputs processed in a clock cycle.
in any path from input to output.
7
 Pipelining
3.1 Introduction
 Comes from the idea of a water pipe: continue sending water
without waiting the water in the pipe to be out

Continuous sending
Tsampling = TCLK

 Using pipeline latches to reduce the critical path delay


 Can exploit to increase the clock speed of sampling speed or
reduceProcessing
 Parallel power consumption at the same speed.
 Multiple output are computed in parallel in a clock period with
parallel hardware
 Effective sampling speed is increased with the level of parallelism
 Can be used for the reduction of power consumption
Tsampling
Continuous sending
TCLK
(Tsampling = ; L = Block size) 8
3.1 Introduction
 Pipelining
 Reduce the critical path
 Increase the clock speed or sampling speed
(fsampling= fclk)
 Reduce power consumption
 Parallel Processing
 Not reduce the critical path
 Not increase clock speed, but increase
sampling speed (fsampling = L.fclk)
 Reduce power consumption
9
Example 1: A 3-tap FIR Filter
Consider a 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)

Critical path

 The critical path (or the minimum time required for processing a new
sample) is limited by 1 multiply and 2 add times. Thus, the “sampling
period” (or the “sampling frequency” ) is given by:
1
Tsampling  Tclock  TM  2TA or f sampling  f clock 
TM  2TA
where TM be the delay for multiplier and TA be the delay of adder
 If some real-time application requires a faster input rate (sampling
rate), then this direct-form structure cannot be used! In this case, the
critical path can be reduced by either pipelining or parallel 10
processing.
Pipelining processing
 Introduce pipelining latches along the data-path

11
Parallel processing
 Duplicate the hardware

12
Example 1: A 3-tap FIR Filter
Consider a 3-tap FIR filter: y(n) = ax(n) + bx(n-1) + cx(n-2)

Critical path

b.x(n- c.x(n-2) 3
1)
a.x(n) a.x(n) + b.x(n-1)
1 2
Clock Input Node 1 Node 2 Node 3 Output

Tcritical = TM + 2TA
3.2 Pipelining of FIR Digital Filters
 The pipelined implementation: By introducing 2 additional
latches in Example 1, Tcritical: TM+2TA  TM+TA.
Critical path
x(n-1) x(n-3)

b.x(n-1) c.x(n-3)
a.x(n)

Node 1 a.x(n-1)+b.x(n-2)

Clock Input Node 1 Node 2 Node 3 Output


Pipelining: Definitions

15
Example of Feed-forward cutsets
max{ , }

16
Pipelining: How to ?

17
Pipelining properties
 M-level pipelining needs M-1 more delay elements in any path
from input to output.
 Increase in speed with the following penalty
 Increase in system latency
 Increase in the number of latches
 Pipelining latches can only be placed across any feed-forward
cutset of the graph
 Clock period limitation: critical path may be between
 An input and a latch x(n) y(n)
+ D +
 A latch and an output b
 Two Latches X
 An input and an output D
a z(n)
Tclock Tcritical X +
18
Transposition Theorem
 The critical path of the original 3-tap FIR filter can be reduced
without introducing pipelining latches by transposing the structure
 Reversing the direction of all edges in a given SFG (Signal Flow
Graph) and interchanging the input and output ports preserve the
functionality of the system.

 Transposed SFG and Data-broadcast structure of FIR filters:

data-
broadcast

Tcritical = TM + 2TA
Tcritical = TM + TA 19
Transposed Form 4-tap FIR filter

 Critical path: the path with the longest computation time


among all paths without delay elements
20
Fine-grain pipelining
 Further breakdown the functional units by pipelining
to increase performance.
 E.g. breakdown each multiplier into 2 small units

21
Fine-grain pipelining
 Further breakdown the functional units by pipelining
to increase performance.
 E.g. breakdown each multiplier into 2 small units
 Let TM = 10 u.t. and TA = 2 u.t.. If the multiplier is
broken into 2 smaller units with processing time of 6
u.t. and 4 u.t., respectively, then the desired clock
period can be achieved as (4 + 2) = 6 u.t.

22
3.3 Parallel Processing

23
Parallel Processing (cont’d)

24
Parallel Processing (cont’d)

= x(2(k-1))

= x(10(k-1))

25
Example of Parallel 3-Tap FIR Filter with block size L=2

26
Parallel Processing of 3-Tap FIR Filter

27
Parallel processing architecture for a 3-tap FIR filter with
block size L=3

28
Parallel Processing (cont’d)

29
Complete parallel processing system with
block size L=4

Clock Period=T/4
x(n) Serial-to-Parallel
Sampling Converter
Period=T/4
x(4k+3)
x(4k+2)
x(4k+1)
x(4k)
y(4k+3)
y(4k+2) Parallel-to-Serial
Clock Period =T MIMO y(n)
y(4k+1) Converter
System
y(4k)

30
Complete parallel processing system with
block size L=4

31
Why Parallel Processing?
 Parallel leads to duplicating many copies of hardware, and the
cost increases! Why use?
 There is fundamental limit to pipelining imposed by the
input/output (I/O) bottlenecks, referred to as Communication bounded.
 Communication bounded
 Communication time (input/output
pad delay + wire-delay) is larger
than that of computation delay.
 Pipelining can only be used to Chip1 Chip2
reduce the critical path computation
delay. For communication-bound Tcommunication
o/p i/p
system, this cannot help. pad Parallel pad
 So only parallel processing can be Transmission
used to improve the performance. Tcomputation
 Further improvement can be
achieved by combining pipelining
and parallel processing. 32
Why Parallel Processing?

33
Example: Combined fine-grain pipelining and
parallel processing for 3-tap FIR filter

34
3.4 Pipelining and Parallel Processing for Low Power

Vo: supply voltage;


Vt: threshold voltage
𝑊
𝑘=𝜇 𝐶 𝑜𝑥 k: is dependent on technology
𝐿 parameters , Cox and W/L
35
3.4 Pipelining and Parallel Processing for Low Power

(original clock period)


36
Recall
 In electronics, digital circuits and digital electronics, the
propagation delay, or gate delay, is the length of time
which starts when the input to a logic gate becomes
stable and valid to change, to the time that the output of
that logic gate is stable and valid to change (source: wiki).

37
Pipelining for Low Power
 The technique of reducing power consumption is
to first use pipelining to reduce critical period by
pipelining and then restore it to the original level
by reducing the supply voltage.
 Power dissipation in original sequential circuit is
given by
Pseq=CtotalV02 f
where f = and Tseq is the clock period of the
original sequential filter.

38
Pipelining for Low Power
 Consider an M level pipelined system (i.e. having
M-1 latches). Assuming we divide the critical path
into equal segments.
Tseq

Sequential (critical path)

Tpip Tpip Tpip

Pipelined: (critical path when M=3)

 Critical Path of the pipelined system: Tpip =


 Capacitance to be charged disclosed in a single
clock cycle =
39
Pipelining for Low Power
 It is important to note that Ctotal remains same only
Ccharge changes.
 To save power, we reduce the supply voltage Vo
by a factor 0 < β <1 increase propagation delay.

Cch arg eVo


t pd 
k (Vo  Vt ) 2

40
Pipelining for Low Power
 We will choose a proper reduction factor β so that
the supply voltage β.Vo produces the same Tcritical
after pipelining.
 Original delay:
Cch arg eVo
Tseq  2
k (Vo  Vt )
 After M stage pipelining:
Cch arg e
 Vo
Tpipe  M
2
k ( Vo  Vt )
41
Pipelining for Low Power
 We choose a proper reduction factor β so that the
supply voltage β.Vo produces the same Tcritical
after pipelining, it means
 The new power consumption with pipelining:
Ppip=Ctotalb2V02 f=b2Pseq

 We can find the required level of pipelining M or


the power consumption reduction factor β2
by solving the following equation:
2 2
M ( Vo  Vt )   (Vo  Vt )
 Note that: 42
Example: Reduce Power by pipelining processing
x(n)
x(n) (6) (6) (6)
m1 m1 m1

C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)

 Assume
 Capacitance of multiplier CM is 5 times of that of an adder CA
 Fine-grain pipelining is used, and Cm1= 3CA and Cm2 = 2CA
 Vo = 5V and Vt = 0.6V
 Find the power consumption reduction factor β2?

43
Example: Reduce Power by pipelining processing
x(n)
x(n) (6) (6) (6)
m1 m1 m1

C B A D D D
m2 (4) m2 (4) m2 (4)
y(n)
D D y(n-1)
D D
(2) (2)

 For original filter: Ccharge= CM+CA= 5CA + CA = 6CA


 For pipelined filter: Ccharge= Cm1 = Cm2+CA = 3CA
 Now M = 2, we have 2x(βx5-0.6)2=βx(5-0.6)2, solving
this equation, we have β=0.6033 (note that β=0.039 is
not feasible)
 The voltage of the pipelined filter: Vpip= β.Vo =3V
 Power consumption ratio is β2 = 36.4% 44
Parallel Processing for Low Power
 In an L-parallel architecture, we can assume the
charge capacitance (i.e. Ccharge) remain the same,
but the total capacitance (i.e. Ctotal) is increased L
times more because L copies of hardware are
used.
 Since maintaining the same sampling rate, clock
period of the L-parallel architecture (i.e. Tpar) is
increased to L.Tseq 3T seq

Tseq 3Tseq

3Tseq
Sequential (critical path)

Parallel: critical path when L=3


45
Parallel Processing for Low Power
 Supply voltage can be reduced to β.Vo since
more time is allowed to charge or discharge the
same capacitance Ccharge.

Cch arg eVo


t pd 
k (Vo  Vt ) 2

46
Parallel Processing for Low Power

47
Example: Reduce Power by parallel processing
 Consider the following 4-tap FIR filters
Assumption:
 CM = 8CA
 TM = 8TA
 both architectures
operate at the
sampling period of 9TA
 Supply voltage = 3.3V
and Vt = 0.45V
Find the power
consumption
reduction factor β2?

48
Example: Reduce Power by parallel processing
Solution

(not feasible)

49
Example: Reduce Power by parallel processing

50
Example: Reduce Power by parallel processing
Solution

51
Combining Pipelining and Parallel Processing
for Low Power

52
Combining Pipelining and Parallel Processing
for Low Power

Example: Consider a system with L = M = 2, Vo = 5V, Vt


=0.6V. Find new operating voltage and reduction in power if
sampling rate is kept same.
Answer:
¿ 𝑽 𝒑𝒑 =𝟐 𝐕
53

You might also like