
WiCHIP Technologies

Engineering your ideas


Advanced FPGA Design
Architecting Speed
Vu-Duc Ngo
duc75@wichiptech.com
Contents
• 1. Introduction to Architecting Speed
• 2. High Throughput
• 3. Low Latency
• 4. Timing
• 5. Summary
1. Introduction to Architecting Speed
Introduction to Architecting Speed
• When dealing with digital designs, optimizations are required
to meet design constraints.
• One of three primary physical characteristics of a digital design
is: speed
• Three primary definitions of speed:
• Throughput: the amount of data processed per clock cycle (commonly measured in bits/s)
• Latency: time between data input – processed data output (time or
clock cycles)
• Timing: logic delays between sequential elements (clock period,
frequency)
Introduction to Architecting Speed
• To improve speed, we need different architectures:
• High-throughput architectures: maximizing number of processed
bits/second
• Low-latency architectures: minimizing delay from input → output
• Timing optimizations: reducing combinatorial delay of the critical
path
• Techniques for timing optimization:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
2. High Throughput
High Throughput
• High-throughput design is:
• More concerned with the steady-state data rate
• Less concerned with latency (i.e., the time a specific piece of data requires to
propagate through the design)
• High-throughput design employs the so-called pipeline
architecture
• New data can begin processing before the prior data has finished → improving
system throughput
• Let us consider some examples to illustrate the beauty of
pipeline architecture
High Throughput
• Consider the following piece of code:

XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

• This code is an iterative algorithm: after three iterations, XPower holds X^3
• The same variables and addresses are accessed until the computation completes
• We can also implement the above code in hardware using
Verilog as follows:
High Throughput
module power3(
    output reg [7:0] XPower,   // declared reg in the port list (ANSI style)
    output finished,
    input [7:0] X,
    input clk, start);   // start is asserted for a single clock
  reg [7:0] ncount;
  assign finished = (ncount == 0);
  always @(posedge clk)
    if(start) begin
      XPower <= X;
      ncount <= 2;
    end
    else if(!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * X;
    end
endmodule
High Throughput
• The same register and computational resources are used until the completion of
the computation

Figure 1.1: Iterative Implementation (a single multiplier whose registered output
feeds back into XPower)

• No new computations can begin until the previous computation has completed
High Throughput
• Handshaking signals are required: start, finished
• Performance of the above Verilog implementation:
• Throughput = 8 bits / 3 clocks ≈ 2.7 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Let us have a look at a pipelined version of the same algorithm


High Throughput
module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2
    X2 <= X1;
    XPower2 <= XPower1 * X1;
    // Pipeline stage 3
    XPower <= XPower2 * X2;
  end
endmodule
High Throughput
• The pipelined implementation can be described by the following figure:

Figure 1.2: Pipelined Implementation (one multiplier delay in the critical path;
1st clock: compute X^2; 2nd clock: compute X*X^2 and the next X^2;
3rd clock: output X^3 while the following values advance through the pipeline)

High Throughput
• Performance of the pipelined implementation:
• Throughput = 8 bits / 1 clock = 8 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Clearly, this implementation has improved throughput by a factor of 3

• Unrolling an iterative loop increases throughput


• In general, unrolling an iterative loop of n iterations into a pipelined
implementation increases throughput by a factor of n

• Penalty: increase in area (more registers, more multipliers)
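• The throughput arithmetic above can be checked with a quick cycle-count model (a Python sketch, not part of the original slides; the function names and the 1000-result workload are illustrative):

```python
# Cycle-count model of the two power3 architectures (8-bit data, X**3).
# Iterative: one result every 3 clocks, no overlap.
# Pipelined: one result per clock once the 3-stage pipe is full.

def iterative_cycles(n_results, cycles_per_result=3):
    """Total clocks to produce n results when computations cannot overlap."""
    return n_results * cycles_per_result

def pipelined_cycles(n_results, stages=3):
    """Total clocks with a full pipeline: fill latency + one result per clock."""
    return stages + (n_results - 1)

n = 1000
print(8 * n / iterative_cycles(n))   # ~2.67 bits/clock
print(8 * n / pipelined_cycles(n))   # ~7.98 bits/clock, approaching 8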


3. Low Latency
Low latency
• Low-latency design: passes data from the input to the output as quickly
as possible
• To achieve this goal, intermediate processing delays must be minimized
• Often, a low-latency design requires:
• Parallelism
• Removal of pipelining
• Logical shortcuts
• Observing the previous pipelined implementation, we can see the
possibility of reducing latency:
• At each pipeline stage, the product of each multiply must wait for the
next clock edge to propagate to the next stage
• Therefore, we expect that removing pipeline registers → reduced
latency
Low latency
module power3(
    output [7:0] XPower,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  assign XPower = XPower2 * X2;
  always @* begin
    X1 = X;
    XPower1 = X;
  end
  always @* begin
    X2 = X1;
    XPower2 = XPower1 * X1;
  end
endmodule
Low latency
• The low-latency implementation is illustrated as follows:

Figure 1.3: Low-latency Implementation (pipeline registers removed;
two multiplier delays in the critical path)

Low latency
• Performance of this low-latency design:
• Throughput = 8 bits/clock (assuming one new input per clock)
• Latency = 0 clocks (between one and two multiplier delays of combinatorial time)
• Timing = Two multiplier delays in the critical path

• Latency can be reduced by removing pipeline registers

• Penalty: increase in combinatorial delay


4. Timing
Timing
• Timing refers to the clock speed of a design
• Maximum delay between any two sequential elements in a design will
determine the max clock speed.
• Maximum speed, or maximum frequency, is defined as follows:

Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup + Tskew)

• where
• Fmax: maximum allowable frequency for the clock
• Tclk-q: time from clock arrival until data arrives at Q
• Tlogic: propagation delay through the logic between flip-flops
• Trouting: routing delay between flip-flops
• Tsetup: setup time of the capturing flip-flop
• Tskew: propagation delay (skew) of the clock
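• Plugging sample numbers into the expression shows where the clock period goes (a Python sketch; all delay values below are hypothetical, not taken from the slides):

```python
# Illustrative Fmax budget. All delay values are made-up examples, in ns.
t_clk_q   = 0.5   # clock-to-Q of the launching flip-flop
t_logic   = 2.0   # combinatorial logic between flip-flops
t_routing = 1.0   # routing between flip-flops
t_setup   = 0.5   # setup time of the capturing flip-flop
t_skew    = 0.2   # clock propagation delay (skew)

t_min_period = t_clk_q + t_logic + t_routing + t_setup + t_skew   # 4.2 ns
f_max = 1.0 / t_min_period          # in GHz, since the delays are in ns
print(round(f_max * 1000, 1))       # ~238.1 MHz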
Timing – Add Register Layers
• Adding intermediate layers of registers to the critical path is the
first strategy for architectural timing improvement
• This technique should be used in highly pipelined designs, where:
• An additional clock cycle latency does not violate the design
specifications
• The overall functionality will not be affected by the further addition of
registers

• For example: assume the architecture for the following FIR implementation does
not meet timing:
Timing – Add Register Layers
module fir(
    output reg [7:0] Y,   // declared reg in the port list (ANSI style)
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  always @(posedge clk)
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      Y <= A*X + B*X1 + C*X2;
    end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:

Figure 1.4: MAC with long path (one adder delay plus one multiplier delay in the
critical path is greater than the minimum clock period requirement, i.e., timing
is not met)

• In this implementation, all multiply/add operations occur in one clock cycle
Timing – Add Register Layers
• To resolve the issue, we can further pipeline this design:
• By adding extra registers after the multipliers
• Under the assumption that the latency requirement is not fixed at 1 clock

• We just have to add a pipeline layer between the multipliers and the
adder
Timing – Add Register Layers
module fir(
    output reg [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  reg [7:0] prod1, prod2, prod3;   // registers added
  always @(posedge clk) begin
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule
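• A quick behavioral model (a Python sketch with arbitrary coefficients and sample data, not from the slides) confirms that registering the products only delays Y by one clock without changing the computed values:

```python
# Behavioral model of the two FIR versions: Y = A*x[n] + B*x[n-1] + C*x[n-2].
# The pipelined version registers the three products, adding one clock of
# latency but producing the same sequence of sums.

def fir_direct(samples, A, B, C):
    x1 = x2 = 0
    out = []
    for x in samples:
        out.append(A * x + B * x1 + C * x2)   # multiply and add in one step
        x2, x1 = x1, x
    return out

def fir_pipelined(samples, A, B, C):
    x1 = x2 = 0
    p1 = p2 = p3 = 0
    out = []
    for x in samples:
        out.append(p1 + p2 + p3)              # adder uses registered products
        p1, p2, p3 = A * x, B * x1, C * x2    # product registers
        x2, x1 = x1, x
    return out

xs = [3, 1, 4, 1, 5, 9, 2, 6]
direct = fir_direct(xs, 2, 3, 5)
piped = fir_pipelined(xs + [0], 2, 3, 5)      # one extra clock to flush
assert piped[1:] == direct                    # same results, one clock later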
Timing – Add Register Layers
• The above implementation is described by the following figure:

Figure 1.5: Pipeline registers added (intermediate registers after each
multiplier leave only one multiplier delay or one adder delay in any path;
the shorter critical path means the timing requirement is met)

Timing – Add Register Layers

• Adding register layers improves timing by dividing the critical path into
two paths of smaller delay.

• Penalty: increase in latency and area


Timing – Parallel Structures
• Reorganizing the critical path such that logic structures are
implemented in parallel is the second strategy for architectural
timing improvement
• This technique should be used whenever:
• A function that currently evaluates through a serial string of logic
• But can be broken and evaluated in parallel

• For instance, assume that the standard pipelined power-of-3 design discussed
above does not meet timing
• To create parallel structures:
• Break up the multipliers into independent operations
• Recombine them to get results
Timing – Parallel Structures
• For example:
• To compute the square of an 8-bit binary number X, i.e., X^2, we first
represent:

X = {A, B}

where A and B are the most and the least significant nibbles, respectively
• Then, the multiply operation can be reorganized as follows:

X * X = {A, B} * {A, B} = (A*A << 8) + ((2*A*B) << 4) + (B*B)

• In this way, we have reduced our problem to a series of 4-bit
multiplications that can be handled in parallel
• Recombining the products, we get the final result for the 8-bit
multiplication
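• The decomposition is easy to sanity-check in software (a quick Python check, not part of the slides):

```python
# Verify X*X == (A*A << 8) + (2*A*B << 4) + B*B for every 8-bit X,
# where A = X[7:4] (most significant nibble) and B = X[3:0].
# This works because X = 16*A + B, so X*X = 256*A*A + 32*A*B + B*B.
for x in range(256):
    a, b = x >> 4, x & 0xF
    recombined = (a * a << 8) + (2 * a * b << 4) + b * b
    assert recombined == x * x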
Timing – Parallel Structures
• Applying this principle, we can re-implement the power-of-3 design as
follows:

module power3(
    output [7:0] XPower,
    input [7:0] X,
    input clk);
  reg [7:0] XPower1;
  // partial product registers ([7:0]: a 4-bit x 4-bit product needs 8 bits)
  reg [7:0] XPower2_ppAA, XPower2_ppAB, XPower2_ppBB;
  reg [7:0] XPower3_ppAA, XPower3_ppAB, XPower3_ppBB;
  reg [7:0] X1, X2;
  wire [7:0] XPower2;
Timing – Parallel Structures
  // nibbles for partial products (A: MS nibble, B: LS nibble)
  wire [3:0] XPower1_A = XPower1[7:4];
  wire [3:0] XPower1_B = XPower1[3:0];
  wire [3:0] X1_A = X1[7:4];
  wire [3:0] X1_B = X1[3:0];
  wire [3:0] XPower2_A = XPower2[7:4];
  wire [3:0] XPower2_B = XPower2[3:0];
  wire [3:0] X2_A = X2[7:4];
  wire [3:0] X2_B = X2[3:0];
  // assemble partial products (combine to get the 8-bit results)
  assign XPower2 = (XPower2_ppAA << 8) + (2*XPower2_ppAB << 4) + XPower2_ppBB;
  assign XPower  = (XPower3_ppAA << 8) + (2*XPower3_ppAB << 4) + XPower3_ppBB;
Timing – Parallel Structures
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2: 4-bit partial products are handled in parallel
    X2 <= X1;
    XPower2_ppAA <= XPower1_A * X1_A;
    XPower2_ppAB <= XPower1_A * X1_B;
    XPower2_ppBB <= XPower1_B * X1_B;
    // Pipeline stage 3
    XPower3_ppAA <= XPower2_A * X2_A;
    XPower3_ppAB <= XPower2_A * X2_B;
    XPower3_ppBB <= XPower2_B * X2_B;
  end
endmodule
Timing – Flatten Logic Structures
• Flattening logic structures is the third strategy for architectural
timing improvement
• This technique is closely related to the idea of parallel structures, but
applies specifically to logic that is chained due to priority encoding.

• Normally, synthesis and layout tools are not smart enough to break up
logic structures coded in a serial fashion
• They also do not have enough information relating to the priority
requirements of the design
• Therefore, it is possible that they are unable to give us a configuration
with our desired timing
Timing – Flatten Logic Structures
• For example:
• Consider the following control signals coming from an address decode
that are used to write 4 registers:
module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);   // control signals are coded with a priority relative to one another
  always @(posedge clk)
    if(ctrl[0]) rout[0] <= in;
    else if(ctrl[1]) rout[1] <= in;
    else if(ctrl[2]) rout[2] <= in;
    else if(ctrl[3]) rout[3] <= in;
endmodule
Timing – Flatten Logic Structures
• The above type of priority encoding is implemented as follows:

Figure 1.6: Priority encoding (combinatorial logic gates are required by the
priority encoding, introducing delay)

Timing – Flatten Logic Structures
• A time delay is required for the priority logic → timing is not met
• The solution is to remove the priority and thereby flatten the logic

module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);   // priority removed: control signals act independently
  always @(posedge clk) begin
    if(ctrl[0]) rout[0] <= in;
    if(ctrl[1]) rout[1] <= in;
    if(ctrl[2]) rout[2] <= in;
    if(ctrl[3]) rout[3] <= in;
  end
endmodule
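• Note that flattening is only safe when the priority truly is not needed: if more than one ctrl bit can be asserted at once, the two versions behave differently. A small Python model of one clock edge (an illustration, not from the slides) makes this visible:

```python
# Model one clock edge of each regwrite variant for a 4-bit ctrl word.

def regwrite_priority(rout, ctrl, inp):
    """if/else-if chain: only the lowest-indexed asserted bit is written."""
    out = list(rout)
    for i in range(4):
        if ctrl & (1 << i):
            out[i] = inp
            break                      # priority: stop at the first match
    return out

def regwrite_flat(rout, ctrl, inp):
    """independent ifs: every asserted bit is written."""
    out = list(rout)
    for i in range(4):
        if ctrl & (1 << i):
            out[i] = inp
    return out

# One-hot ctrl (e.g., from an address decode): both behave identically.
assert regwrite_priority([0]*4, 0b0100, 1) == regwrite_flat([0]*4, 0b0100, 1)
# Multiple bits set: the flattened version writes all of them.
assert regwrite_priority([0]*4, 0b0101, 1) == [1, 0, 0, 0]
assert regwrite_flat([0]*4, 0b0101, 1) == [1, 0, 1, 0]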
Timing – Flatten Logic Structures
Figure 1.7: No priority encoding (the gates are removed; each control signal
acts independently and controls its corresponding rout bit)

Timing – Flatten Logic Structures

• By removing priority encodings where they are not needed, the logic
structure is flattened and the path delay is reduced
Timing – Register Balancing
• Register balancing is the fourth strategy for architectural timing
improvement
• Idea of register balancing: redistribute logic evenly between
registers to minimize the worst-case delay between any two registers
• This technique should be used whenever:
• Logic is highly imbalanced between the critical path and an adjacent
path

• Although many synthesis tools have a so-called register-balancing
optimization, they can handle only simple cases in a predetermined
fashion
• Thus, designers must be able to redistribute logic in their own
way to reduce the worst-case delay
Timing – Register Balancing
• Consider the following code for an adder that adds three 8-bit inputs:

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);
  reg [7:0] rA, rB, rC;
  always @(posedge clk) begin
    rA <= A;
    rB <= B;
    rC <= C;
    Sum <= rA + rB + rC;
  end
endmodule
Timing – Register Balancing
Figure 1.8: Registered adder (no logic in the 1st register stage; both additions
feed Sum in the second stage, which determines the worst-case delay)
Timing – Register Balancing
• Some of the logic in the critical path can be moved back a stage, thereby
balancing the logic load between two register stages.

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);
  reg [7:0] rABSum, rC;
  always @(posedge clk) begin
    rABSum <= A + B;    // one addition moved back a stage
    rC <= C;
    Sum <= rABSum + rC;
  end
endmodule
Timing – Register Balancing
Figure 1.9: Register balanced (one of the add operations is moved back a stage,
balancing the logic between stages and reducing the delay in the critical path)
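• Both adder versions compute the same sums with the same latency; a small cycle model (a Python sketch with arbitrary input data, not from the slides) shows they match clock for clock:

```python
# Cycle model of the two adders: both have 2 clocks of latency from the
# inputs to Sum; the balanced one splits the two additions across stages.

def adder_unbalanced(stream):
    rA = rB = rC = s = 0
    out = []
    for a, b, c in stream:
        out.append(s)
        s = rA + rB + rC             # both adds in the second stage
        rA, rB, rC = a, b, c         # first stage: registers only
    return out

def adder_balanced(stream):
    rABSum = rC = s = 0
    out = []
    for a, b, c in stream:
        out.append(s)
        s = rABSum + rC              # one add per stage
        rABSum, rC = a + b, c        # first stage now does A + B
    return out

data = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (0, 0, 0)]
assert adder_unbalanced(data) == adder_balanced(data)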


Timing – Reorder Paths
• Reordering the paths in the data flow to minimize the critical
path is the fifth strategy

• This technique should be used whenever:


• Multiple paths combine with the critical path
• The combined path can be reordered such that the critical path
becomes shorter
Timing – Reorder Paths
• Consider the following module:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  always @(posedge clk)
    if(Cond1)
      Out <= A;
    else if(Cond2 && (C < 8))
      Out <= B;
    else
      Out <= C;
endmodule
Timing – Reorder Paths
• The critical path is between C and Out: the comparator in series with two gates

Figure 1.10: Long critical path


Timing – Reorder Paths
• We can modify the code to move the comparator's long delay off the critical path:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  wire CondB = (Cond2 & !Cond1);
  always @(posedge clk)
    if(CondB && (C < 8))
      Out <= B;
    else if(Cond1)
      Out <= A;
    else
      Out <= C;
endmodule
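• The reordering is only safe because both if/else chains select the same source for every input combination; a brute-force check over all condition combinations (a Python sketch, not from the slides) confirms the equivalence:

```python
# Exhaustively check that the reordered priority chain picks the same
# source (A, B, or C) as the original for every Cond1/Cond2/(C < 8) combo.
from itertools import product

def original(cond1, cond2, c_lt_8):
    if cond1:
        return "A"
    elif cond2 and c_lt_8:
        return "B"
    return "C"

def reordered(cond1, cond2, c_lt_8):
    cond_b = cond2 and not cond1      # CondB, computed off the critical path
    if cond_b and c_lt_8:
        return "B"
    elif cond1:
        return "A"
    return "C"

for c1, c2, lt in product([False, True], repeat=3):
    assert original(c1, c2, lt) == reordered(c1, c2, lt)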
Timing – Reorder Paths
• The gate is moved out of the critical path → shorter delay; the timing
requirement is met

Figure 1.11: Logic reordered to reduce critical path


5. Summary
Summary
• A high-throughput architecture: maximizes the number of
bits per second that can be processed by a design
• Unrolling an iterative loop increases throughput
• Penalty: a proportional increase in area
• A low-latency architecture: minimizes the delay from the input
of a module to the output
• Latency can be reduced by removing pipeline registers
• Penalty: an increase in combinatorial delay between registers
• Timing refers to the clock speed of a design
Summary
• A design meets timing: maximum delay between any two
sequential elements < minimum clock period
• Timing can be improved by:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
