
WiCHIP Technologies

Engineering your ideas


Advanced FPGA Design
Architecting Speed
Vu-Duc Ngo
duc75@wichiptech.com
Contents
• 1. Introduction to Architecting Speed
• 2. High Throughput
• 3. Low Latency
• 4. Timing
• 5. Summary
1. Introduction to Architecting Speed
Introduction to Architecting Speed
• When dealing with digital designs, optimizations are required
to meet design constraints.
• One of three primary physical characteristics of a digital design
is: speed
• Three primary definitions of speed:
• Throughput: the amount of data processed per clock cycle (commonly measured in bits/s)
• Latency: time between data input – processed data output (time or
clock cycles)
• Timing: logic delays between sequential elements (clock period,
frequency)
Introduction to Architecting Speed
• To improve speed, we need different architectures:
• High-throughput architectures: maximizing number of processed
bits/second
• Low-latency architectures: minimizing delay from input → output
• Timing optimizations: reducing combinatorial delay of the critical
path
• Techniques for timing optimization:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
2. High Throughput
High Throughput
• High-throughput design is:
• More concerned with the steady-state data rate
• Less concerned with latency (i.e., the time a specific piece of data requires to
propagate through the design)
• High-throughput design employs the so-called pipeline
architecture
• New data can begin processing before the prior data has finished → improving
system throughput
• Let us consider some examples to illustrate the beauty of
pipeline architecture
High Throughput
• Consider the following piece of code:

XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

• This code is an iterative algorithm: after three iterations, XPower holds X^3
• The same variables and addresses are accessed until the computation completes
• We can also implement the above code in hardware using
Verilog as follows:
High Throughput
module power3(
    output reg [7:0] XPower,   // declared reg in the port list (ANSI style)
    output finished,
    input [7:0] X,
    input clk, start);   // start is asserted for a single clock
  reg [7:0] ncount;
  assign finished = (ncount == 0);
  always @(posedge clk)
    if(start) begin
      XPower <= X;
      ncount <= 2;
    end
    else if(!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * X;
    end
endmodule
High Throughput
• The same register and computational resources are used until the completion of
the computation

Figure 1.1: Iterative Implementation (a single multiplier whose registered output
feeds back into XPower)

• No new computations can begin until the previous computation has completed
High Throughput
• Handshaking signals are required: start, finished
• Performance of the above Verilog implementation:
• Throughput = 8 bits / 3 clocks ≈ 2.7 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Let us have a look at a pipelined version of the same algorithm


High Throughput
module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2
    X2 <= X1;
    XPower2 <= XPower1 * X1;
    // Pipeline stage 3
    XPower <= XPower2 * X2;
  end
endmodule
High Throughput
• The pipelined implementation can be described by the following figure:

Figure 1.2: Pipelined Implementation (one multiplier delay in the critical path;
1st clock: compute X^2; 2nd clock: compute X*X^2 and the next X^2;
3rd clock: output X^3 while the following values advance through the pipeline)

High Throughput
• Performance of the pipelined implementation:
• Throughput = 8 bits / 1 clock = 8 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Clearly, this implementation has improved throughput by a factor of 3

• Unrolling an iterative loop increases throughput


• In general, unrolling an iterative loop of n iterations into a pipelined
implementation increases throughput by a factor of n

• Penalty: increase in area (more registers, more multipliers)
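• The throughput arithmetic above can be checked with a quick cycle-count model (a Python sketch, not part of the original slides; the function names and the 1000-result workload are illustrative):

```python
# Cycle-count model of the two power3 architectures (8-bit data, X**3).
# Iterative: one result every 3 clocks, no overlap.
# Pipelined: one result per clock once the 3-stage pipe is full.

def iterative_cycles(n_results, cycles_per_result=3):
    """Total clocks to produce n results when computations cannot overlap."""
    return n_results * cycles_per_result

def pipelined_cycles(n_results, stages=3):
    """Total clocks with a full pipeline: fill latency + one result per clock."""
    return stages + (n_results - 1)

n = 1000
print(8 * n / iterative_cycles(n))   # ~2.67 bits/clock
print(8 * n / pipelined_cycles(n))   # ~7.98 bits/clock, approaching 8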


3. Low Latency
Low latency
• Low-latency design: passes data from the input to the output as quickly
as possible
• To achieve this goal, intermediate processing delays must be minimized
• Often, a low-latency design requires:
• Parallelism
• Removal of pipelining
• Logical shortcuts
• Observing the previous pipelined implementation, we can see the
possibility of reducing latency:
• At each pipeline stage, the product of each multiply must wait for the
next clock edge to propagate to the next stage
• Therefore, we expect that removing pipeline registers → reduced
latency
Low latency
module power3(
    output [7:0] XPower,
    input [7:0] X
);
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  assign XPower = XPower2 * X2;
  always @* begin
    X1 = X;
    XPower1 = X;
  end
  always @* begin
    X2 = X1;
    XPower2 = XPower1 * X1;
  end
endmodule
Low latency
• The low-latency implementation is illustrated as follows:

Figure 1.3: Low-latency Implementation (pipeline registers removed;
two multiplier delays in the critical path)

Low latency
• Performance of this low-latency design:
• Throughput = 8 bits/clock (assuming one new input per clock)
• Latency = 0 clocks (between one and two multiplier delays of combinatorial time)
• Timing = Two multiplier delays in the critical path

• Latency can be reduced by removing pipeline registers

• Penalty: increase in combinatorial delay


4. Timing
Timing
• Timing refers to the clock speed of a design
• Maximum delay between any two sequential elements in a design will
determine the max clock speed.
• Maximum speed, or maximum frequency, is defined as follows:

Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup + Tskew)

• where
• Fmax: maximum allowable frequency for the clock
• Tclk-q: time from clock arrival until data arrives at Q
• Tlogic: propagation delay through the logic between flip-flops
• Trouting: routing delay between flip-flops
• Tsetup: setup time of the capturing flip-flop
• Tskew: propagation delay (skew) of the clock
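• Plugging sample numbers into the expression shows where the clock period goes (a Python sketch; all delay values below are hypothetical, not taken from the slides):

```python
# Illustrative Fmax budget. All delay values are made-up examples, in ns.
t_clk_q   = 0.5   # clock-to-Q of the launching flip-flop
t_logic   = 2.0   # combinatorial logic between flip-flops
t_routing = 1.0   # routing between flip-flops
t_setup   = 0.5   # setup time of the capturing flip-flop
t_skew    = 0.2   # clock propagation delay (skew)

t_min_period = t_clk_q + t_logic + t_routing + t_setup + t_skew   # 4.2 ns
f_max = 1.0 / t_min_period          # in GHz, since the delays are in ns
print(round(f_max * 1000, 1))       # ~238.1 MHz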
Timing – Add Register Layers
• Adding intermediate layers of registers to the critical path is the
first strategy for architectural timing improvement
• This technique should be used in highly pipelined designs, where:
• An additional clock cycle latency does not violate the design
specifications
• The overall functionality will not be affected by the further addition of
registers

• For example: assume the architecture for the following FIR implementation does
not meet timing:
Timing – Add Register Layers
module fir(
    output reg [7:0] Y,   // declared reg in the port list (ANSI style)
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  always @(posedge clk)
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      Y <= A*X + B*X1 + C*X2;
    end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:

Figure 1.4: MAC with long path (one adder delay plus one multiplier delay in the
critical path is greater than the minimum clock period requirement, i.e., timing
is not met)

• In this implementation, all multiply/add operations occur in one clock cycle
Timing – Add Register Layers
• To resolve the issue, we can further pipeline this design:
• By adding extra registers after the multipliers
• Under the assumption that the latency requirement is not fixed at 1 clock

• We just have to add a pipeline layer between the multipliers and the
adder
Timing – Add Register Layers
module fir(
    output reg [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);
  reg [7:0] X1, X2;
  reg [7:0] prod1, prod2, prod3;   // registers added
  always @(posedge clk) begin
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule
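• A quick behavioral model (a Python sketch with arbitrary coefficients and sample data, not from the slides) confirms that registering the products only delays Y by one clock without changing the computed values:

```python
# Behavioral model of the two FIR versions: Y = A*x[n] + B*x[n-1] + C*x[n-2].
# The pipelined version registers the three products, adding one clock of
# latency but producing the same sequence of sums.

def fir_direct(samples, A, B, C):
    x1 = x2 = 0
    out = []
    for x in samples:
        out.append(A * x + B * x1 + C * x2)   # multiply and add in one step
        x2, x1 = x1, x
    return out

def fir_pipelined(samples, A, B, C):
    x1 = x2 = 0
    p1 = p2 = p3 = 0
    out = []
    for x in samples:
        out.append(p1 + p2 + p3)              # adder uses registered products
        p1, p2, p3 = A * x, B * x1, C * x2    # product registers
        x2, x1 = x1, x
    return out

xs = [3, 1, 4, 1, 5, 9, 2, 6]
direct = fir_direct(xs, 2, 3, 5)
piped = fir_pipelined(xs + [0], 2, 3, 5)      # one extra clock to flush
assert piped[1:] == direct                    # same results, one clock later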
Timing – Add Register Layers
• The above implementation is described by the following figure:

Figure 1.5: Pipeline registers added (intermediate registers after each
multiplier leave only one multiplier delay or one adder delay in any path;
the shorter critical path means the timing requirement is met)

Timing – Add Register Layers

• Adding register layers improves timing by dividing the critical path into
two paths of smaller delay.

• Penalty: increase in latency and area


Timing – Parallel Structures
• Reorganizing the critical path such that logic structures are
implemented in parallel is the second strategy for architectural
timing improvement
• This technique should be used whenever:
• A function that currently evaluates through a serial string of logic
• But can be broken and evaluated in parallel

• For instance, assume that the standard pipelined power-of-3 design discussed
above does not meet timing
• To create parallel structures:
• Break up the multipliers into independent operations
• Recombine them to get results
Timing – Parallel Structures
• For example:
• To compute the square of an 8-bit binary number X, i.e., X^2, we first
represent:

X = {A, B}

where A and B are the most and the least significant nibbles, respectively
• Then, the multiply operation can be reorganized as follows:

X * X = {A, B} * {A, B} = (A*A << 8) + ((2*A*B) << 4) + (B*B)

• In this way, we have reduced our problem to a series of 4-bit
multiplications that can be handled in parallel
• Recombining the products, we get the final result for the 8-bit
multiplication
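• The decomposition is easy to sanity-check in software (a quick Python check, not part of the slides):

```python
# Verify X*X == (A*A << 8) + (2*A*B << 4) + B*B for every 8-bit X,
# where A = X[7:4] (most significant nibble) and B = X[3:0].
# This works because X = 16*A + B, so X*X = 256*A*A + 32*A*B + B*B.
for x in range(256):
    a, b = x >> 4, x & 0xF
    recombined = (a * a << 8) + (2 * a * b << 4) + b * b
    assert recombined == x * x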
Timing – Parallel Structures
• Applying this principle, we can re-implement the power-of-3 design as
follows:

module power3(
    output [7:0] XPower,
    input [7:0] X,
    input clk);
  reg [7:0] XPower1;
  // partial product registers ([7:0]: a 4-bit x 4-bit product needs 8 bits)
  reg [7:0] XPower2_ppAA, XPower2_ppAB, XPower2_ppBB;
  reg [7:0] XPower3_ppAA, XPower3_ppAB, XPower3_ppBB;
  reg [7:0] X1, X2;
  wire [7:0] XPower2;
Timing – Parallel Structures
  // nibbles for partial products (A: MS nibble, B: LS nibble)
  wire [3:0] XPower1_A = XPower1[7:4];
  wire [3:0] XPower1_B = XPower1[3:0];
  wire [3:0] X1_A = X1[7:4];
  wire [3:0] X1_B = X1[3:0];
  wire [3:0] XPower2_A = XPower2[7:4];
  wire [3:0] XPower2_B = XPower2[3:0];
  wire [3:0] X2_A = X2[7:4];
  wire [3:0] X2_B = X2[3:0];
  // assemble partial products (combine to get the 8-bit results)
  assign XPower2 = (XPower2_ppAA << 8) + (2*XPower2_ppAB << 4) + XPower2_ppBB;
  assign XPower  = (XPower3_ppAA << 8) + (2*XPower3_ppAB << 4) + XPower3_ppBB;
Timing – Parallel Structures
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2: 4-bit partial products are handled in parallel
    X2 <= X1;
    XPower2_ppAA <= XPower1_A * X1_A;
    XPower2_ppAB <= XPower1_A * X1_B;
    XPower2_ppBB <= XPower1_B * X1_B;
    // Pipeline stage 3
    XPower3_ppAA <= XPower2_A * X2_A;
    XPower3_ppAB <= XPower2_A * X2_B;
    XPower3_ppBB <= XPower2_B * X2_B;
  end
endmodule
Timing – Flatten Logic Structures
• Flattening logic structures is the third strategy for architectural
timing improvement
• This technique is closely related to the idea of parallel structures, but
applies specifically to logic that is chained due to priority encoding.

• Normally, synthesis and layout tools are not smart enough to break up
logic structures coded in a serial fashion
• They also do not have enough information relating to the priority
requirements of the design
• Therefore, it is possible that they are unable to give us a configuration
with our desired timing
Timing – Flatten Logic Structures
• For example:
• Consider the following control signals coming from an address decode
that are used to write 4 registers:
module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);   // control signals are coded with a priority relative to one another
  always @(posedge clk)
    if(ctrl[0]) rout[0] <= in;
    else if(ctrl[1]) rout[1] <= in;
    else if(ctrl[2]) rout[2] <= in;
    else if(ctrl[3]) rout[3] <= in;
endmodule
Timing – Flatten Logic Structures
• The above type of priority encoding is implemented as follows:

Figure 1.6: Priority encoding (combinatorial logic gates are required by the
priority encoding, introducing delay)

Timing – Flatten Logic Structures
• A time delay is required for the priority logic → timing is not met
• The solution is to remove the priority and thereby flatten the logic

module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);   // priority removed: control signals act independently
  always @(posedge clk) begin
    if(ctrl[0]) rout[0] <= in;
    if(ctrl[1]) rout[1] <= in;
    if(ctrl[2]) rout[2] <= in;
    if(ctrl[3]) rout[3] <= in;
  end
endmodule
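• Note that flattening is only safe when the priority truly is not needed: if more than one ctrl bit can be asserted at once, the two versions behave differently. A small Python model of one clock edge (an illustration, not from the slides) makes this visible:

```python
# Model one clock edge of each regwrite variant for a 4-bit ctrl word.

def regwrite_priority(rout, ctrl, inp):
    """if/else-if chain: only the lowest-indexed asserted bit is written."""
    out = list(rout)
    for i in range(4):
        if ctrl & (1 << i):
            out[i] = inp
            break                      # priority: stop at the first match
    return out

def regwrite_flat(rout, ctrl, inp):
    """independent ifs: every asserted bit is written."""
    out = list(rout)
    for i in range(4):
        if ctrl & (1 << i):
            out[i] = inp
    return out

# One-hot ctrl (e.g., from an address decode): both behave identically.
assert regwrite_priority([0]*4, 0b0100, 1) == regwrite_flat([0]*4, 0b0100, 1)
# Multiple bits set: the flattened version writes all of them.
assert regwrite_priority([0]*4, 0b0101, 1) == [1, 0, 0, 0]
assert regwrite_flat([0]*4, 0b0101, 1) == [1, 0, 1, 0]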
Timing – Flatten Logic Structures
Figure 1.7: No priority encoding (the gates are removed; each control signal
acts independently and controls its corresponding rout bit)

Timing – Flatten Logic Structures

• By removing priority encodings where they are not needed, the logic
structure is flattened and the path delay is reduced
Timing – Register Balancing
• Register balancing is the fourth strategy for architectural timing
improvement
• Idea of register balancing: redistribute logic evenly between
registers to minimize the worst-case delay between any two registers
• This technique should be used whenever:
• Logic is highly imbalanced between the critical path and an adjacent
path

• Although many synthesis tools have a so-called register-balancing
optimization, they can handle only simple cases in a predetermined
fashion
• Thus, designers must be able to redistribute logic in their own
way to reduce the worst-case delay
Timing – Register Balancing
• Consider the following code for an adder that adds three 8-bit inputs:

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);
  reg [7:0] rA, rB, rC;
  always @(posedge clk) begin
    rA <= A;
    rB <= B;
    rC <= C;
    Sum <= rA + rB + rC;
  end
endmodule
Timing – Register Balancing
Figure 1.8: Registered adder (no logic in the 1st register stage; both additions
feed Sum in the second stage, which determines the worst-case delay)
Timing – Register Balancing
• Some of the logic in the critical path can be moved back a stage, thereby
balancing the logic load between two register stages.

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);
  reg [7:0] rABSum, rC;
  always @(posedge clk) begin
    rABSum <= A + B;    // one addition moved back a stage
    rC <= C;
    Sum <= rABSum + rC;
  end
endmodule
Timing – Register Balancing
Figure 1.9: Register balanced (one of the add operations is moved back a stage,
balancing the logic between stages and reducing the delay in the critical path)
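• Both adder versions compute the same sums with the same latency; a small cycle model (a Python sketch with arbitrary input data, not from the slides) shows they match clock for clock:

```python
# Cycle model of the two adders: both have 2 clocks of latency from the
# inputs to Sum; the balanced one splits the two additions across stages.

def adder_unbalanced(stream):
    rA = rB = rC = s = 0
    out = []
    for a, b, c in stream:
        out.append(s)
        s = rA + rB + rC             # both adds in the second stage
        rA, rB, rC = a, b, c         # first stage: registers only
    return out

def adder_balanced(stream):
    rABSum = rC = s = 0
    out = []
    for a, b, c in stream:
        out.append(s)
        s = rABSum + rC              # one add per stage
        rABSum, rC = a + b, c        # first stage now does A + B
    return out

data = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (0, 0, 0)]
assert adder_unbalanced(data) == adder_balanced(data)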


Timing – Reorder Paths
• Reordering the paths in the data flow to minimize the critical
path is the fifth strategy

• This technique should be used whenever:


• Multiple paths combine with the critical path
• The combined path can be reordered such that the critical path
becomes shorter
Timing – Reorder Paths
• Consider the following module:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  always @(posedge clk)
    if(Cond1)
      Out <= A;
    else if(Cond2 && (C < 8))
      Out <= B;
    else
      Out <= C;
endmodule
Timing – Reorder Paths
• The critical path is between C and Out: the comparator in series with two gates

Figure 1.10: Long critical path


Timing – Reorder Paths
• We can modify the code to move the comparator's long delay off the critical path:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);
  wire CondB = (Cond2 & !Cond1);
  always @(posedge clk)
    if(CondB && (C < 8))
      Out <= B;
    else if(Cond1)
      Out <= A;
    else
      Out <= C;
endmodule
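• The reordering is only safe because both if/else chains select the same source for every input combination; a brute-force check over all condition combinations (a Python sketch, not from the slides) confirms the equivalence:

```python
# Exhaustively check that the reordered priority chain picks the same
# source (A, B, or C) as the original for every Cond1/Cond2/(C < 8) combo.
from itertools import product

def original(cond1, cond2, c_lt_8):
    if cond1:
        return "A"
    elif cond2 and c_lt_8:
        return "B"
    return "C"

def reordered(cond1, cond2, c_lt_8):
    cond_b = cond2 and not cond1      # CondB, computed off the critical path
    if cond_b and c_lt_8:
        return "B"
    elif cond1:
        return "A"
    return "C"

for c1, c2, lt in product([False, True], repeat=3):
    assert original(c1, c2, lt) == reordered(c1, c2, lt)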
Timing – Reorder Paths
• The gate is moved out of the critical path → shorter delay; the timing
requirement is met

Figure 1.11: Logic reordered to reduce critical path


5. Summary
Summary
• A high-throughput architecture: maximizes the number of
bits per second that can be processed by a design
• Unrolling an iterative loop increases throughput
• Penalty: a proportional increase in area
• A low-latency architecture: minimizes the delay from the input
of a module to the output
• Latency can be reduced by removing pipeline registers
• Penalty: an increase in combinatorial delay between registers
• Timing refers to the clock speed of a design
Summary
• A design meets timing: maximum delay between any two
sequential elements < minimum clock period
• Timing can be improved by:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
