Budapest University of Technology and Economics

RTL Optimization Techniques
Péter Horváth
Department of Electron Devices

August 7, 2014


Contents

timing optimization concepts and design techniques
throughput, latency, local datapath delay
loop unrolling, removing pipeline registers, register balancing

area optimization concepts and design techniques
resource requirement metrics in standard cell ASIC and FPGA
control-based logic reuse, priority encoders, considering technology primitives

additional readings


Timing optimization

Computation performance concepts

There are three important concepts related to computation performance.

throughput: The amount of data processed per unit time (bits per second, or bits per clock cycle).
latency: The time elapsed between data input and the corresponding processed data output (clock cycles).
local datapath delay: The delay of the logic between storage elements (nanoseconds). It determines the maximum clock frequency.
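As a rough illustration of how these three quantities relate, the following sketch assumes a clock frequency f_clk and a worst-case register-to-register delay t_max. Both symbols are introduced here for illustration and do not appear on the slides; setup time and clock skew are assumed to be folded into t_max.

```latex
\begin{align}
f_{\max} &= \frac{1}{t_{\max}} \\
\text{throughput [bit/s]} &= \text{throughput [bit/cycle]} \cdot f_{\text{clk}} \\
\text{latency [s]} &= \frac{\text{latency [cycles]}}{f_{\text{clk}}}
\end{align}
```

For example, a design that processes 8 bits per cycle at a hypothetical 100 MHz clock delivers 800 Mbit/s, and a latency of 3 cycles corresponds to 30 ns.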

Timing optimization techniques: High throughput – loop unrolling (pipeline)

In high-throughput optimization the time required to process a single data item is irrelevant; the time elapsed between two input reads is minimized. Data n+1 is read while data n is still being processed.

Iterative implementation (latency: 3 cycles, throughput: 8/3 = 2.7 bits/cycle):

    architecture iterative of pow3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (start = '1') then
            -- load the input and start the iteration
            count <= 2;
            pow   <= x;
          elsif (stop = '0') then
            -- one multiplication per clock cycle
            count <= count - 1;
            pow   <= pow * x;
          end if;
        end if;
      end process;
      stop <= '1' when count = 0 else '0';
    end architecture;

Pipelined implementation (latency: 3 cycles, throughput: 8/1 = 8 bits/cycle):

    architecture pipelined of pow3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- stage 1
          x1   <= x;
          -- stage 2
          x2   <= x1;
          pow1 <= x1 * x1;
          -- stage 3
          pow  <= pow1 * x2;
        end if;
      end process;
    end architecture;

[Figure: block diagrams of the two pow3 datapaths. Iterative: throughput 8/3 = 2.7 bits/cycle, latency 3 cycles. Pipelined: throughput 8/1 = 8 bits/cycle, latency 3 cycles.]

Timing optimization techniques: Low latency – removing pipeline registers

The objective of low-latency optimization is to pass the data from the input to the output with minimal internal processing delay. A low-latency design uses parallelism and removes pipeline registers.

Pipelined implementation (latency: 3 cycles):

    architecture pipelined of pow3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- stage 1
          x1   <= x;
          -- stage 2
          x2   <= x1;
          pow1 <= x1 * x1;
          -- stage 3
          pow  <= pow1 * x2;
        end if;
      end process;
    end architecture;

Implementation without pipeline registers (latency: 1 cycle, with an additional output register):

    architecture async of pow3 is
    begin
      process (x)
      begin
        x1 <= x;
      end process;

      process (x1)
      begin
        x2   <= x1;
        pow1 <= x1 * x1;
      end process;

      -- purely combinational output
      pow <= pow1 * x2;
    end architecture;

[Figure: block diagrams of the pow3 datapath without pipeline registers (latency: 1 cycle) and of the pipelined pow3 datapath (latency: 3 cycles).]

Timing optimization techniques: Minimizing logic delay – register layers

The logic between two sequential elements is called a local datapath. The delay of the slowest local datapath determines the maximum clock frequency. The local datapath delay can be reduced by inserting additional register layers.

    architecture single_cycle of fir is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (valid = '1') then
            x1 <= x;
            x2 <= x1;
            -- multiplication and addition in the same local datapath
            y  <= A*x + B*x1 + C*x2;
          end if;
        end if;
      end process;
    end architecture;

    architecture multi_cycle of fir is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (valid = '1') then
            x1    <= x;
            x2    <= x1;
            -- register layer between the multipliers and the adder
            prod1 <= A * x;
            prod2 <= B * x1;
            prod3 <= C * x2;
            y     <= prod1 + prod2 + prod3;
          end if;
        end if;
      end process;
    end architecture;

[Figure: block diagrams of the two fir datapaths. single_cycle: local datapaths contain 1 adder and 1 multiplier. multi_cycle: local datapaths contain 1 adder or 1 multiplier.]

Timing optimization techniques: Minimizing logic delay – register balancing

During register balancing the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.

    architecture not_balanced of add3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          reg_a <= in_a;
          reg_b <= in_b;
          reg_c <= in_c;
          -- two adders between the input registers and sum
          sum   <= reg_a + reg_b + reg_c;
        end if;
      end process;
    end architecture;

    architecture balanced of add3 is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- one adder before and one adder after the register layer
          reg_ab_sum <= in_a + in_b;
          reg_c      <= in_c;
          sum        <= reg_ab_sum + reg_c;
        end if;
      end process;
    end architecture;

[Figure: block diagrams of the two add3 datapaths. not_balanced: local datapaths contain 2 adders. balanced: local datapaths contain 1 adder.]

Area optimization

Area concepts

The resource requirement is the number of basic functional primitives needed to implement the described functionality.

The basic functional primitives in standard cell ASICs are the standard cells, which can be simple logic gates and flip-flops, but also more complex arithmetic-logic functions or memories.

The basic logic element (BLE) of an FPGA consists of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks, signal processing elements (multipliers), etc.

Area optimization techniques: Minimizing area – control-based logic reuse

Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. These resources can be reused at the cost of reduced throughput. Control-based logic reuse requires an FSM to generate the control signals.

[Figure: two datapaths summing in1..in4 into acc: a pipelined version with registers plr1 and plr2, and a reused-adder version controlled by an FSM (signals sel_input, zero, ce_acc).]
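The idea of control-based logic reuse can be sketched as follows. This is a minimal illustration and not a reconstruction of the slide's figure: the entity name sum4_reuse, the start and done handshake signals and the internal signals sel, busy and operand are all assumptions made for this example. One adder is shared over four clock cycles, and a small counter-based FSM selects the input to be accumulated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: adds four inputs with one shared adder in four cycles.
entity sum4_reuse is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;
    start : in  std_logic;
    in1, in2, in3, in4 : in unsigned(31 downto 0);  -- must stay stable while busy
    acc   : out unsigned(31 downto 0);
    done  : out std_logic
  );
end entity;

architecture reuse of sum4_reuse is
  signal sel     : unsigned(1 downto 0);   -- FSM state, also the mux select
  signal busy    : std_logic;
  signal operand : unsigned(31 downto 0);
  signal acc_reg : unsigned(31 downto 0);
begin
  -- input multiplexer driven by the control logic
  with sel select operand <=
    in1 when "00",
    in2 when "01",
    in3 when "10",
    in4 when others;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if reset = '1' then
        sel     <= "00";
        busy    <= '0';
        acc_reg <= (others => '0');
      elsif busy = '0' then
        if start = '1' then
          acc_reg <= operand;           -- sel = "00", so operand = in1
          sel     <= "01";
          busy    <= '1';
        end if;
      else
        acc_reg <= acc_reg + operand;   -- the single shared adder
        if sel = "11" then
          sel  <= "00";
          busy <= '0';
          done <= '1';                  -- acc is valid in the cycle done = '1'
        else
          sel <= sel + 1;
        end if;
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture;
```

Compared with a fully parallel version using three adders, this sketch needs one 32-bit adder plus a multiplexer and a 2-bit counter, but it accepts a new set of inputs only every four cycles.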

Area optimization techniques: Minimizing area – priority encoders

The resource requirement can be reduced if mutual exclusion is exploited. The elsif statement should be used only if a priority encoder is required and the conditions are not mutually exclusive.

    architecture priority of logic is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- each branch depends on all previous conditions
          if (ctrl(0) = '1') then
            output(0) <= input;
          elsif (ctrl(1) = '1') then
            output(1) <= input;
          elsif (ctrl(2) = '1') then
            output(2) <= input;
          elsif (ctrl(3) = '1') then
            output(3) <= input;
          end if;
        end if;
      end process;
    end architecture;

    architecture not_priority of logic is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          -- independent conditions, no priority logic
          if (ctrl(0) = '1') then
            output(0) <= input;
          end if;
          if (ctrl(1) = '1') then
            output(1) <= input;
          end if;
          if (ctrl(2) = '1') then
            output(2) <= input;
          end if;
          if (ctrl(3) = '1') then
            output(3) <= input;
          end if;
        end if;
      end process;
    end architecture;

[Figure: register enable logic for output_a .. output_d, without exploiting mutual exclusion (each select depends on all lower-indexed ctrl bits) and with exploiting mutual exclusion (each select depends on a single ctrl bit).]

Area optimization techniques: Minimizing area – considering technology primitives

With an appropriate HDL coding style a more efficient logic synthesis can be achieved. Synthesis tool vendors usually provide coding guidelines to improve the resource requirement or timing parameters of the design. The proposed coding style takes the unique characteristics of the technology primitives into consideration.

utilizing block RAM modules in FPGAs: Block RAM modules do not have a reset input for the memory contents, and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
utilizing high quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
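For the DSP case, a minimal sketch of a multiplier with a synchronous (registered) output is shown here. The entity name mult_sync and the 18-bit operand width are assumptions chosen for illustration; whether a given synthesis tool actually maps this description onto a DSP slice depends on the tool and the target device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: multiplier with a registered (synchronous) output.
entity mult_sync is
  port (
    clk : in  std_logic;
    a   : in  signed(17 downto 0);   -- operand width is an assumption
    b   : in  signed(17 downto 0);
    p   : out signed(35 downto 0)    -- full-width product: 18 + 18 bits
  );
end entity;

architecture rtl of mult_sync is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- the product is registered, matching the synchronous DSP output
      p <= a * b;
    end if;
  end process;
end architecture;
```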

Area optimization techniques: Minimizing area – considering technology primitives

    architecture FFS of RAM is
    begin
      process (clk, reset)
      begin
        if (reset = '1') then
          -- asynchronous reset of the whole memory content
          content <= (others => (others => '0'));
        elsif (rising_edge(clk)) then
          if (write = '1') then
            content(address) <= data_in;
          end if;
        end if;
      end process;
      -- asynchronous read
      data_out <= content(address);
    end architecture;

Because of the asynchronous output this model cannot be implemented in block RAM. The reset function prevents a LUT RAM implementation as well, so the model is implemented with flip-flops.

    architecture BRAM of RAM is
    begin
      process (clk)
      begin
        if (rising_edge(clk)) then
          if (write = '1') then
            content(address) <= data_in;
          end if;
          -- synchronous read
          data_out <= content(address);
        end if;
      end process;
    end architecture;

This model can be implemented as flip-flops, LUT RAM or block RAM.

Additional readings

Steve Kilts – Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris – Digital Design and Computer Architecture
Peter J. Ashenden – Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime – Logic and Computer Design Fundamentals
Pong P. Chu – RTL Hardware Design Using VHDL
Peter Wilson – Design Recipes for FPGAs