
CHAPTER-1:

INTRODUCTION

Field-programmable gate arrays (FPGAs) enable engineers to implement custom digital systems on the latest technology nodes without the expensive nonrecurring engineering costs associated with application-specific integrated circuits (ASICs). Historically, FPGAs have mainly been used as an alternative to ASICs (when ASIC costs cannot be justified) or for ASIC prototyping. Recently, with the increasing demand for more energy-efficient computer chips and the continuous development of high-level synthesis (HLS) tools, FPGAs have been gaining interest as general-purpose accelerators in data centers. The adoption of FPGAs into Microsoft data centers and Azure hardware is a concrete example of using FPGAs as general-purpose accelerators for applications such as machine learning, search engines, and networking. Although Kuon et al. have shown that FPGA power consumption is worse than that of their ASIC counterparts, the fact that FPGAs can implement custom data paths with no instruction streams means they can often achieve large throughput gains versus central processing units (CPUs) and graphics processing units (GPUs). FPGAs have also historically had a lower absolute power per device compared to high-end CPUs and GPUs. Unfortunately, in recent technology nodes (14 nm and below), FPGA power consumption has risen to be on par with that of high-end CPUs. Fig. 1 depicts the power consumption of CPUs versus FPGAs across different technology nodes. The CPU power consumption data used in this figure are obtained from the thermal design power reported by Intel for desktop CPUs. To compute the FPGA power consumption, we used the Intel Early Power Estimator and assumed an application with 70% core (logic, BRAM, and DSP) utilization. The figure shows that in earlier technology nodes FPGAs consumed less power than CPUs. However, the gap has almost disappeared for Stratix 10 (14 nm), as FPGA power consumption increased more than 2× compared to the previous generation. This large increase in FPGA power consumption is mainly due to the end of Dennard scaling. Earlier technology nodes followed Dennard scaling, in which the supply voltage (Vdd) was reduced with the transistor size such that the electric field remained constant. This allowed the semiconductor industry to increase the density of transistors, reduce gate delays, and maintain a constant switching power density with each successive technology node. Unfortunately, below 90 nm, Vdd scaling has significantly slowed due to leakage-current-imposed constraints on the transistor threshold voltage (Vth).
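
To make the scaling argument concrete, the standard dynamic-power identity (a textbook relation, not a result of this work) shows why constant-field scaling kept switching power density flat, and why it rises once Vdd stops scaling:

```latex
P_{\text{dyn}} = \alpha\, C\, V_{dd}^{2}\, f
\qquad \text{Dennard scaling by } 1/k:\quad
C \to \frac{C}{k},\;\; V_{dd} \to \frac{V_{dd}}{k},\;\; f \to k f
\;\Longrightarrow\;
P_{\text{dyn}} \to \frac{P_{\text{dyn}}}{k^{2}},\qquad
\frac{P_{\text{dyn}}}{\text{area}} \to \text{constant, since area} \to \frac{\text{area}}{k^{2}}
```

With Vdd held roughly constant in recent nodes, only C shrinks per transistor while transistor density still grows, so power density increases with each node.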

Fig. 1. Intel CPU and FPGA power consumption trend with technology node.

Fig. 2. Nominal supply voltage and equivalent LE count of FPGAs with different technology nodes.

Fig. 2 shows the supply voltage and the number of equivalent logic elements (LEs) per chip for FPGAs manufactured by Intel using different technology nodes. The figure shows that while the number of LEs has been growing rapidly, Vdd has not been scaling in recent technology nodes. Using dynamic voltage scaling (DVS), we can reduce the pessimism in the FPGA timing models and run each FPGA with the minimum Vdd that ensures safe operation at the desired clock frequency. Academics have proposed different techniques to enable a DVS solution for FPGAs. All these techniques are chip- and application-specific; they scale Vdd according to the delay of the application's speed-limiting parts. These works show that using DVS for FPGAs results in a 30% power reduction compared to using a fixed Vdd while clocking the FPGA at the reported Fmax. Although DVS has not yet been commercially adopted in FPGAs, the significant increase in FPGA power consumption has driven FPGA vendors to related techniques. Intel's latest SmartVID solution enables each FPGA to potentially operate at a lower voltage than the nominal supply voltage. During manufacturing testing, Intel determines the minimum Vdd at which all the resources [lookup tables (LUTs), routing, etc.] meet the performance requirements. This voltage is stored in nonvolatile registers and can be read to determine the Vdd of each chip. The end of Dennard scaling caused CPUs to hit the power wall (the ∼100 W that can be cost-effectively dissipated from a device) many years ago. Given that a typical design on Stratix 10 or UltraScale+ can consume more than 100 W, we believe that power consumption has now become a primary design constraint for FPGAs, just as it did for CPUs when they hit this threshold.

CHAPTER-2:
LITERATURE SURVEY
Becoming More Tolerant: Designing FPGAs for Variable Supply Voltage,

Ibrahim Ahmed

From this literature, we learn about the delay sensitivity of existing FPGA circuits to voltage scaling, in order to identify which circuits are least tolerant of reduced voltage. The authors measure the delay of different paths on a Cyclone IV and a Stratix V FPGA while varying Vdd, using a path under test (PUT) of their own design.

Limitation: It cannot measure the delay of a single LUT or routing multiplexer.

Advantages: It identifies timing errors and determines whether each error occurred during a rising edge or a falling edge.

 Fig. 6 (in Chapter 3) shows the system developed to measure the delay of a path under test (PUT), which is similar to systems presented in prior work.

 The controller toggles the input of the source register (S) every three cycles and monitors the output of the destination register (D).

 If the destination register does not toggle at the expected clock cycle, then a timing error has occurred. The test is run at different clock frequencies.

 At each clock frequency, the controller exercises the path for 2^16 clock cycles and counts the number of errors observed.

 Fmax is defined as the maximum clock frequency that results in fewer than 2^15 timing errors. This definition minimizes the effects of clock jitter on the measured Fmax (a minimal sketch of this search procedure follows this list).

 The controller also identifies whether the error occurred during a rising or falling transition, so rise and fall delays can be differentiated.
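
A minimal Python sketch of this Fmax-search procedure is shown below. The helpers program_pll and run_put_test are hypothetical board-interface functions (standing in for PLL reconfiguration and reading the controller's error counters); they are our illustration, not code from the paper.

```python
# Sketch of the Fmax search: sweep the PUT clock upward and return the
# highest frequency whose error count stays below the threshold.

TEST_CYCLES = 2 ** 16        # cycles the PUT is exercised per frequency
ERROR_THRESHOLD = 2 ** 15    # Fmax = highest freq with fewer errors than this

def measure_fmax(freq_grid_mhz, program_pll, run_put_test):
    """Return the highest frequency (MHz) with error count below threshold."""
    fmax = None
    for freq in sorted(freq_grid_mhz):
        program_pll(freq)                          # set the PUT clock
        rise_errs, fall_errs = run_put_test(TEST_CYCLES)
        if rise_errs + fall_errs < ERROR_THRESHOLD:
            fmax = freq                            # still passing: keep going
        else:
            break                                  # first failing frequency
    return fmax
```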

Care and Feeding of FPGA Power Supplies: A How and Why Guide to Success,

Nathan Enger

SmartVID is the Altera name for the technique of operating each individual FPGA at its optimal voltage,
as requested by the FPGA itself.

Limitation: Other techniques already in use around the industry accomplish much the same thing.

Advantages: A piece of compiled IP inside the FPGA can read this register and request, over an external bus, that the power supply provide this exact voltage.

Result: All of the requirements for Altera SmartVID can be met by the LTC388x family of power supply controllers. In addition, the µModule regulators easily meet the requirements and offer a complete solution in a single compact form.

There is a register inside the FPGA that contains a device-specific voltage (programmed at the factory) at which the FPGA is guaranteed to operate efficiently. A piece of compiled IP inside the FPGA can read this register and make a request over an external bus to the power supply to provide this exact voltage (Figure 4). Once the voltage is reached, it remains static during operation. The demands the SmartVID application places on the power supply include the specific bus protocol, voltage accuracy, and speed. The bus protocol is one of several methods the FPGA uses to communicate its required voltage to the power regulator. Of the available methods, PMBus is the most flexible because it addresses the widest variety of power management ICs. The SmartVID IP uses two PMBus commands, VOUT_MODE and VOUT_COMMAND, with which it commands the PMBus-compliant power regulator to the correct voltage.
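
As an illustration of those two commands, the sketch below encodes a target rail voltage the way a Linear16 PMBus regulator expects. The command codes (0x20, 0x21) come from the PMBus specification; the helper functions themselves are our own illustration, not Altera's SmartVID IP.

```python
# Encoding a requested voltage into a PMBus VOUT_COMMAND word (Linear16:
# voltage = mantissa * 2^exponent, exponent read from VOUT_MODE).

VOUT_MODE = 0x20      # standard PMBus command codes
VOUT_COMMAND = 0x21

def decode_vout_mode_exponent(vout_mode_byte: int) -> int:
    """Extract the signed 5-bit Linear16 exponent from a VOUT_MODE byte."""
    exp = vout_mode_byte & 0x1F
    return exp - 32 if exp & 0x10 else exp   # sign-extend two's complement

def vout_command_word(volts: float, exponent: int) -> int:
    """Convert a target voltage to the 16-bit Linear16 mantissa."""
    return round(volts / (2.0 ** exponent)) & 0xFFFF

# Example: with the common exponent of -12, 0.9 V encodes as
# round(0.9 * 4096) = 3686 = 0x0E66.
assert decode_vout_mode_exponent(0x14) == -12
assert vout_command_word(0.9, -12) == 0x0E66
```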

CHAPTER-3:
METHODS & METHODOLOGY
FPGA Architecture and Circuits:
To evaluate the effects of voltage scaling on the delay of current FPGAs, we need to understand the underlying FPGA architecture and circuits. In this article, we focus on island-style FPGAs that are composed of a 2-D array of tiles. As shown in Fig. 3, each tile is composed of a logic block (LB), switch block (SB), and connection block (CB). The LB includes several LUTs that provide the programmable logic of the FPGA. The SB is composed of routing multiplexers that connect different routing wires together and connect the output of the LB to routing wires. Multiplexers are the basic building units of FPGAs; they are used to build LUTs and to connect wires together to form the routing fabric. Multiplexers can be implemented using different topologies, such as single-stage, two-stage, or tree topologies; different topologies offer different area-versus-delay tradeoffs. Previous academic works and commercial FPGAs [30] have used the two-stage topology to implement the routing multiplexers in the connection blocks and SBs. Fig. 4 shows a 9-to-1 two-stage routing multiplexer; we use this topology to implement all routing multiplexers. The gate terminal of each pass transistor is connected to an SRAM cell; based on the configuration bitstream, only one of the inputs of the multiplexer will be connected to the buffer. Since an nMOS pass transistor cannot drive its output to a full logic high (Vdd), it is important to add the weak pull-up pMOS shown in Fig. 4 to restore the input of the buffer to a full Vdd. LUT multiplexers are conventionally implemented using a tree topology. Fig. 5 shows part of a 6-input LUT implemented using a tree-based multiplexer, along with the local multiplexer and drivers for the fastest two inputs of the LUT. The figure also shows the number of nMOS pass transistors in each level of the tree; the level closest to the SRAM cells (lvl1) has 64 transistors and is connected to the slowest input of the LUT. The local multiplexer (local MUX) shown in the figure is a routing multiplexer that selects which wire should be connected to the corresponding LUT input. The output of the local MUX is connected to the LUT through input drivers.

Fig. 5. Tree-based multiplexer that implements a 6-input LUT
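
To see why the two-stage topology is attractive, the following sketch counts SRAM configuration bits and pass transistors for an N-to-1 routing multiplexer under the usual textbook assumptions (one-hot select encoding, a roughly square two-level split); these are illustrative formulas, not figures from this article.

```python
import math

def single_stage(n):
    """One-hot single-stage mux: one SRAM bit and one transistor per input."""
    return {"sram_bits": n, "pass_transistors": n, "levels": 1}

def two_stage(n):
    """k first-stage muxes of size m (k*m >= n) feeding a k-to-1 second stage;
    select bits are shared across the first-stage muxes."""
    m = math.ceil(math.sqrt(n))
    k = math.ceil(n / m)
    return {"sram_bits": m + k, "pass_transistors": n + k, "levels": 2}

print(single_stage(9))  # {'sram_bits': 9, 'pass_transistors': 9, 'levels': 1}
print(two_stage(9))     # {'sram_bits': 6, 'pass_transistors': 12, 'levels': 2}
```

For the 9-to-1 multiplexer of Fig. 4, the two-stage form trades three extra pass transistors (and a second device in series on each path, hence more delay) for a third fewer SRAM cells, which is the area-versus-delay tradeoff mentioned above.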

COFFE 2:
Exploring new voltage-tolerant FPGA circuits entails designing the circuits and sizing their transistors. In
this article, we use COFFE 2, an automated transistor sizing tool for FPGAs . COFFE 2 divides the FPGA
tile into different subcircuits (SB MUX, CB MUX, LUT, etc.) and sizes all transistors in a subcircuit at a
time. During sizing, COFFE 2 tries to minimize the area-delay product of the FPGA tile. It uses a modified
version of the minimum width transistor area to model the area of the FPGA tile, and it measures the delay
of each subcircuit using HSPICE. COFFE 2 estimates the resistances and capacitances of the different
wires used by each subcircuit and generates an equivalent π-model for each wire. This model is used
during the HSPICE simulation to achieve a more accurate delay. COFFE 2 determines the lengths of the
wires based on the area of each subcircuit; more details on how COFEE 2 determines the lengths of the
different wires can be found in . COFFE 2 takes a description of the FPGA tile that will be sized as an
input. Throughout this article, we will use an FPGA architecture similar to the one presented in in this
architecture, the FPGA routing channel contains 320 tracks and each LB has ten 6-input LUTs.
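
A minimal sketch of the wire π-model mentioned above, using the usual lumped approximation (the wire's series resistance with half its total capacitance at each end); the per-unit values in the example are placeholders, not COFFE's extracted numbers.

```python
def wire_pi_model(length_um, r_per_um, c_per_um):
    """Return (R, C_near, C_far) of the equivalent pi network for a wire."""
    R = r_per_um * length_um          # total series resistance
    C = c_per_um * length_um          # total capacitance, split in half
    return R, C / 2.0, C / 2.0

# e.g., a 50 um wire at 1 ohm/um and 0.2 fF/um:
R, c_near, c_far = wire_pi_model(50, 1.0, 0.2)   # 50 ohm, 5 fF at each end
```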
EVALUATING THE DELAY AND POWER OF EXISTING FPGAS UNDER VOLTAGE
SCALING:
Hardware Delay Measurement of Commercial FPGAs:
We first study the delay sensitivity of existing FPGA circuits to voltage scaling so we can identify which circuits are the least tolerant of reduced voltage. In this study, we measured the delay of different paths on a Cyclone IV FPGA and a Stratix V FPGA while varying Vdd. Fig. 6 shows the system we developed to measure the delay of a path under test (PUT); it is similar to systems presented in prior work. The controller toggles the input of the source register (S) every three cycles and monitors the output of the destination register (D). If the destination register does not toggle at the expected clock cycle, then a timing error has occurred. We run this test at different clock frequencies. At each clock frequency, the controller exercises the path for 2^16 clock cycles and counts the number of errors observed. We define Fmax as the maximum clock frequency that results in fewer than 2^15 timing errors. This definition minimizes the effects of clock jitter on the measured Fmax. Our controller also identifies whether the error occurred during a rising or falling transition, so we can differentiate rise and fall delays. Since we cannot measure the delay of a single LUT or routing multiplexer, we crafted different paths such that one type of circuit dominates the path delay. For each target FPGA (Cyclone IV and Stratix V), we built two different paths: one that is routing dominated and another that is LUT dominated. To build a routing-dominated path, we placed the two registers (S and D) far apart and used routing constraints to ensure that the registers are connected using short wiring segments (R4/C4). To build an LUT-dominated path, we leveraged the fact that there are no hard adders on Cyclone IV, so we built a 64-bit adder that Quartus mapped to a chain of LUTs. Unfortunately, Stratix V does include a hard adder, so we could not use the same trick. For Stratix V, we manually built a chain of LUTs and used placement and routing constraints to force Quartus to place the LUTs close to each other and to always use the slowest input of the LUT. According to Quartus' timing report, the delays of our routing-dominated paths on Cyclone IV and Stratix V are composed of 95% interconnect (routing) delay and only 5% cell (LUT) delay. Around 90% of our LUT-dominated path delay on Cyclone IV is from LUTs, while 62% of the LUT-dominated path delay on Stratix V is from LUTs. The LUT-dominated path of Cyclone IV has a very high LUT delay portion because it uses the hard carry connection between LUTs and therefore contains almost no routing multiplexers.

For silicon measurements, we used a DE2-115 and a DE5 board equipped with a Cyclone IV EP4CE115F29C7 (60 nm) and a Stratix V 5SGXEA7N2F45C2 (28 nm) FPGA, respectively. We modified both boards to allow control over the supply voltage of the FPGA core. Cyclone IV and Stratix V have a nominal Vdd of 1.2 and 0.9 V, respectively.

Fig. 7. Measured Fmax versus Vdd on Cyclone IV.

Figs. 7 and 8 show the measured Fmax of the tested paths across a range of supply voltages, normalized to each path's measured Fmax at nominal Vdd, for Cyclone IV and Stratix V, respectively. Lowering Vdd any further than the values shown in the figures causes the FPGA to reset. The rise and fall Fmax of the LUT-dominated path in Cyclone IV were very similar, so we only show one curve for this path. Fig. 7 shows that in Cyclone IV an LUT-dominated path is slightly more sensitive to Vdd scaling than a routing-dominated path. At 0.8 V, the Fmax of an LUT-dominated path is 40% of its value at 1.2 V, whereas the rise/fall Fmax of a routing-dominated path drops to 47%/42% of its nominal speed. Fig. 8 shows that the delay of the Stratix V LUT-dominated path is significantly more sensitive to voltage than that of the routing-dominated path. The fall Fmax of an LUT-dominated path at 0.7 V drops to 52% of its nominal value (at 0.9 V), whereas for the routing-dominated path it drops to only 62%. Perhaps the most interesting observation from Fig. 8 is the counterintuitive behavior of the routing-dominated path delay; its delay increases with Vdd in the range above the nominal value (0.9 V). We have repeated this measurement multiple times to confirm that this is indeed the actual silicon behavior. We believe this behavior occurs because the gate terminals of pass transistors in a routing multiplexer are connected to a fixed voltage that is usually gate boosted.

Fig. 8. Measured Fmax versus Vdd on Stratix V.

Fig. 9 shows the simulated delay of the SB multiplexer and the slowest LUT input, normalized to their delays at nominal Vdd. The LUT delay sensitivity did not change much between the two FPGAs, so we only show one line for LUT delay. The simulated delay of the gate-boosted SB shows the same behavior as the measured delay of the routing-dominated path in Stratix V; as Vdd goes higher than the nominal value, the delay of the SB increases. However, the delay of the SB without gate boosting is monotone and does not show this behavior, confirming that this behavior is caused by the fixed gate-boosted voltage. Fig. 9 also confirms that the sensitivity of SB delay to Vdd is much lower than the LUT delay sensitivity for both FPGAs (with and without gate boosting). The LUT delay increased more than 6× when Vdd was scaled down from 0.8 to 0.6 V; this large delay increase limits the effectiveness of voltage scaling, as it greatly reduces the achievable Fmax when we reduce Vdd.
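
A plausible first-order way to see the non-monotone routing delay (our illustration; the article argues this only qualitatively) is that a pass transistor with its gate tied to a fixed boosted voltage slows as the required signal swing approaches the maximum level it can pass:

```latex
V_{\text{out,max}} = V_{gb} - V_{th}, \qquad
I_{\text{on}} \propto \left(V_{gb} - V_{\text{out}} - V_{th}\right)^{\alpha}
% With the gate fixed at V_{gb}, raising V_{dd} (the required output swing)
% toward V_{gb} - V_{th} shrinks the end-of-transition gate overdrive,
% so the tail of a rising transition slows even as the rest of the
% logic speeds up.
```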

Fig. 9. Normalized simulated delay of an LUT and an SB routing multiplexer for an FPGA with and without gate boosting.

Evaluating QoR of Existing P&R CAD Tools at Different Voltages:


Now that we have quantified the performance of low-level FPGA circuits with voltage scaling, our next step is to evaluate the quality of results (QoR) obtained from current FPGA placement and routing CAD tools across different supply voltages. Existing CAD tools assume that the FPGA will be supplied with a fixed Vdd, but with adaptive voltage scaling solutions this assumption is no longer true. In this section, we answer the following question: should current FPGA CAD tools take into account the different timing models (corresponding to different supply voltage values) during the placement and routing stages? To answer this question, we modified COFFE 2 to generate VPR (Versatile Place and Route) architecture files for our baseline FPGA at different Vdd values; the architecture files are identical except for the delays associated with each block. Using these files, we ran two VPR experiments: a Vnom-optimization experiment and a Vused-optimization experiment. In the Vnom-optimization experiment, all benchmarks were placed and routed using the nominal-Vdd architecture file only, and then we reran timing analysis using the other architecture files to obtain the critical path (CP) delay at each used Vdd value. In the Vused-optimization experiment, each benchmark was placed and routed by VPR several times, once for each used Vdd value with the architecture file containing the correct delays for that Vdd, so the algorithms optimizing the placement and routing had accurate delay information. Each placement and routing was timing analyzed at the corresponding used Vdd. The Vused-optimization experiment presents an upper bound on how much the QoR of existing CAD tools (VPR) could benefit from being aware of the different timing models that correspond to different supply voltage values.

Fig. 10 shows the normalized CP delays of a subset of the VTR benchmarks and the geometric mean (geomean) across all benchmarks obtained from both the Vnom-optimization and Vused-optimization flows. Each delay is normalized to the delay at nominal Vdd.

Fig. 10. Normalized critical path delay of a subset of the VTR benchmarks and the geometric mean (geomean) across all VTR benchmarks from the Vnom-optimization and Vused-optimization experiments (results are averaged over five seeds).

Although the physical CPs of the circuits from the Vnom-optimization and Vused-optimization flows are different, as shown in the figure, the delay geomeans of the two flows are very similar and within CAD noise margins. (Since the CAD tools use heuristics such as simulated annealing to find a solution, they do not guarantee finding the optimal placed and routed circuit; these heuristics result in variations in circuit quality, i.e., CAD noise.) This similarity implies that the placement and routing algorithms in CAD tools are not assisted by more accurate delay information specific to a certain voltage, probably because they attempt to optimize many near-critical paths, and so some timing error during optimization is tolerable. Hence, there is no need to change the placement and routing optimization algorithms to take into account the delays across a range of Vdd. Possible future work could explore whether other stages of the CAD flow (such as synthesis) could leverage the different timing models to achieve better results. As a result of the increase in routing delay when Vdd increases above the nominal value, the CP delay increases when Vdd is scaled above 0.8 V. Another interesting observation from Fig. 10 is that at 0.6 V the CP delays of different benchmarks increase versus their values at 0.8 V by widely varying amounts, ranging from 1.6× to 3×, with the geomean increasing to 2.4× the delay at nominal Vdd. Since results from both the Vused-optimization and Vnom-optimization flows show similar increases, we can conclude that this increase is not due to CAD tools focusing on minimizing the CP delay at nominal Vdd. To reduce the timing degradation at low supply voltages, we need to design circuits that are less sensitive to Vdd scaling.
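
As an illustration of the Vused-optimization flow, a driver script might look like the sketch below. The vpr command line and the parsed log line are assumptions based on typical VPR usage, and the architecture file names are hypothetical.

```python
import re
import subprocess

# One voltage-specific architecture file per used Vdd (hypothetical names).
ARCH_FILES = {0.8: "arch_0v8.xml", 0.7: "arch_0v7.xml", 0.6: "arch_0v6.xml"}

def run_vpr(arch_xml: str, circuit_blif: str) -> float:
    """Place & route with VPR and return the reported CP delay in ns."""
    out = subprocess.run(["vpr", arch_xml, circuit_blif],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"critical path.*?([\d.]+)\s*ns", out, re.IGNORECASE)
    if match is None:
        raise RuntimeError("could not find critical path delay in VPR log")
    return float(match.group(1))

# Vused flow: re-place-and-route the benchmark once per voltage.
cp_delay = {vdd: run_vpr(arch, "benchmark.blif")
            for vdd, arch in ARCH_FILES.items()}
```

The Vnom flow would instead place and route once with the nominal-Vdd architecture file and rerun only the timing analysis with the other files.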
Application Delay and Power Distribution at Different Voltages:
In this section, using our baseline FPGA and the Vnom-optimization (standard) flow explained above, we analyze how the delay and power of FPGA applications break down into contributions from the various blocks at different supply voltages.

1) Delay: Although Fig. 9 shows that the delay of the LUT is very sensitive to Vdd scaling, it does not show the impact of this sensitivity on the CP delay of applications. To evaluate this impact, we analyzed the CP delay distribution of all VTR benchmarks at different supply voltages. For each benchmark and supply voltage, we divide the delay into three portions: routing, LUT, and hard blocks. The routing portion includes the delays through SBs, CBs, and all local routing, and the hard block portion includes delays through DSPs and BRAMs. (Since this study focuses on the soft fabric of the FPGA, we assumed that the supply voltage of hard blocks does not scale with the soft fabric Vdd, so our VPR architecture files at different voltages have the same delays for hard blocks; modern commercial FPGAs use a separate supply voltage for BRAMs, so this assumption is true for BRAMs.) Fig. 11 shows the breakdown of the average CP delay at different supply voltages, normalized to the average CP delay at 0.8 V. At nominal Vdd (0.8 V), on average 75% of the CP delay is from routing, whereas LUT delays account for only 15% of the CP delay. However, at 0.6 V the LUT delay contribution increases to almost 50% of the CP delay. This large increase in LUT delay contribution highlights the importance of designing LUTs with delays that are less sensitive to Vdd scaling.

Fig. 11. Breakdown of the normalized CP delay between different FPGA circuits, averaged across the VTR benchmarks.

2) Power: To model the power consumption of the VTR benchmarks, we first measured (using HSPICE) the current drawn by every subcircuit (SB MUX, CB MUX, LUT, etc.) when its input toggles. Since the current drawn by an LUT depends on the function implemented by the LUT (the LUT mask) and the number of inputs toggling, we performed hundreds of simulations using random LUT masks and toggle patterns to measure the LUT power.

Fig. 12. Distribution of active power averaged over the VTR benchmarks.
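
The sketch below illustrates the random-mask characterization loop described above. simulate_lut_current is a hypothetical stand-in for launching an HSPICE run and returning the measured current; the sample counts are arbitrary.

```python
import random

def random_mask(k=6):
    """Draw a random LUT mask (64 bits for a 6-LUT)."""
    return random.getrandbits(2 ** k)

def average_lut_power(simulate_lut_current, n_masks=200, n_toggles=50):
    """Average a power estimate over random masks and single-input toggles."""
    samples = []
    for _ in range(n_masks):
        mask = random_mask()
        for _ in range(n_toggles):
            before = random.getrandbits(6)     # input vector before the toggle
            flip = 1 << random.randrange(6)    # toggle one random input
            samples.append(simulate_lut_current(mask, before, before ^ flip))
    return sum(samples) / len(samples)
```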

EXPLORING DIFFERENT LUT DESIGNS:
Decoding LUT Inputs (Decode LUT):
LUTs have traditionally been built using a tree-based multiplexer, with each level of the tree controlled
by a different input. We can instead decode groups of inputs and use the decoded values to control a single
level of pass transistors for that group of inputs. By reducing the number of pass transistors in series, input
decoding could yield an LUT with lower delay sensitivity to voltage. Fig. 14 shows part of a 3-input LUT
that decodes the slowest two inputs (a and b). In a 6-input LUT, by decoding the first two inputs, we are
able to remove the second level of the LUT shown in Fig. 5, which saves 32 pass transistors. However,
we must add 24 transistors for the decoding circuitry, resulting in a net reduction of eight transistors. We
implemented the decoding circuitry as NAND and inverter gates using CMOS logic, which is known to
be tolerant to voltage variation [48]. Using a NAND and an inverter allowed us to use minimum width
transistors for the NAND gate and size the inverter to drive the decoder load (16 pass transistors). Using
NOR gates for decoding instead would have resulted in adding fewer transistors but would have required
larger transistors in the NOR gate to drive its load of 16 pass transistors. Decoding the first three or four
inputs (instead of two) results in adding a net of 16 and 104 transistors to the LUT, respectively. The
number of transistors in the decoding circuitry grows exponentially with the number of decoded inputs,
so we designed an LUT that decodes the first two inputs only. Although one could decode any group of
inputs, we focus on the first (slowest) group of inputs as decoding them saves the most pass transistors.
In addition, both Intel’s and Xilinx’s fracturable 6-LUT architectures share at least the first two inputs
across both fractured 5-LUTs, so the same 2-input decoding circuitry could be shared between the
fractured LUTs. Decoding inputs also reduces the capacitive loading on the input drivers. Without
decoding, the driver of the first LUT input (a) is loaded by 32 pass transistors, but with decoding, it is
loaded by only two NAND gates. The load reduction enables a smaller driver and thus a smaller area.
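
A behavioral sketch of the two-input decoding idea (our illustration): the NAND-plus-inverter decoder asserts exactly one of four lines from inputs a and b, replacing the two series pass-transistor levels those inputs would otherwise control. The closing comment restates the article's transistor arithmetic.

```python
def decode2(a: int, b: int):
    """One-hot outputs of the 2-to-4 decoder (NAND followed by inverter)."""
    return [int(a == i and b == j) for i in (0, 1) for j in (0, 1)]

assert decode2(1, 0) == [0, 0, 1, 0]   # exactly one line asserted

# Transistor bookkeeping from the text, for a 6-LUT: decoding a and b
# removes the 32-transistor second tree level but adds 24 decoder
# transistors, for a net saving of 32 - 24 = 8 transistors.
```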
Gate-Boosted LUTs (GB LUTs):
Motivated by the observation that the delay of gate-boosted routing multiplexers is tolerant of voltage
scaling, our second idea is to gate boost the pass transistors of the LUT multiplexer. To achieve this, we
powered the LUT input drivers (shown in Fig. 5) by the gate-boosted SRAM voltage (1 V) instead of the
lower and varying logic supply voltage. Since the remaining logic and routing are still powered by the
lower supply voltage (Vddl), we modified the local multiplexer (MUX) to convert the voltage level from
(Vddl) to the higher SRAM voltage (Vddh). Fig. 15 shows two variations of the local MUX that are able
to convert the voltage from Vddl to Vddh. It is important to perform this conversion to prevent excessive
leakage current in the input drivers. Both local MUX variations implement a two-stage multiplexer
followed by an inverter. In the variation shown in Fig. 15(a), the (gray) inverter is powered by Vddl and
a conventional level shifter is used to convert the output of this inverter to Vddh so that it can be connected
to the input drivers. In the variation shown in Fig. 15(b), the (black) inverter is powered directly with
Vddh, but it does not leak because the weak pull-up is also connected to Vddh. This pull-up ensures that
when there is a logic high at the input of the inverter, the pMOS in the inverter is completely turned off.
Although the local MUX shown in Fig. 15(a) requires fewer transistors, our experiments showed that the delay of the local MUX in Fig. 15(b) is slightly lower.

Revisiting Transmission Gates (TG LUT):

The study in [26] showed that gate-boosted pass transistors result in an FPGA tile with a better area-delay
product (at nominal Vdd) than TGs. The use of TGs results in a better delay and better variable voltage
performance at the expense of more area. TGs are less sensitive to voltage variation as they can drive their
output to full Vdd without the use of gate boosting or weak pull-up transistors. Therefore, our third idea
is to explore using TGs in LUTs only, while still using pass transistors in all routing multiplexers. This
design enables better LUT delays, especially at lower voltages, without adding area to all the routing
multiplexers of the FPGA tile.

Hybrid LUTs (Decode-GB LUT and Decode-TG LUT):

We also investigated mixing our proposed LUT ideas together; we designed an LUT that decodes the slowest two inputs while using gate boosting (decode-GB LUT) and an LUT that decodes the slowest two inputs while using TGs (decode-TG LUT). The decode-GB LUT uses the local MUX with the level shifter shown in Fig. 15(a) and uses the decoding circuitry shown in Fig. 14. However, in this LUT the decoding circuitry is connected to the SRAM supply voltage to gate boost the first level of the LUT. Since the decode-TG LUT uses TGs to build the LUT, it requires the decoding circuitry to output each decoded value and its complement. To generate the decoded values and their complements, we explored the three different designs shown in Fig. 16. With designs d2 and d3, we can still size the NAND gate to use minimum-width transistors, at the expense of adding more inverters. Although d1 uses the same number of transistors as the decoder of the pass-transistor-based LUT shown in Fig. 14, it requires more area because it uses a larger NAND gate to drive the TGs loading it. Our experiments showed that d1 results in an FPGA tile with a 1.4% lower area-delay product compared to using d2 or d3, so we only show results for the decoder using d1.

Fig. 15. Variations of the local MUX to gate boost the LUT.

CHAPTER-4:

RESULTS

We enhanced COFFE 2 to support sizing an FPGA tile with our proposed LUT circuits. We used the enhanced COFFE 2 to compare the area and delay of an FPGA tile using our proposed LUT designs to those of the baseline FPGA tile, which was described in Chapter 3. Fig. 17 shows the areas of the proposed LUTs normalized to the baseline LUT. The figure shows that even after adding the decoding circuitry, the decode LUT is 9% smaller than the baseline. Using TGs in the LUT (TG LUT) increases the LUT area by 19%. However, with the decoding modification, the decode-TG LUT is only 8% larger than the baseline. The GB LUT is also 5% smaller than the baseline because gate boosting increases the drive of pass transistors, so COFFE 2 can size the LUT smaller and still achieve a reasonable delay through the LUT. The gate-boosted decoded LUT (decode-GB) has the smallest area of all the proposed LUTs; it is 15% smaller than the baseline. These results show that decoding the slowest two inputs of the LUT results in a considerable decrease in the area of the LUT.

Fig. 18 shows the delays of the proposed LUTs through the slowest input, normalized to that of the baseline LUT at 0.8 V. At nominal Vdd, the GB LUT has the lowest delay, which is almost 50% lower than the baseline LUT delay. The delay of the decode LUT is slightly less sensitive to Vdd than the baseline, whereas the delays of the TG LUT, GB LUT, and their decoded variants (decode-GB and decode-TG) show a much lower sensitivity. The delays of the TG LUT and GB LUT at 0.6 V are 1.8× and 3× their delays at 0.8 V, respectively. This increase is much smaller than the 6.7× increase in the baseline LUT delay when reducing Vdd from 0.8 to 0.6 V. The delay of the GB LUT increases as Vdd is increased above 0.8 V; this behavior is similar to the gate-boosted routing multiplexer behavior explained earlier. The decoded variants of the GB LUT and TG LUT have a slightly higher delay than the nondecoded variants across the 0.6–0.8 V Vdd range. Note that Fig. 18 shows the delay of the slowest LUT input. Although the delay of the LUT through the slowest input (A) of the decode-GB/decode-TG is slightly higher than that of the GB/TG LUT, the LUT delay through the fastest input (F) is almost the same.

To fairly evaluate our new proposed LUTs, we need to analyze the area and delay of the overall FPGA tile (logic and routing) and not just the LUT. The tile area is the summation of the areas of all subcircuits within the tile, including all routing multiplexers, LUTs, and SRAM cells. As explained earlier, the delay of the tile is a representative CP that is calculated as a weighted sum of different subcircuit delays. We multiply the area and delay together to obtain an efficiency metric, which matches the metric that COFFE seeks to minimize. Fig. 19 shows the area-delay products of different FPGA tiles using our proposed LUT designs at different Vdd values. The figure also shows the area-delay product of the baseline FPGA and an FPGA that uses TGs instead of pass transistors in all subcircuits (TG All). All values are normalized to the area-delay product of the baseline FPGA at nominal Vdd. Fig. 19 shows that at nominal Vdd the decode-GB LUT architecture achieves the best area-delay product, which is 12% lower than the baseline area-delay product. Interestingly, all of our proposed LUT designs result in FPGA tiles with a better area-delay product than the baseline design. The TG LUT, decode-TG LUT, GB LUT, decode-GB, and TG All architectures show the lowest delay sensitivity to voltage, making them good candidates for future FPGAs that operate across a range of voltages. However, due to the TG area overhead, TG All shows the worst area-delay product at nominal Vdd. From the figure, we can also observe that the decode-GB LUT slightly outperforms the GB LUT across all Vdd values.

Fig. 17. Areas of proposed LUTs normalized to that of the baseline LUT.

Fig. 19. Normalized tile area-delay product over LUT designs.

Area and Delay Results:

Table I shows the area-delay products of the proposed FPGA tiles across different Vddh and Vddl values, normalized to that of the baseline at nominal Vdd. Since our circuit designs restrict Vddh to be no lower than Vddl, the table only lists values that follow this restriction. Table I shows that adding the level shifter in the driver-island and driver-and-LUT-island FPGAs increases the area-delay product by only 1.5% to 2% over the baseline when Vddl = Vddh. However, as we scale Vddh higher than Vddl, the area-delay products of the *-island FPGAs become much lower than that of the baseline. As expected, for any Vddl the best area-delay product is achieved when Vddh is set to the highest value (1 V). The driver-and-LUT-island FPGA tile shows lower area-delay products than the driver-island, as its LUT is faster due to having the internal LUT buffers connected to Vddh. Although the decode-driver-island has the worst area-delay product when both Vddh and Vddl are at 0.6 V, it has the best area-delay product when Vddh is at 1 V for most Vddl values (all values except 0.6 V). Moreover, across all voltage pairs, the decode-driver-island achieves the lowest area-delay product of 0.88, at 0.8 V Vddl and 1.0 V Vddh. This area-delay product is the same as that of the decode-GB LUT FPGA at a Vdd of 0.8 V. While the area-delay products of all *-island FPGA tiles decrease as Vddh is increased, the largest decrease occurs when Vddh is scaled from Vddl to Vddl + 0.1 V. At 0.6 V Vddl, the area-delay product of the driver-island FPGA decreases by more than 45% when Vddh is scaled from 0.6 to 0.7 V, whereas it decreases by only 0.6% when Vddh is scaled from 0.9 to 1.0 V.

Power Results:

Using our power modelling methodology, we evaluated the power consumed by the VTR benchmarks on our proposed *-island FPGAs across different Vddl and Vddh pairs. Tables II and III show the geometric averages of E and ED² of the VTR benchmarks, respectively. For all FPGAs, the lowest E is achieved at the lowest Vddl, 0.6 V. Interestingly, two of the three architectures (driver-island and decode-driver-island) achieve slightly lower energy at Vddh = 0.7 V (with Vddl = 0.6 V) than they do at Vddh = 0.6 V. This is because the LUT consumes less energy per transition at Vddh = 0.7 V than at Vddh = 0.6 V. At Vddh = 0.7 V, the LUT delay decreases significantly, while the current drawn by the LUT does not increase much because the internal buffers of the LUT are still tied to the 0.6 V Vddl; since the energy consumed is the product of power and computation time, this results in an energy reduction per computation. The driver-island and decode-driver-island designs consume significantly less energy than the driver-and-LUT-island when Vddh is higher than Vddl, as they have less circuitry at the higher Vddh voltage. As the earlier results showed, these designs are almost as fast as the driver-and-LUT-island design, making them superior choices. For modest differences between Vddl and Vddh (0.1–0.2 V), the driver-island and decode-driver-island FPGAs also have energy consumption only moderately higher than the baseline architecture (which runs all circuitry at Vddl). The geomean energy consumed by the decode-driver-island at 0.6 V Vddl and 0.7 V Vddh is only 2% higher than the lowest energy of the baseline FPGA. Table III shows that the ability to scale Vddl and Vddh independently enables the *-island FPGAs to achieve a better ED² than the baseline when Vddl is at or below the nominal 0.8 V. At 0.6 V Vddl, the lowest ED² values of the driver-and-LUT-island, driver-island, and decode-driver-island are 54%, 56%, and 64% lower than that of the baseline at 0.6 V Vdd, respectively. Of all our proposed *-island FPGAs and the FPGAs with the different LUT designs (decode-TG and decode LUT) explored earlier, the decode-driver-island achieves the lowest ED² (when Vddl and Vddh are 0.7 and 0.9 V, respectively), which is 26% lower than the baseline at nominal Vdd. With our proposed resource-type-based voltage islands, FPGA users have more knobs to explore different performance-power operating points. For example, with the decode-driver-island FPGA, if for a certain duration the users are more interested in reducing energy consumption, they can set Vddl and Vddh to 0.6 and 0.7 V, respectively, to minimize the energy consumption. However, if users are more concerned about performance, they can set Vddl and Vddh to 0.8 and 1.0 V, respectively, to achieve the highest Fmax.

Similar to Fig. a, Fig. b shows the power and Fmax of different operating points of two of the VTR benchmarks on the baseline FPGA, the decode-TG LUT FPGA, and two of our *-island FPGAs. However, Fig. b only shows the Pareto-optimal operating points for each FPGA. Since the *-island FPGAs have two supply voltages, they have more possible operating points than the baseline and decode-TG LUT FPGAs. As shown in the figure, the driver-island FPGA outperforms the baseline FPGA; it can achieve a higher Fmax for the same power consumption or lower power consumption for the same Fmax. The decode-driver-island FPGA achieves the highest Fmax across all VTR benchmarks. Although the decode-TG LUT achieves lower power consumption for SV3, the decode-driver-island achieves a higher Fmax for a slight increase in power consumption. For the bm benchmark, the decode-driver-island also achieves higher Fmax with comparable power consumption to the decode-TG LUT.
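
To make the operating-point comparison concrete, the sketch below (our illustration, with made-up numbers) computes the ED² metric from a (power, Fmax) pair and filters a set of operating points down to the Pareto front, as plotted in Fig. b.

```python
def ed2(power_w: float, fmax_mhz: float) -> float:
    """Energy-delay-squared: E = P * D per computation, ED2 = E * D^2."""
    delay_s = 1e-6 / fmax_mhz        # CP delay D = 1 / Fmax
    return (power_w * delay_s) * delay_s ** 2

def pareto_points(points):
    """Keep (power, fmax) points not dominated in both power and Fmax."""
    return sorted((p, f) for p, f in points
                  if not any(p2 <= p and f2 >= f and (p2, f2) != (p, f)
                             for p2, f2 in points))

ops = [(1.0, 100), (1.2, 140), (1.5, 150), (2.0, 150)]  # (W, MHz), made up
print(pareto_points(ops))       # [(1.0, 100), (1.2, 140), (1.5, 150)]
print(ed2(*ops[1]))             # ED2 of the 1.2 W / 140 MHz point
```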

Fig. a. Power consumption of three VTR benchmarks on the different FPGA designs at different Vdd values.

Fig. b. Power consumption of two VTR benchmarks on the baseline FPGA, decode LUT FPGA, and two of our *-island FPGAs at the Pareto-optimal operating points.
CHAPTER-5:

CONCLUSION

While FPGAs are usually supplied with a fixed Vdd, recent academic and industrial efforts to reduce FPGA power consumption have shown that adaptive voltage scaling can reduce FPGA power consumption significantly. However, FPGAs have conventionally been designed for fixed-Vdd operation. In this work, we discovered that the FPGA LUT delay was extremely sensitive to voltage variation, and so we proposed several new LUT designs that were more tolerant of voltage scaling. We also showed that at low supply voltages the LUT delay contributed more than 50% of the CP delay, but the power consumed by LUTs accounted for only 20% of the total FPGA power consumption. Accordingly, we explored using separate supply voltages for the LUT and the remaining FPGA circuitry to enable users to operate their applications with different power-performance trade-offs. We showed that decoding the slowest two inputs of the LUT improved the area-delay product of the FPGA tile. Moreover, decoding significantly reduced the power consumed by the LUT. All our proposed LUT designs (TG LUT, GB LUT, and their decoded variants) resulted in an FPGA tile with a lower area-delay product than that of the baseline at nominal Vdd. Our fastest FPGA tile design was the decode-GB LUT, which we further enhanced by separating the LUT voltage from the remaining tile circuitry voltage to yield the decode-driver-island architecture. Compared to a baseline FPGA, our decode-driver-island design decreased the area-delay product of the FPGA tile by 12% at nominal Vdd (0.8 V) and by 52% at 0.6 V. Moreover, this design achieved a lower ED² than the baseline at nominal Vdd and below, with savings of 26% at the most efficient ED² operating point. Overall, this LUT design is clearly superior to the conventional pass-transistor design: it has better area-delay and ED² products at the nominal voltage at which FPGAs have usually operated, and much better area-delay and ED² products at the lower voltages that are increasingly important for controlling power consumption.

REFERENCES:

[1] J. Hoe, “Technical perspective: FPGA compute acceleration is first about energy efficiency,”
Commun. ACM, vol. 59, no. 11, p. 113, Oct. 2016.
[2] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, “High-level synthesis for
FPGAs: From prototyping to deployment,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.,
vol. 30, no. 4, pp. 473–491, Apr. 2011.
[3] Amazon. Amazon EC2 F1 Instances. Accessed: May 7, 2019. [Online]. Available:
https://aws.amazon.com/ec2/instance-types/f1/
[4] A. M. Caulfield et al., “A cloud-scale acceleration architecture,” in Proc. MICRO. Piscataway, NJ,
USA: IEEE Press, 2016, pp. 7:1–7:13.
[5] E. Chung et al., “Serving DNNs in real time at datacenter scale with project brainwave,” IEEE Micro,
vol. 38, no. 2, pp. 8–20, Mar. 2018.
[6] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Trans. Comput.-Aided
Design Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[7] A. Putnam et al., “A reconfigurable fabric for accelerating large-scale datacenter services,” in Proc.
ISCA, Jun. 2014, pp. 13–24.
[8] Wikipedia. (2019). List of CPU Power Dissipation Figures—Wikipedia, the Free Encyclopedia.
Accessed: May 2, 2019. [Online]. Available:
https://en.wikipedia.org/wiki/List_of_CPU_power_dissipation_figure
[9] Early Power Estimator for Intel Stratix 10 FPGAs User Guide, Intel, Santa Clara, CA, USA, 2019.
[10] D. Frank, R. Dennard, E. Nowak, P. Solomon, Y. Taur, and H.-S. P. Wong, “Device scaling limits
of Si MOSFETs and their application dependencies,” Proc. IEEE, vol. 89, no. 3, pp. 259–288, Mar. 2001.
[11] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Reading,
MA, USA: Addison-Wesley, 2010.
[12] D. A. Patterson and J. L. Hennessy, Computer Organization and Design MIPS Edition: The
Hardware/Software Interface, 5th ed. San Francisco, CA, USA: Morgan Kaufmann, 2013.
[13] Xilinx Power Estimator User Guide, Xilinx, San Jose, CA, USA, 2019.
[14] T. Tuan, A. Lesea, C. Kingsley, and S. Trimberger, “Analysis of within die process variation in 65
nm FPGAs,” in Proc. ISQED, Mar. 2011, pp. 1–5.
[15] K. Agarwal and S. Nassif, “Characterizing process variation in nanometer CMOS,” in Proc. DAC,
Jun. 2007, pp. 396–399.
[16] E. A. Stott et al., “Degradation in FPGAs: Measurement and modelling,” in Proc. FPGA. New York,
NY, USA: ACM, 2010, pp. 229–238.
[17] E. Stott, J. S. J. Wong, and P. Y. K. Cheung, “Degradation analysis and mitigation in FPGAs,” in
Proc. FPL, Aug. 2010, pp. 428–433.
[18] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger, “A 90 nm low-power FPGA for battery-powered applications,” in Proc. FPGA, 2006, pp. 3–11.

[19] J. Nunez-Yanez, “Adaptive voltage scaling in a heterogeneous FPGA device with memory and logic
in-situ detectors,” Microprocessors Microsyst., vol. 51, pp. 227–238, Jun. 2017.
[20] J. M. Levine, E. Stott, and P. Y. K. Cheung, “Dynamic voltage frequency scaling with online slack
measurement,” in Proc. FPGA, 2014, pp. 65–74.
[21] S. Zhao, I. Ahmed, C. Lamoureux, A. Lotfi, V. Betz, and O. Trescases, “A universal self-calibrating
dynamic voltage and frequency scaling (DVFS) scheme with thermal compensation for energy savings in
FPGAs,” in Proc. APEC, Mar. 2016, pp. 1882–1887.
[22] I. Ahmed, S. Zhao, O. Trescases, and V. Betz, “Measure twice and cut once: Robust dynamic voltage scaling for FPGAs,” in Proc. FPL, Aug. 2016, pp. 1–11.
[23] AN 711: Power Reduction Features in Intel Arria 10 Devices, Altera, San Jose, CA, USA, 2015.
