
ECT 393: FPGA BASED SYSTEM DESIGN

S5 HONOURS
MODULE 4

Module 4:
Placement and Routing: Programmable interconnect - Partitioning and Placement,
Routing resources, delays; Applications - Embedded system design using FPGAs,
DSP using FPGAs.
PROGRAMMABLE INTERCONNECTS
• A key element of an FPGA is the general-purpose programmable interconnect
interspersed between the programmable logic blocks.
• There are different types of interconnection resources in all commercial
FPGAs.
• Every vendor has its own specific names for the different types of
interconnects in their FPGA.
Interconnects in Symmetric Array FPGAs
• Many FPGAs use switch matrices to interconnect the routing wires attached to
them; this is the general-purpose interconnect.
General-Purpose Interconnect:
• A typical switch matrix has a switch at each intersection (i.e., wherever the lines cross).
• A switch matrix that supports every possible connection from every
wire to every other wire is very expensive.
• The connectivity is often limited to some subset of a full crossbar connection; moreover, not all
connections might be possible simultaneously.
• In the switch matrix illustrated in Figure (b), each wire from
a side of the switch can be routed to other wires using some combination of the switches.
• In order to support this type of connection, each cross point in the switch matrix must support six
possible interconnections, as shown in Figure (c).
• Depending on the programming technology, SRAM cells, flash memory
cells, or antifuse connections control the configuration of the switches.
• The switch matrices interspersed between the logic blocks in an FPGA
allow general-purpose interconnectivity between arbitrary points in the chip.
• However, the switch matrices are expensive in area and time (delay).
• If a signal passes through several of these switch matrices, the accumulated delay can be
significant.
• Moreover, the delays are variable and hard to predict, because they depend on the number of
switch matrices each signal passes through.

• Many FPGAs provide special connections between adjacent logic blocks.
These interconnects are fast because they do not go through the
routing matrix.
Direct Interconnects:
• Many FPGAs provide direct interconnections to the four nearest neighbors: top,
bottom, left, and right.
• In some cases, there are special interconnections to 8 neighboring blocks, including
the diagonally located logic blocks.
• The direct interconnections do not go through the switch matrix but are implemented
with dedicated switches resulting in smaller delays. These types of direct
interconnects are used in some Xilinx FPGAs.
Global lines:
• For purposes such as high fan-out and low-skew clock distribution, most
FPGAs provide routing lines that span the entire width or height of the device.
• Many FPGAs provide a limited number (two or four) of such global lines in
the horizontal and vertical directions.
INTERCONNECTS IN ROW-BASED FPGAS
• In devices that are row based, there are rows of logic blocks and there are
channels of switches to enable connections between the logic blocks.
• Several switches are used to route a signal from a logic block in one row to
another logic block elsewhere in the chip.
• There are arrays of switches in the routing channel between the rows of
logic.
• The interconnects in row-based channeled architecture can be classified into
two categories—non-segmented routing and segmented routing.
Non-segmented channel routing
• There are three horizontal rows or tracks in this figure.
• There are several vertical wires and switches at the crosspoints.
• The switches technically can use any programming technology (SRAM, EPROM, or
antifuse), although FPGAs that use this type of routing are typically antifuse FPGAs.
• Desired connectivity is obtained by programming the appropriate switches.
• Connectivity between the points marked “x” is obtained by programming the two switches at row 1,
columns 1 and 4. This connection is called net “x”; a net is simply a named wire.
• The connectivity for net “y” is obtained by programming the switches at row 2, columns 2 and 7.
Note that row 1 cannot be used for any connection other than net “x”.
• A problem with this type of interconnect resource is that a full-length track (i.e., an entire
row) is used even for a short net. The area overhead of this type of routing is therefore very high.
Segmented track routing
• In order to reduce the area overhead associated with using full-length tracks for
each net, we can use segmented tracks
• Instead of being full length, a track is divided into segments.
• If a track in row 1 is segmented into two segments, one could use the same track for one
more net.
• For example, nets “x” and “z” can both be routed on row 1 in Figure (c). That is the
principle of segmented track routing.
• More nets can be routed using the same number of tracks, so the overall routing resource
area is reduced.
• However, when long nets are needed, intersegment switches must be used to join the
segments, and these switches add resistance and capacitance to the net.
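The track saving from segmentation can be illustrated with a small sketch. It is not taken from any vendor tool: the function name `route_nets`, the 8-column row, and the column range chosen for net "z" are assumptions made for illustration; nets "x" and "y" use the columns given in the text. Each track is a list of segments, and a net may use a track only if every segment it touches is still free.

```python
def route_nets(nets, tracks):
    """Greedy first-fit assignment of nets to tracks (illustrative only).

    nets   : dict name -> (first_col, last_col), inclusive
    tracks : list of tracks; each track is a list of (first_col, last_col)
             segments that together tile the row
    Returns dict name -> track index, or None if the net cannot be routed.
    """
    used = [set() for _ in tracks]  # occupied segment indices per track
    result = {}
    for name, (lo, hi) in nets.items():
        result[name] = None
        for t, segments in enumerate(tracks):
            # Every segment overlapping the net's column span must be taken.
            needed = {i for i, (s, e) in enumerate(segments)
                      if s <= hi and e >= lo}
            if needed and not (needed & used[t]):
                used[t] |= needed
                result[name] = t
                break
    return result


# Nets x and y follow the text; the span of net z is assumed.
nets = {"x": (1, 4), "y": (2, 7), "z": (6, 8)}

# Non-segmented: every track is one full-length segment.
full = [[(1, 8)] for _ in range(3)]
# Segmented: track 0 is split into two segments (assumed split point).
segmented = [[(1, 4), (5, 8)], [(1, 8)], [(1, 8)]]

full_result = route_nets(nets, full)
seg_result = route_nets(nets, segmented)
print(full_result)
print(seg_result)
```

With full-length tracks each net occupies a track of its own, whereas with track 0 split in two, nets "x" and "z" share it. A net such as "y" that spans the split would need both segments of that track, joined by an intersegment switch.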
ACTEL ACT FAMILY INTERCONNECT

▪ The channel routing uses dedicated rectangular areas of fixed size
within the chip called wiring channels or just channels.
▪ The horizontal channels run across the chip in the horizontal direction.
▪ The vertical channels run over the top of the basic logic cells in the
vertical direction.
▪ Within the horizontal or vertical channels wires run horizontally or
vertically, respectively, within tracks.
▪ The channel capacity refers to the number of tracks it contains. Each
track holds one wire.
▪ In a channeled gate array the designer decides the location and length
of the interconnect within a channel. But, in an FPGA the interconnect
is fixed at the time of manufacture.
▪ Each Logic Module is provided with the input stubs and output stubs.
Interconnect architecture used in an Actel ACT family FPGA
Detailed view of the channel and the connections to each Logic Module
SEGMENTED CHANNEL ROUTING
▪ To allow programming of the interconnect, Actel divides the fixed
interconnect wires within each channel into wires of various
lengths called wire segments.

▪ This is referred to as segmented channel routing, a type of channel
routing.

▪ Antifuses join the wire segments. The designer then programs the
interconnections by blowing antifuses and making connections
between wire segments; unwanted connections are left
unprogrammed.

▪ A statistical analysis of many different layouts determines the
optimum number and the lengths of the wire segments.
ACT1 INTERCONNECTION ARCHITECTURE

Horizontal tracks

▪ The ACT1 interconnection architecture uses a total of 25 tracks
per channel (22 horizontal tracks per channel for signal routing +
3 tracks dedicated to VDD, GND, and the global clock (GCLK)).

▪ Horizontal segments vary in length from 4 columns of Logic
Modules to the entire row of modules (Actel calls these long
segments long lines).
Vertical tracks
▪ Each ACT1 logic module has 8 inputs ( 4 input stubs on top and 4 on bottom)

▪ 8 vertical tracks per LM are available for inputs (4 from the LM above the
channel and 4 from the LM below). These connections are the input stubs.

▪ The single LM output connects to a vertical track that extends across the 2
channels above the module and across the 2 channels below the module
(output stub). Since this is a dedicated connection, no antifuse is needed.

▪ Thus module outputs use 4 vertical tracks per module (counting 2 tracks from the
modules below, and 2 tracks from the modules above each channel).

▪ One vertical track per column is a long vertical track (LVT)
that spans the entire height of the chip (the Actel
1020 contains some segmented LVTs).

▪ There are thus a total of 13 vertical tracks per column
in the ACT 1 architecture (8 for inputs, 4 for outputs,
and 1 for an LVT).
ACT1 INTERCONNECTION ARCHITECTURE
▪ The ACT 1 devices are very nearly fully populated (an antifuse at
every horizontal and vertical interconnect intersection).

▪ If the Logic Module at the end of a net is less than two rows away
from the driver module, a connection requires 2 antifuses, 1
vertical track, and 2 horizontal segments.

▪ If the modules are more than two rows apart, a connection
between them will require a long vertical track (LVT) together with
another vertical track (the output stub) and two horizontal tracks.
Connecting these tracks requires a total of 4 antifuses in series,
which adds delay due to the resistance of the antifuses.
XILINX LCA
▪ The vertical lines and horizontal lines run between CLBs.
▪ The general-purpose interconnect joins switch boxes (also known
as magic boxes or switching matrices).
▪ The long lines run across the entire chip. It is possible to form
internal buses using long lines and the three-state buffers that are
next to each CLB.
▪ The direct connections (not used on the XC4000) bypass the
switch matrices and directly connect adjacent CLBs.
▪ The Programmable Interconnection Points (PIPs) are
programmable pass transistors that connect the CLB inputs and
outputs to the routing network.
▪ The bidirectional (BIDI) interconnect buffers restore the logic level
and logic strength on long interconnect paths.

Xilinx LCA interconnect. (a) The LCA architecture (notice the matrix element size is
larger than a CLB). (b) A simplified representation of the interconnect resources.
Each of the lines is a bus.
Components of interconnect delay in a Xilinx LCA array. (a) A portion of the interconnect
around the CLBs. (b) A switching matrix. (c) A detailed view inside the switching matrix
showing the pass-transistor arrangement. (d) The equivalent circuit for the connection
between nets 6 and 20 using the matrix. (e) A view of the interconnect at a Programmable
Interconnection Point (PIP). (f) and (g) The equivalent schematic of a PIP connection.
ALTERA MAX INTERCONNECT SCHEME
• Altera MAX 5000/7000 devices use a Programmable Interconnect
Array ( PIA ).
• The PIA is a cross-point switch for logic signals traveling between
LABs.
• The advantage of this architecture is that it uses a fixed number of
connections, so the routing delay is also fixed.
• Its simpler and more regular structure also improves the speed of
the placement and routing software.
• For example, the delay between LAB1 and LAB2 is the same as the delay
between LAB1 and LAB6.
ALTERA MAX 5000 AND 7000 INTERCONNECT SCHEME

A simplified block diagram of the Altera MAX interconnect scheme. (a) The PIA
(Programmable Interconnect Array) is deterministic: delay is independent of the
path length. (b) Each LAB (Logic Array Block) contains a programmable AND array.
2/25/2022 26
PLACEMENT AND ROUTING
Placement
• Determine which logic block within an FPGA should implement each of the logic
blocks required by the circuit.
• The physical assignment of all blocks on the target FPGA in a way that minimizes one
or more specific objective cost functions (e.g., wirelength, delay etc.).
Objective:
• Minimize the required wiring (wire-length driven placement)
• Balance the wiring density across the FPGA (routability-driven placement)
• Maximize circuit speed (timing-driven placement)
3 major placement algorithms:
– min-cut (partitioning-based) placement
– simulated annealing based placement
– analytic placement
PARTITIONING BASED ALGORITHM
• Also referred to as min-cut methods
• The partitioning-based placement can be realized as recursively calling the
partitioning process by picking a region containing some circuit modules,
dividing the region into a set of subregions, and assigning each module to one of
the subregions to optimize some predefined metric (e.g., wirelength and cut
size).
• The objective at each step is to minimize the number of nets cut by the boundary
between two partitions, which places highly connected blocks in the same partition.
• These procedures are recursively repeated until the number of modules in each
region is smaller than a threshold
• A net is said to be cut if it connects components in one region to
components in another region.
• The number of nets cut is the cut size.
• The advantage of partitioning-based placement algorithms is that
they run fast and scale well to large designs.
• As they use a divide-and-conquer strategy, where large problems
are divided into small sub-problems, partitioning-based methods
significantly reduce the problem search space.
• Quality is often limited by the lack of global information at
the top (coarse) level and the lack of flexibility at the bottom (fine)
level, especially for a design with large whitespace.
• Moreover, since the cut size is not an exact function of wirelength,
timing or routability, the quality is often not as good as that of other
placement strategies.
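The cut-size metric defined above is easy to compute. The following is a minimal sketch, not a full min-cut placer: the function name `cut_size` and the four-module netlist are hypothetical, chosen only to show how two bipartitions of the same circuit compare.

```python
def cut_size(nets, partition):
    """Count the nets whose modules fall into more than one region.

    nets      : dict net name -> list of module names on that net
    partition : dict module name -> region id
    """
    return sum(1 for modules in nets.values()
               if len({partition[m] for m in modules}) > 1)


# Hypothetical netlist with four modules and four nets.
nets = {
    "n1": ["a", "b"],
    "n2": ["b", "c", "d"],
    "n3": ["c", "d"],
    "n4": ["a", "d"],
}

# Two candidate bipartitions of the same modules.
grouped = {"a": 0, "b": 0, "c": 1, "d": 1}
scattered = {"a": 0, "c": 0, "b": 1, "d": 1}

print(cut_size(nets, grouped))    # nets n2 and n4 cross the boundary
print(cut_size(nets, scattered))  # every net crosses the boundary
```

A min-cut placer would prefer the first bipartition, then recurse on each region until the regions are small enough.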
SIMULATED ANNEALING
• Simulated annealing (SA for short) is an optimization method which provides
a probability-based mechanism for “uphill” moves (i.e., a state/solution with a
higher cost) to escape from being trapped in a local minimum, where the
probability depends on the magnitude of the “uphill” move and the total
search time.
• Because SA allows a state of a higher cost to replace its previous state (i.e., an “uphill”
move), SA can escape from a local minimum; SA often can find a high-quality solution.
• For an SA-based placer, a solution is often given by the assignment of physical
locations for all the modules, and its solution space is a collection of all the
feasible assignments.
• By changing the location(s) of one module or more, we can identify a
neighboring state (a new placement) which is evaluated by a predefined cost
function to determine whether this neighboring state is kept.
• The well-known Versatile Place and Route (VPR) is a classical SA-based FPGA
placer
• SA is typically general and robust, very suitable for handling a design with
multiple objectives
• However, SA is often time-consuming. As the design complexity increases, the
runtime of an SA-based FPGA placement might be prohibitively long.
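The annealing loop described above can be sketched as follows. This is an illustrative toy, not VPR's algorithm: the swap move, the half-perimeter cost, the geometric cooling schedule, and all parameter values and names (`anneal`, `t_start`, `alpha`, and so on) are assumptions. Uphill moves are accepted with probability exp(-delta / T), so they become rarer as the move gets worse and as the temperature falls over time.

```python
import math
import random


def hpwl(nets, pos):
    """Total half-perimeter wirelength of all nets."""
    total = 0
    for modules in nets:
        xs = [pos[m][0] for m in modules]
        ys = [pos[m][1] for m in modules]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(nets, start, t_start=5.0, t_end=0.01, alpha=0.9,
           moves_per_t=50, seed=0):
    """Toy SA placer: swap two modules, keep or reject the new state."""
    rng = random.Random(seed)
    pos = dict(start)
    cost = hpwl(nets, pos)
    best, best_cost = dict(pos), cost
    names = sorted(pos)
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            a, b = rng.sample(names, 2)
            pos[a], pos[b] = pos[b], pos[a]          # neighbouring state
            delta = hpwl(nets, pos) - cost
            # Always accept downhill; accept uphill with prob. exp(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best, best_cost = dict(pos), cost
            else:
                pos[a], pos[b] = pos[b], pos[a]      # undo the swap
        t *= alpha                                   # cool down
    return best, best_cost


# Hypothetical 2x2 example: four modules connected in a ring.
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
start = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
initial_cost = hpwl(nets, start)
best, best_cost = anneal(nets, start)
print(initial_cost, best_cost)
```

Recomputing the full cost after every move keeps the sketch short; a practical placer would update the cost incrementally, which is one reason runtime becomes a concern on large designs.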
ANALYTICAL PLACEMENT
• Recently, FPGA global placement (and even legalization) has been shifting
from simulated annealing to analytical formulations.
• Analytical placement computes the desired locations of modules under given
constraints with a mathematical formulation.
• The key issues lie in the analytical models of wirelength and in the integration and
optimization of the objective functions.
• Wirelength model:
• Half-perimeter wirelength (HPWL) of a net is the most popular wirelength model for
placement
• Mathematical formulation
• Quadratic wirelength models
• Squared Euclidean wirelength to approximate HPWL
• non-quadratic wirelength
• Weighted-average and log-sum-exponential models were proposed to approximate HPWL
• Quadratic models are intrinsically faster but less accurate, while non-quadratic models are
more accurate but slower.
• Integration:
• Placement needs to handle the simultaneous optimization of multiple objectives.
• As a result, it is desirable for an analytical formulation to integrate these objectives for
effective co-optimization.
• The most popular penalty method integrates two objective functions W and D (say,
wirelength and density, respectively) as W + λD, where λ is the penalty multiplier.
• λ is chosen so that the desired balance between W and D is achieved.
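The wirelength models and the penalty objective can be made concrete with a short sketch. The function names, the smoothing parameter `smooth`, and the bin-overflow density measure are assumptions for illustration; the log-sum-exponential expression is the commonly used smooth approximation of the max and min that appear in HPWL.

```python
import math
from collections import Counter


def hpwl(points):
    """Half-perimeter wirelength of one net given its pin coordinates."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def lse_wirelength(points, smooth=0.5):
    """Log-sum-exponential approximation of HPWL (differentiable)."""
    def span(vals):
        hi, lo = max(vals), min(vals)
        # smooth*ln(sum(exp(v/smooth))), shifted by hi for numerical stability
        up = hi + smooth * math.log(
            sum(math.exp((v - hi) / smooth) for v in vals))
        # smooth*ln(sum(exp(-v/smooth))), shifted by lo
        down = -lo + smooth * math.log(
            sum(math.exp((lo - v) / smooth) for v in vals))
        return up + down
    xs, ys = zip(*points)
    return span(xs) + span(ys)


def density_penalty(points, capacity=1):
    """Assumed density measure: squared overflow of unit grid bins."""
    bins = Counter((int(x), int(y)) for x, y in points)
    return sum(max(0, n - capacity) ** 2 for n in bins.values())


def objective(points, lam):
    """Penalty formulation W + lambda * D."""
    return hpwl(points) + lam * density_penalty(points)


net = [(0, 0), (3, 1), (1, 4)]            # hypothetical 3-pin net
exact = hpwl(net)
coarse = lse_wirelength(net, smooth=0.5)
fine = lse_wirelength(net, smooth=0.01)
crowded = [(0.2, 0.3), (0.7, 0.9), (2.0, 2.0)]
cost = objective(crowded, lam=2.0)
print(exact, coarse, fine, cost)
```

A smaller `smooth` value tracks HPWL more closely but makes the function harder to optimize, which mirrors the accuracy-versus-speed trade-off noted for the wirelength models.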
Ref 11

FPGA-Specific Placement Issues


• The number of routing tracks in each routing channel is fixed on an FPGA.
• A necessary condition for any feasible placement solution is that the channel density in every channel does not exceed the
number of routing tracks available in that channel.
• An FPGA contains routing tracks of various lengths.
• Simple interconnection delay estimation models based on net length or fanout are not accurate enough for use in
timing-driven FPGA placement algorithms.
• Fast and accurate interconnection delay computation methods are needed for timing-driven FPGA placement.
FPGA ROUTING
• Once the locations for all the logic blocks in a circuit have been chosen, a router
assigns the nets of the circuit to the routing segments on the FPGA and determines
which programmable switches should be turned on to connect the logic blocks as
required by the circuit.
• Because of the high complexity involved in routing, routing is usually performed in
two phases: global routing and detailed routing
• Global routing
• Assigns “loose paths” to the nets
• balances the densities of all routing channels
• performs a coarse route to determine, for each connection, the minimum
distance path through routing channels
• Detailed routing assigns each net to specific routing segments in the channels as
restricted by the global router.
• By keeping track of the usage of each routing channel, congestion is avoided
• Detailed routing
• The detail router determines for each two point connection the specific
wiring segments to use in the routing channel assigned by the global router.
• Design of detailed routing algorithms heavily depends on the FPGA routing
architecture.
• Row-based
• Segmented routing tracks
• Matrix-based
• Switches
[Figure: row-based routing architecture showing logic modules, horizontal segments, vertical segments, and tracks]
• Most commonly used algorithm : Maze routing
MAZE ROUTING
• Maze routing models the routing surface as a grid.
• Each grid point can be a terminal of a desired connection (known as either the source or target), a
wire that connects adjacent grid points, or an obstacle that represents space that is not available
for interconnections.
• The grid is described by a two dimensional array which records the state of each grid point
• The Lee algorithm for maze routing is popular because it is guaranteed to find a shortest-path
connection if one exists.
• This algorithm operates in three phases. During the expansion phase, the algorithm searches
outward from the source terminal while labeling each node with its distance from the source.
• When the target is reached, the backtrace phase selects a path by following decreasing label
values and marks these as wires (which act as obstacles for later routings).
• The cleanup phase erases unused expansion labels.
• Expansion phase
• Starts from the source S.
• The 4 neighbouring cells (up, right, down and left) are labelled 1 if they are empty.
• All empty neighbours of 1-cells are then labelled 2.
• The process is repeated, increasing the label each time, until the target T is reached.
• Traceback phase
• Traces back from T, starting with the smallest label adjacent to T and repeatedly
selecting a neighbour with a smaller label until S is reached.
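The three phases of the Lee algorithm can be sketched in a few lines. This is a minimal illustration under stated assumptions: the grid encoding (0 for a free cell, 1 for an obstacle or existing wire), the function name `lee_route`, and the example grid are hypothetical. The expansion labels live in a local dictionary, so the cleanup phase amounts to discarding that dictionary when the function returns.

```python
from collections import deque


def lee_route(grid, source, target):
    """Route one two-point connection on a grid; return the path or None.

    grid holds 0 for free cells and 1 for obstacles. Cells on the returned
    path are marked as 1 so that they block later nets.
    """
    rows, cols = len(grid), len(grid[0])

    def neighbours(r, c):
        return ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1))

    # Expansion phase: breadth-first wave labelling distance from the source.
    label = {source: 0}
    frontier = deque([source])
    while frontier and target not in label:
        r, c = frontier.popleft()
        for nr, nc in neighbours(r, c):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in label):
                label[(nr, nc)] = label[(r, c)] + 1
                frontier.append((nr, nc))
    if target not in label:
        return None  # no path exists

    # Backtrace phase: follow decreasing labels from the target to the source.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        for nb in neighbours(r, c):
            if label.get(nb) == label[(r, c)] - 1:
                path.append(nb)
                break
    path.reverse()

    # Mark the wire as an obstacle for subsequent nets.
    for r, c in path:
        grid[r][c] = 1
    return path


grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
path = lee_route(grid, (0, 0), (2, 3))
blocked = lee_route([[0, 1, 0]], (0, 0), (0, 2))
print(path, blocked)
```

Because the wave expands one step at a time, the first label that reaches the target is the shortest distance, which is why the backtrace always yields a shortest path when one exists.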
ROUTING WITH RIP-UP AND RE-ROUTE
• A maze router routes only one net at a time, in a prioritised order.
• Due to congestion caused by previously routed nets, a router may not be able
to find a path for a net to one or more destinations.
• The router can then enter rip-up mode.
• Previously routed nets that are blocking the net in progress are
temporarily ripped up to remove obstacles for the net in progress.
• Ripped-up nets can be rerouted immediately and/or assigned large
weights for priority to prepare for the next routing trial.
ROUTING RESOURCES
• The routing resources are composed of
• pre-fabricated wire segments
• programmable switches, where the wiring among logic modules and I/O cells
is user-programmable
• Routing resources of a Xilinx FPGA:
• Switch boxes
• allow wires to switch between vertical and horizontal wires
• Uses programmable switches inside a switch box
• Two features of SB:
1. Flexibility, Fs : defines for a wiring segment entering the S block the
number of other wiring segments it can be connected to
2. Topology: The pattern of connection
• The topology of the S blocks is very important since it is possible to
choose two different topologies with the same flexibility Fs that result
in very different routabilities
• For example, the figure shows that topology 1 cannot connect
wire A with wire B, whereas topology 2 can.
• Wire segments : distinguished by their relative segment lengths
• length-1 wires (Single-length lines)
• intended for relatively short connections among CLBs, and they span through
one CLB only
• form a grid of horizontal and vertical wires that intersect at switch boxes
• length-2 wires (Double-length lines )
• contain a grid of segments twice as long as the length-1 wires
• similar to the single-length lines, except that each one spans two CLBs,
offering lower routing delays for moderately long connections
• long wires
• grid of segments that run the entire length of a vertical or horizontal channel
• appropriate for connections that must reach several CLBs
with low skew
ROUTABILITY OF FPGA
• Routability :
• Measure of the probability that the placement and routing tool can
successfully complete the routing of the design
• Three dominant factors affecting routability
• γ : pins per logic cell ratio
• Amount of congestion or traffic in and out of a logic cell
• β : pins per net ratio
• Degree of branching of multipin connections
• L : average wire length
• Determined by placement and routing tools
• Average channel width requirement of the design, W:
$$W = \frac{1}{2}\,\gamma \left(1 + \frac{\beta - 2}{\beta}\right) L$$
• Good practice to avoid futile placement and routing effort is to check
whether the estimated channel width requirement exceeds the routing
capacity available
• Both γ and β are known after the technology mapping process in
design flow
• L can be estimated; it typically ranges from 1 to 2
EXAMPLE
• A design has an average pin-per-cell ratio of 6.15 and a pin-per-net ratio of
4.64. The average wire length can be taken as 1.5. Is it possible to route this
design in an XC 3200 LCA, which has 5 horizontal routing tracks?

• W=?
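The example can be checked numerically. The sketch below assumes the channel-width estimate reads W = (1/2) · γ · (1 + (β − 2)/β) · L; that reading, and the function name `channel_width`, are assumptions rather than something the example states explicitly.

```python
def channel_width(gamma, beta, avg_len):
    """Estimated average channel width W (assumed form of the estimate).

    gamma   : pins per logic cell
    beta    : pins per net
    avg_len : average wire length L
    """
    return 0.5 * gamma * (1 + (beta - 2) / beta) * avg_len


tracks_available = 5
w = channel_width(gamma=6.15, beta=4.64, avg_len=1.5)
routable = w <= tracks_available
print(round(w, 2), routable)
```

Under that reading the estimate comes to roughly 7.2 tracks, more than the 5 available, which would suggest the design is unlikely to route on this device without changes to the mapping or placement.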
DELAY
• Different parts of a circuit contribute to path delays : I/O pads, the logic blocks
and the interconnects

• Source and sink can be an I/O block or a logic block


• For Xilinx FPGAs, I/O blocks and logic blocks have constant delays
• I/O blocks : 15 ns
• Combinational logic blocks : 8 ns
• Routing delays vary quite a lot
ROUTING DELAY / INTERCONNECT DELAY
• Interconnection delays :
• Delay due to programmable interconnection points (PIPs)
• Wiring segments
• Signal restoring buffers
• Long lines

• A PIP is modelled as a resistor
• A wire segment is modelled as a simple RC network
• Long lines are modelled with pull-up and pull-down resistors, as shown in the figure

DELAY CALCULATION
• 40-60% of net delay is due to routing delay
• Delay of FPGA design can be calculated using delay calculators
• Example : xdelay, xact
• Elmore delay model
• simple approximation to the delay through an RC network in an
electronic system.
• Used to compute delays in tree structured networks
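The Elmore delay to a node in an RC tree is the sum, over every resistor on the path from the driver to that node, of the resistance times the total capacitance downstream of it. The following sketch applies that rule to a small hypothetical tree; the function name, the tree encoding, and the component values (taken to be in kΩ and pF, so that the results are in ns) are assumptions for illustration.

```python
def elmore_delays(tree):
    """Elmore delay from the driver to every node of an RC tree.

    tree : dict node -> (parent, r_from_parent, c_at_node)
           The root has parent None; its resistance is the driver resistance.
    """
    children = {n: [] for n in tree}
    for n, (parent, _, _) in tree.items():
        if parent is not None:
            children[parent].append(n)

    def downstream_cap(n):
        # Capacitance at n plus everything hanging below it.
        return tree[n][2] + sum(downstream_cap(c) for c in children[n])

    def delay(n):
        parent, r, _ = tree[n]
        upstream = delay(parent) if parent is not None else 0.0
        # This resistor charges all capacitance downstream of it.
        return upstream + r * downstream_cap(n)

    return {n: delay(n) for n in tree}


# Hypothetical tree: node 1 is driven through R=1 and fans out to 2 and 3.
tree = {
    1: (None, 1.0, 1.0),
    2: (1, 2.0, 2.0),
    3: (1, 3.0, 3.0),
}
delays = elmore_delays(tree)
print(delays)
```

In this tree the shared resistor sees all 6 units of capacitance, so node 1 has a delay of 6; each branch then adds its own resistance times its own capacitance, giving 10 at node 2 and 15 at node 3.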
ELMORE DELAY EXAMPLE 1

• Elmore delay for each of the nodes


ELMORE DELAY EXAMPLE 2
FPGA APPLICATIONS
• FPGAs can be used in almost all of the applications that currently use Mask-Programmed Gate
Arrays, PLDs and small scale integration (SSI) logic chips. A few categories of such designs are
listed below:
• Application-Specific Integrated Circuits (ASICs) :
• An FPGA is a completely general medium for implementing digital logic. FPGAs are particularly suited
for implementation of ASICs. Some examples of such use that have been reported are: a 1 megabit
FIFO controller, an IBM PS/2 Micro Channel interface, a DRAM controller with error correction, a
printer controller, a graphics engine, a T1 network transmitter/receiver as well as many other
telecommunications applications, and an optical character recognition circuit.
• Implementation of Random Logic :
• Random logic circuitry is usually implemented using PALs. If the speed of the circuit is not of critical
concern (PALs are faster than most FPGAs), such circuitry can be implemented advantageously with
FPGAs. Currently, one FPGA can implement a circuit that might require ten to twenty PALs. In the
future, this factor will increase dramatically.
• Replacement of SSI Chips for Random Logic :
• Existing circuits in commercial products often include a number of SSI chips. In many cases these
chips can be replaced with FPGAs, which often results in a substantial reduction in the required area
on circuit boards that carry such chips.
• Prototyping
FPGAs are almost ideally suited for prototyping applications. The low cost of
implementation and the short time needed to physically realize a given
design, provide enormous advantages over more traditional approaches for
building prototype hardware. Initial versions of prototypes can be
implemented quickly and subsequent changes in the prototype can be done
easily and inexpensively.
• Rapid prototyping of large systems is done by using boards with multiple FPGAs and
plugging multiple boards into a backplane (motherboard).
• FPGA-Based Compute Engines
A whole new class of computers has been made possible with the advent of in-circuit re-programmable FPGAs.
These machines consist of a board of such FPGAs, usually with the pins of neighboring chips connected. The idea
is that a software program can be "compiled" (using high-level, logic-level and layout-level synthesis techniques,
or by hand) into hardware rather than software. This hardware is then implemented by programming the board of
FPGAs. This approach has two major advantages: first, there is no instruction fetching as required by traditional
microprocessors, as the hardware directly embodies the instructions. This can result in speedups of the order of
100. Secondly, this computing medium can provide high levels of parallelism, resulting in a further speed increase.
The Quickturn company provides such a product tuned towards the simulation and emulation of digital circuits. Also,
Algotronix Ltd. sells a small add-in board for IBM PCs that can perform this function. At the research level, the
Digital Equipment Corporation in Paris has achieved performance ranging from 25 billion operations per second up
to 264 billion operations per second on applications such as RSA cryptography, the discrete cosine transform, Ziv-
Lempel encoding and 2-D convolution, among others.
• On-Site Re-configuration of Hardware
FPGAs are also attractive when it is desirable to be able to change the structure of a given machine that is already in
operation. One example is computer equipment in a remote location that may have to be altered on site in order to
correct a failure or perhaps a design error. A board that features a number of FPGAs connected via a programmable
interconnection network allows a high degree of flexibility in augmenting the functional behavior of the circuitry
provided by the board. Note that the most suitable type of FPGA for this kind of application is one that contains re-
programmable switches.
• FPGAs As Final Product in Medium-Speed Systems
Circuits realized using FPGAs typically operate at clock rates in the 150–200 MHz
range. For applications where this speed is sufficient, FPGAs can be used for
the final product itself as opposed to the prototype. When an FPGA is used as
the final product, enhancements to the system can be made as software
updates rather than as hardware changes. Modern FPGA speeds are adequate for many
applications.
• Glue Logic
FPGAs have become the medium of choice for implementing interface or glue logic between
modules and components. Small changes in interface protocols or formats would conventionally
necessitate building new interface logic. With SRAM FPGAs, the new interface logic can be
implemented on the same FPGA as in a software update.
• Hardware Accelerators/Coprocessors
A software application running on a conventional system can be accelerated if a
coprocessor/accelerator can implement some key routines/kernels from the application in hardware.
An FPGA can be used to implement the key kernel. An SRAM-based, reconfigurable FPGA is well
suited for this use, because depending on the application running, different kernels can dynamically
be programmed into the FPGA. This approach has been demonstrated for applications such as
pattern matching. FPGA-based hardware is used for several applications, including computer
architecture simulator acceleration, emulation boards, and hardware test/verification, among others.
