The DSP Primer 8: FPGA Technology
FPGA Technology
At the end of the section, the following will have been covered:
• FPGA Technology Roadmap and the various devices available - how FPGAs are progressing and what might lie ahead;
• Performance and flexibility - how FPGAs compare to DSP Processors and ASICs and why FPGAs have the advantage;
• FPGA Structure - a top down look at what an FPGA consists of down to the low level elements;
• How the digital logic of a design actually operates within the FPGA;
• Why pipelining and flip-flops/registers are “free” and are required for high clock rates;
• Memory available to designers within FPGAs and the different types/options available;
• How signals and clocks are effectively routed throughout the device;
FPGAs are being incorporated as central processing elements in many applications such as consumer
electronics, automotive, image/video processing, military/aerospace, base-stations, networking/
communications, supercomputing and wireless applications.
The inclusion of embedded (i.e. actually present in silicon - not as soft IP) PowerPC processors in recent Xilinx
devices makes design partitioning and implementation much easier. Many low-speed algorithms that involve a lot
of decision making and jumps in execution are better suited to implementation on a microprocessor than in FPGA logic.
The inclusion of the PowerPC blocks by Xilinx is an acknowledgement of this and goes a long way towards making
the “System on an FPGA” goal possible.
Manufacturers may also provide embedded processors as “soft” IP cores. These cores are implemented on the
main programmable logic fabric and associated development kits allow designers to write code to be executed.
Features such as dedicated arithmetic hardware, clock management and multi-standard, high speed I/O blocks
all assist the engineer in implementing a given design. Problems that plague ASIC (Application Specific
Integrated Circuit) designers, such as clock skew, have largely been solved by the FPGA manufacturer and can
be essentially ignored by the FPGA engineer.
FPGA Families 8.3
• Flagship families are the biggest and most expensive and are not
aimed at high volume applications where cost is a primary factor;
• High volume devices often contain the same features offered by the
flagship devices at a smaller scale to control costs;
Often, low cost, high volume FPGA families are derived directly from larger families, making the design process
more familiar (e.g. Spartan-3 from Virtex-II, Spartan-II from Virtex).
Each FPGA family comes in different sizes/packages and speed grades. The exact device required will depend
on factors related to requirements of the target design/application such as:
• Area;
• Data/sampling rates;
• Memory required;
• Cost ($$$).
FPGA Performance and Flexibility (I) 8.4
• DSP Processors still have their place though - their design flow is better
understood within the engineering community and some baseband
algorithms do not yet map well to the FPGA fabric;
MIPS (Millions of Instructions Per Second or perhaps Meaningless Indicator of Processor Speed) is often used to
compare DSP Processors but cannot be used to quantify overall FPGA performance.
The problem is that FPGAs are flexible enough to implement algorithms in different ways to suit the
requirements of a particular application.
For example, an application that requires 10 MACs (Multiply Accumulates) can be implemented on an FPGA
or a DSP processor. The FPGA could implement the hardware to perform the 10 MACs one after the other in
serial taking 10 clock cycles or in parallel, taking 1 clock cycle. Indeed it is possible to perform the 10 MACs in
5 clock cycles, or 2 clock cycles - as required. A DSP Processor does not have as much flexibility.
Why is this flexibility useful? If the 10 MACs must be performed quickly, the FPGA can use a lot of area and
perform them in parallel in 1 clock cycle. If the 10 MACs can be done slowly (as defined by the system
performance requirements), the FPGA can perform them serially using roughly a 10th of the area but
taking 10 clock cycles - i.e. the FPGA hardware implementation can be tailored to the application and take
advantage of the application requirements/specification.
In this way, speed and area can be traded when implementing on FPGA - DSP processors do not have this
option.
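The speed/area trade described above can be sketched numerically. This is a minimal model, not a description of any real device: the function name mac_schedule is hypothetical, area is counted simply as the number of MAC units instantiated, and control logic and routing overhead are ignored.

```python
import math


def mac_schedule(num_macs, num_units):
    """Return (clock_cycles, relative_area) for a hypothetical MAC array.

    Assumption: each MAC unit completes one multiply-accumulate per clock
    cycle, and area is counted as the number of MAC units instantiated.
    """
    if num_units < 1 or num_units > num_macs:
        raise ValueError("num_units must be between 1 and num_macs")
    cycles = math.ceil(num_macs / num_units)
    return cycles, num_units


# The 10-MAC example: fully serial, two intermediate points, fully parallel.
for units in (1, 2, 5, 10):
    cycles, area = mac_schedule(10, units)
    print(f"{units:2d} MAC unit(s): {cycles:2d} cycle(s), relative area {area}")
```

With 1, 2, 5 and 10 units the model gives 10, 5, 2 and 1 clock cycles respectively, which are the operating points mentioned in the text.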
It should also be noted that it is very unlikely that anyone would ever implement an FPGA design that consisted
only of multipliers! Figures given by manufacturers are merely intended to give an idea of the potential
performance of these devices and by how far they outperform DSP Processors (considerably!)
FPGA Performance and Flexibility (II) 8.5
It must be remembered that an FPGA is still an ASIC. Manufacturers such as Xilinx produce FPGAs as fully
custom integrated circuits, even though they are a special case because they are highly programmable.
DSP Processors are also ASICs, so as ASIC process technology improves and chips get faster, DSP
Processors will get faster, but so will FPGAs. FPGAs already hold a performance advantage over DSP
Processors and this gap is not expected to close as silicon processes improve.
Diagram: “FPGAs: DSP for Consumer Digital Video Applications”, Xilinx, http://www.xilinx.com/esp/dvt/collateral/fpga_dsp_adv_in_dvt.pdf
FPGA Performance and Flexibility (III) 8.6
The diagram is rather hand-wavy, but it gives an indication of where FPGAs lie in the grand scheme of things in
relation to Custom ICs (ASICs) and DSP Processors.
The surge in FPGA use by manufacturers of electronic systems does, however, suggest that the diagram is close
to the mark.
The costs and time involved in manufacturing ASICs are prohibitive (especially if bugs are found) when a
designer can have a design running in hardware on an FPGA at their desk and iterate the design as many times
as required with no expensive fabrication in sight!
Diagram: “FPGAs: DSP for Consumer Digital Video Applications”, Xilinx, http://www.xilinx.com/esp/dvt/collateral/fpga_dsp_adv_in_dvt.pdf
FPGA Design Flow 8.7
• This is a highly simplified overview
of the Xilinx FPGA design flow;
A more detailed design flow is given below - this doesn’t even show all of the possible stages although it does
contain most! It may become clear why the FPGA design flow produces so many files and directories when
you consider all of the processes below. Several stages are grouped/automated and can be run by the high-
level software tools if desired. The engineer usually has the option of running each stage manually however!
An FPGA is rather abstract looking and it may not appear obvious how a user design maps to the actual
hardware. Luckily, the software tools can take care of a lot of the complexity of doing this once the user has
defined their design. There is still a considerable amount of work for the engineer however and this is especially
true when pushing the limits of the hardware - at this point the software tools may not do a good enough job
and the engineer must get in and around the “nuts and bolts” themselves!
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Configurable Logic Blocks 8.9
• One Xilinx Virtex-II CLB contains
four slices (Virtex/Spartan series
have two slices per CLB);
• The Cin and Cout signals are significant because they are highly
useful for implementing arithmetic functions. Two independent Cin/
Cout columns exist per CLB column;
• One slice can implement a 2-bit full adder so one CLB can implement
two independent 4-bit full adders as part of a larger bit-width
calculation with other CLBs as required.
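The role of the Cin/Cout chain can be illustrated with a small behavioural model. This is only a sketch: the function names are hypothetical, each "slice" is modelled as a 2-bit adder as stated above, and nothing here reflects the actual carry logic primitives inside the device.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def slice_add2(a2, b2, cin):
    """Model of one slice adding two 2-bit values; returns (sum2, cout)."""
    s0, c0 = full_adder(a2 & 1, b2 & 1, cin)
    s1, c1 = full_adder((a2 >> 1) & 1, (b2 >> 1) & 1, c0)
    return s0 | (s1 << 1), c1


def carry_chain_add(a, b, width):
    """Add two unsigned width-bit values by chaining 2-bit slices.

    The Cout of each slice feeds the Cin of the next, as in a carry column.
    Returns (sum modulo 2**width, final carry out).
    """
    if width % 2:
        raise ValueError("width must be a multiple of 2 bits")
    result, carry = 0, 0
    for i in range(0, width, 2):
        s2, carry = slice_add2((a >> i) & 0b11, (b >> i) & 0b11, carry)
        result |= s2 << i
    return result, carry


# A 4-bit addition uses two chained slices: 11 + 6 = 17 = 0b1_0001.
print(carry_chain_add(0b1011, 0b0110, 4))  # (1, 1)
```

Widening the adder only means chaining more slices, which is how a larger bit-width calculation spans several CLBs.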
Once the user has entered their design (via VHDL/Verilog for example), the “Synthesis” process takes the
design and works out how to implement it on the elements of a specific FPGA. The engineer specifies exactly
which device to target (i.e. manufacturer, family, size, package type, speed grade). Synthesis is a
complex process that turns any synthesisable VHDL/Verilog into a form that further software tools can take
to the FPGA.
In the case of Xilinx, the Synthesis tool will decide how to perform the digital logic operations of the design using
the slice logic available. The FPGA manufacturer tools then take the design through many more stages in order
to get the design into a form from which a bitstream is produced that can be downloaded to an FPGA to
configure it.
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slices (I) 8.10
• The majority of user-design
functionality will be implemented by
the slices contained by the CLBs;
Xilinx slices are where the actual “work” that implements the user design happens. The different elements can
be interconnected in different ways as determined by the configuration bitstream.
The number of slices available on a device essentially determines its capacity since this is where it all happens!
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slices (II) 8.11
• The registers provide the means of
implementing synchronous logic;
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slice (top half) 8.12
The software tools will take care of configuring every required element/connection - the user can also do so
manually if required!
When the FPGA is configured with a bitstream (generated by the software tools), the contents of the LUTs and
the routing between the slice elements are defined - forming the user design. The bitstream will also configure
the connections between slices/CLBs etc.
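To make "the contents of the LUTs" concrete, the sketch below models a LUT as a small truth table whose bits would be loaded at configuration time. It assumes a 4-input LUT holding 16 bits; the helper names make_lut4 and lut4_read are hypothetical, and the bit ordering is chosen for illustration rather than taken from any bitstream format.

```python
def make_lut4(func):
    """Build the 16-bit contents of a 4-input LUT from a Boolean function.

    Assumption: bit `addr` of the result holds the output for the input
    combination whose binary value is `addr` (A0 is the least significant).
    """
    init = 0
    for addr in range(16):
        bits = [(addr >> i) & 1 for i in range(4)]  # bits[0] is input A0
        if func(*bits):
            init |= 1 << addr
    return init


def lut4_read(init, a0, a1, a2, a3):
    """Look up the output bit: the four inputs form an address into the table."""
    addr = a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)
    return (init >> addr) & 1


# The same hardware implements different logic purely through its contents.
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(f"AND4 contents: 0x{and4:04X}")  # 0x8000
print(f"XOR4 contents: 0x{xor4:04X}")  # 0x6996
```

Any Boolean function of four inputs costs the same single LUT in this model, regardless of how complicated the expression looks.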
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Registers and Pipelining 8.13
• The achievable FPGA clock rate is limited by the longest path between registers: the longer the path, the
further signals must travel through LUTs/wires before the next clock edge;
• Using the “free” slice registers keeps the longest path as short as
possible and hence the possible clock rate as high as possible.
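Diagram (not reproduced): Without Pipelining - three LUTs in series between two registers (D Q); with pipelining -
a register (D Q) before and after each of the three LUTs, allowing a Fast Clock.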
This is one of the fundamental design principles of FPGA design and must be understood.
On each clock edge, signals must travel through their data path via routing lines, LUTs, MUXes etc. before
arriving at the next flip-flop. This happens to signals within a design all over the device on every clock edge.
Some signals will have further to travel than others and the longest (time) path between two flip-flops/registers
is known as the “critical path”. It should be noted that the flip-flops are essentially free because every LUT is
paired with a flip-flop that can register the LUT output as required.
It is this critical path that determines the maximum rate at which the FPGA design can be clocked. The user
is free to choose the clock rate, but only up to this maximum. If the critical path is too long, the design
cannot be clocked fast enough to meet the specification of the application!
In this case, the engineer must return to the software tools or their design and try to make the design run faster.
This may be achieved by, for example, pipelining, redesign, increasing the effort level of the software tools,
adding/removing design constraints or manually editing the design in order to optimise the hardware and reduce
the length of the critical path!
It should be noted that this is the most difficult part of FPGA design - what to do if a design does not meet timing!
There are many options for the engineer to try and knowing which one(s) to use (and how to use them) can be
a bit of a black art...
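The link between the critical path and the maximum clock rate can be shown with a back-of-the-envelope calculation. The delay figure below is entirely hypothetical (2.0 ns per LUT including its routing), and register setup and clock-to-output times are ignored for simplicity.

```python
def max_clock_mhz(stage_delays_ns):
    """Maximum clock rate (MHz) set by the slowest register-to-register path."""
    critical_path_ns = max(stage_delays_ns)
    return 1000.0 / critical_path_ns


# Hypothetical figure: each LUT plus its routing costs 2.0 ns.
lut_delay_ns = 2.0

# Without pipelining: three LUTs sit between one pair of registers.
unpipelined = [3 * lut_delay_ns]

# With pipelining: a register after every LUT gives three short paths.
pipelined = [lut_delay_ns, lut_delay_ns, lut_delay_ns]

print(f"Unpipelined: {max_clock_mhz(unpipelined):.1f} MHz")  # 166.7 MHz
print(f"Pipelined:   {max_clock_mhz(pipelined):.1f} MHz")    # 500.0 MHz
```

In this model the pipelined version can be clocked three times faster, at the cost of the result appearing three clock cycles after the input rather than one.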
Xilinx Virtex-II Block RAM 8.14
• Xilinx Virtex-II devices have
dedicated 18 Kb (Kilo-bit) Block
RAMs throughout the device;
Engineers specify how they want to use the RAM components from within their VHDL/Verilog code - the
software tools then ensure that the actual hardware is made available to the design.
An example of using Block RAM could be to store the numeric values required to modulate a signal by a sine
wave.
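A sketch of the sine-table example follows. The table organisation is an assumption made for illustration: 1024 entries of 18 bits each, which exactly fills 18 Kb. The function names and the phase_step parameter are hypothetical, and the code generates the values that would be stored rather than describing how the tools initialise the memory.

```python
import math

DEPTH = 1024  # assumed number of table entries
WIDTH = 18    # assumed word width in bits (1024 x 18 = 18,432 bits = 18 Kb)


def sine_table(depth=DEPTH, width=WIDTH):
    """One full sine cycle as signed integers that fit in `width` bits."""
    full_scale = (1 << (width - 1)) - 1  # largest positive two's complement value
    return [round(full_scale * math.sin(2 * math.pi * n / depth))
            for n in range(depth)]


def modulate(samples, table, phase_step):
    """Multiply each input sample by the table value at an advancing phase."""
    out, phase = [], 0
    for x in samples:
        out.append(x * table[phase])
        phase = (phase + phase_step) % len(table)
    return out


table = sine_table()
print(table[0], table[256], table[768])  # 0 131071 -131071
```

A larger phase_step reads through the table faster and therefore produces a higher modulating frequency from the same stored values.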
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Distributed RAM 8.15
An example of using a small distributed RAM could be a chipping sequence for use in a communications
system. The sequence would be stored where it is needed to “chip” data as it proceeds through the system.
The ability to form larger single/dual port configurations from the smaller ones is further testament to FPGA
flexibility - distributed RAMs need only be as large as required.
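The chipping example can be sketched as follows. The 16-chip sequence is made up for illustration (16 entries being an assumed size for one small distributed RAM), the XOR-based spreading scheme is one common choice rather than anything specified here, and the function names are hypothetical.

```python
def spread(data_bits, chips):
    """Spread each data bit over the whole chipping sequence using XOR."""
    return [bit ^ chip for bit in data_bits for chip in chips]


def despread(chip_stream, chips):
    """Recover data bits by XOR with the same sequence and a majority vote."""
    n = len(chips)
    bits = []
    for i in range(0, len(chip_stream), n):
        block = chip_stream[i:i + n]
        ones = sum(c ^ chip for c, chip in zip(block, chips))
        bits.append(1 if ones > n // 2 else 0)
    return bits


# Hypothetical 16-chip sequence, as it might be held in a small RAM.
CHIPS = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0]

data = [1, 0, 1]
tx = spread(data, CHIPS)
print(len(tx))              # 48 chips for 3 data bits
print(despread(tx, CHIPS))  # [1, 0, 1]
```

Because the sequence is only read, never modified, during operation, it suits a small memory placed right beside the logic that uses it.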
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Shift Registers 8.16
• Xilinx LUTs can implement a 16-bit shift register (called an SRL16)
and when combined with the register available to every LUT, 17
delays are possible in one half of a slice;
• The delay can be tapped at any point using the address lines to create
delay lines of length less than the maximum.
Diagram (not reproduced): SRL16 Shift Reg with data ports D/Q, address lines A3-A0 and CLK.
The diagram opposite shows SRL16s being cascaded to form a larger delay line.
Note the flexibility of the Xilinx LUTs - shift register is the third mode they can operate in,
in addition to LUT and RAM.
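The tappable delay line can be modelled behaviourally. This is a simplified sketch, not a model of the real primitive's timing: the class and function names are hypothetical, and the output for each cycle is taken as the register contents just before that cycle's clock edge.

```python
class SRL16:
    """Behavioural model of a 16-bit addressable shift register."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] is the most recently shifted-in value

    def clock(self, d):
        """Shift in one bit on a clock edge."""
        self.bits = [d & 1] + self.bits[:-1]

    def read(self, addr):
        """Tap selected by address lines A3..A0."""
        return self.bits[addr & 0xF]


def delay_line(samples, addr):
    """Run a bit stream through the model; the delay is addr + 1 clocks."""
    srl = SRL16()
    out = []
    for d in samples:
        out.append(srl.read(addr))  # value visible before this clock edge
        srl.clock(d)
    return out


print(delay_line([1, 0, 0, 0, 0], 0))  # [0, 1, 0, 0, 0] - delay of 1
print(delay_line([1, 0, 0, 0, 0], 3))  # [0, 0, 0, 0, 1] - delay of 4
```

Address 15 gives the full 16-clock delay; following the shift register with the LUT's own flip-flop would add one more, giving the 17 delays mentioned above.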
Diagram opposite: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-4 DSP48 Slice 8.17
• The Xilinx Virtex-4 DSP48 slice offers custom DSP functionality;
• 500 MHz throughput;
• However, the Transposed/Systolic FIR structures map more effectively in this case;
• Summation feedback is also available for serial implementations;
Diagram (not reproduced): Full-Parallel Transposed FIR; Full-Parallel Systolic FIR.
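For readers unfamiliar with the transposed FIR structure named above, the sketch below models its data flow: every tap multiplies the current input sample, and the products are summed through a chain of registers. It is a generic behavioural model with a hypothetical function name; it does not represent the DSP48's internal pipeline registers or any extra latency they add.

```python
def transposed_fir(samples, coeffs):
    """Transposed-form FIR filter.

    Each tap is one multiply followed by one add, with a register between
    adjacent taps - a repeated multiply-add cell.
    """
    n = len(coeffs)
    regs = [0] * (n - 1)  # registers in the adder chain between taps
    out = []
    for x in samples:
        products = [c * x for c in coeffs]  # all taps see the same input
        y = products[0] + (regs[0] if regs else 0)
        new_regs = []
        for i in range(1, n):
            upstream = regs[i] if i < n - 1 else 0  # last tap has no upstream
            new_regs.append(products[i] + upstream)
        regs = new_regs
        out.append(y)
    return out


# An impulse input reproduces the coefficients in order.
print(transposed_fir([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```

The appeal of this form for hardware is that there is no wide adder tree: each cell only ever adds its own product to its neighbour's registered sum.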
Each embedded multiplier is associated with an adjacent BlockRAM and hence these elements share
interconnect. When a multiplier is used independently of its associated BlockRAM, that BlockRAM can still be
used, but only at data widths of up to 18 bits.
Again, multipliers can be implemented in the main fabric as required using purely slice logic or combining
BlockRAM and slice implemented multiplier blocks. This may be necessary if no embedded multipliers are
available or the design timing requirements are tight.
Xilinx Virtex-II Routing 8.19
• The Xilinx Virtex-II series contains a multitude of routing resources that connect the
elements of the device together;
Routing signals around the device is usually left to the tools to implement. There is a massive number of
possibilities to implement a design on an FPGA and the software tools may take many hours to actually produce
a bitstream for a reasonable design.
The routing is described as hierarchical because different routing options are available depending on how far
a signal has to travel. Clearly, keeping routing distances as short as possible is preferable to ensure high
clock rates.
The dedicated clock distribution lines are of special importance because, when combined with the DCM (Digital
Clock Manager) blocks, they allow high speed clocks to be fed throughout the device with very little skew.
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx, http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II I/O 8.20
The usual board-level difficulties with signal cross-talk, inductance, resonance etc. still exist but interfacing the
FPGA to the board signals is quite achievable given the number of supported I/O standards:
The Virtex-II Pro devices have dedicated RocketIO blocks to deal with high speed I/O requirements and many more
general SelectIO pins for other interfaces. The specific formats supported by each are given below:
The diagram below further illustrates how logic resources/features can be scaled independently of die size
compared to traditional FPGA architectures.
• Why FPGAs provide performance and flexibility advantages over DSP Processors and ASICs through
effectively unlimited reconfigurability, trading area for speed and performing operations in parallel as required;
• How the FPGA structure is generally organised hierarchically into CLBs/LABs, slices/LEs and elements
such as LUTs/RAMs/SRL16s, MUXes and flip-flops and how these elements are used/combined to
implement a design;
• The hierarchical routing lines that connect blocks together across the device and provide clock routing;
• The complexity of the FPGA design flow and the number of software tools and processes that can be
involved;
• The various I/O standards available to allow FPGAs to interface with high-speed signals via board
signals/buses/backplanes etc.
• Why flip-flops are “free” (they exist beside the LUTs anyway) and how they allow high clock rates.