
August 2005, University of Strathclyde, Scotland, UK For Academic Use Only

The DSP Primer 8


FPGA Technology
August 2005, For Academic Use Only, All Rights Reserved
Introduction 8.1
This module will give a top-down overview of FPGA Technology
based on various Xilinx devices;
At the end of the section, the following will have been covered:
FPGA Technology Roadmap and the various
devices available - how FPGAs are progressing
and what might lie ahead;
Performance and flexibility - how FPGAs
compare to DSP Processors and ASICs and
why FPGAs have the advantage;
FPGA Structure - a top down look at what an
FPGA consists of down to the low level
elements;
Introduction to the FPGA design flow - an
indication of the engineering process required
to implement a design;
How the digital logic of a design actually
operates within the FPGA;
Why pipelining and flip-flops/registers are free
and are required for high clock rates;
Memory available to designers within FPGAs
and the different types/options available;
How signals and clocks are effectively routed
throughout the device;
Input/Output interfacing capabilities of FPGAs;
Dedicated arithmetic hardware.
Notes:
FPGA Technology Trends 8.2
General trend is bigger and faster;
This is being achieved by increases in device density through ever
smaller fabrication process technology;
New generations of FPGAs are geared towards implementing entire
systems on a single device;
Features such as RAM, dedicated arithmetic hardware, clock
management and transceivers are available in addition to the main
programmable logic;
FPGAs are also available with embedded processors (embedded in
silicon or as cores within the programmable logic fabric);
Notes:
FPGAs are being incorporated as central processing elements in many applications such as consumer
electronics, automotive, image/video processing, military/aerospace, base-stations, networking/
communications, supercomputing and wireless applications.
The inclusion of embedded (i.e. actually present in silicon - not as soft IP) PowerPC processors in recent Xilinx
devices makes design partitioning and implementation much easier. Many low-speed algorithms that involve a lot
of decision making and jumps in execution are more suited to implementation by microprocessor than FPGA.
The inclusion of the PowerPC blocks by Xilinx is an acknowledgement of this and goes a long way to making
the System on an FPGA goal possible.
Manufacturers may also provide embedded processors as soft IP cores. These cores are implemented on the
main programmable logic fabric and associated development kits allow designers to write code to be executed.
Features such as dedicated arithmetic hardware, clock management and multi-standard, high speed I/O blocks
all assist the engineer in implementing a given design. Problems associated with such features that plague
ASIC (Application Specific Integrated Circuit) designers such as clock skew have all been solved by the FPGA
manufacturer and can be essentially ignored by the FPGA engineer.
FPGA Families 8.3
Flagship FPGA families (e.g. Xilinx Virtex-4) are aimed at implementing
large systems on a single device;
Flagship families are the biggest and most expensive and are not
aimed at high volume applications where cost is a primary factor;
High volume applications (i.e. where an ASIC would traditionally have
been used) are catered for by cheaper FPGA families (e.g. Xilinx
Spartan-3);
High volume devices often contain the same features offered by the
flagship devices at a smaller scale to control costs;
Within FPGA families, multiple device sizes are available at scaling
costs with associated scaling of features such as logic fabric, RAM, I/O
pins, arithmetic hardware etc.
Notes:
Often, low cost, high volume FPGA families are derived directly from larger families making the design process
more familiar (e.g. Spartan-3 from Virtex-II, Spartan-II from Virtex)
Each FPGA family comes in different sizes/packages and speed grades. The exact device required will depend
on factors related to requirements of the target design/application such as:
Area;
Data/sampling rates;
Input/Outputs and associated data rates;
Memory required;
Requirement for embedded processor or not;
Cost ($$$).
FPGA Performance and Flexibility (I) 8.4
Performance of FPGAs is difficult to quantify because algorithms/
systems can be flexibly implemented in many different ways;
Multiply Accumulate (MAC) performance on flagship devices from
Xilinx is in the region of hundreds of GMACs per second running at
speeds of a few hundred MHz;
FPGA manufacturers often give figures for maximum MAC/s using
every piece of logic capable of multiplication - this of course does not
reflect typical systems implemented on FPGAs;
What is clear is that, due to parallelism, FPGAs easily outperform DSP
Processors in terms of data/arithmetic throughput and flexibility;
DSP Processors still have their place though - their design flow is better
understood within the engineering community and some baseband
algorithms do not yet map well to the FPGA fabric;
Notes:
MIPS (Millions of Instructions Per Second or perhaps Meaningless Indicator Of Performance) is often used to
compare DSP Processors but cannot be used to quantify overall FPGA performance.
The problem is that FPGAs are flexible enough to implement algorithms in different ways to suit the
requirements of a particular application.
For example, an application that requires 10 MACs (Multiply Accumulates) can be implemented on an FPGA
or a DSP processor. The FPGA could implement the hardware to perform the 10 MACs one after the other in
serial taking 10 clock cycles or in parallel, taking 1 clock cycle. Indeed it is possible to perform the 10 MACs in
5 clock cycles, or 2 clock cycles - as required. A DSP Processor does not have as much flexibility.
Why is this flexibility useful? The reason is because, if the 10 MACs must be performed quickly, the FPGA can
use a lot of area and perform them in parallel in 1 clock cycle and if the 10 MACs can be done slowly (defined
by the system performance requirements), the FPGA can perform them serially using a 10th of the area but
taking 10 clock cycles - i.e. the FPGA hardware implementation can be tailored to the application and take
advantage of the application requirements/specification.
In this way, speed and area can be traded when implementing on FPGA - DSP processors do not have this
option.
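The trade-off described above can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: the function name `mac_schedule` and the idea of counting "MAC units" as a stand-in for area are assumptions made for this example.

```python
import math


def mac_schedule(num_macs: int, mac_units: int) -> dict:
    """Clock cycles needed when `mac_units` hardware MACs share `num_macs` operations.

    `mac_units` is used as a rough proxy for area: more units means more
    FPGA resources but fewer clock cycles.
    """
    if not 1 <= mac_units <= num_macs:
        raise ValueError("mac_units must be between 1 and num_macs")
    cycles = math.ceil(num_macs / mac_units)
    return {"mac_units": mac_units, "cycles": cycles}


if __name__ == "__main__":
    # The 10-MAC example from the notes: fully serial, two intermediate
    # points, and fully parallel.
    for units in (1, 2, 5, 10):
        print(mac_schedule(10, units))
```

One MAC unit needs 10 cycles, 2 units need 5, 5 units need 2 and 10 units need 1, which are the operating points mentioned in the notes.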
It should also be noted that it is very unlikely that anyone would ever implement an FPGA design that consisted
only of multipliers! Figures given by manufacturers are merely intended to give an idea of the potential
performance of these devices and by how far they outperform DSP Processors (considerably!)
FPGA Performance and Flexibility (II) 8.5
Notes:
More on DSP Processors vs FPGAs.
It must be remembered that an FPGA is still an ASIC. Xilinx manufacture FPGAs, but the devices themselves are
still fully custom integrated circuits at the end of the day - even though they are a special case because they are
highly programmable...
DSP Processors are also ASICs and as ASIC process technology improves and chips get faster, DSP
Processors will get faster...
but so will FPGAs because they are ASICs too! FPGAs already hold a performance advantage over DSP
Processors and this gap will not close as silicon processes get better.
Diagram: FPGAs: DSP for Consumer Digital Video Applications, Xilinx,
http://www.xilinx.com/esp/dvt/collateral/fpga_dsp_adv_in_dvt.pdf
FPGA Performance and Flexibility (III) 8.6
Notes:
A rather hand-wavy diagram that gives an indication of where FPGAs lie in the grand scheme of things in
relation to Custom ICs (ASICs) and DSP Processors.
The surge in FPGA use by manufacturers of electronic systems does seem to indicate that this diagram is close
to the mark however.
The costs and time involved in manufacturing ASICs are prohibitive (especially if bugs are found) when a
designer can have a design running in hardware on an FPGA at their desk and iterate the design as many times
as required with no expensive fabrication in sight!
Diagram: FPGAs: DSP for Consumer Digital Video Applications, Xilinx,
http://www.xilinx.com/esp/dvt/collateral/fpga_dsp_adv_in_dvt.pdf
FPGA Design Flow 8.7
This is a highly simplified overview
of the Xilinx FPGA design flow;
Numerous file format conversions
occur between the many pieces of
software;
The engineer can control and
influence all stages of the process
via constraints and options;
The FPGA market contains many
companies that produce software
tools for various stages of the flow;
The final bitstream configures
every part of the device required
for the implemented design.
Notes:
A more detailed design flow is given below - this doesn't even show all of the possible stages although it does
contain most! It may become clear why the FPGA design flow produces so many files and directories when
you consider all of the processes below. Several stages are grouped/automated and can be run by the high-
level software tools if desired. The engineer usually has the option of running each stage manually however!
Flow diagrams: Xilinx Software Manuals,
http://toolbox.xilinx.com/docsan/xilinx5/manuals.htm
Xilinx Virtex-II Pro FPGA Architecture 8.8
High-level, generic view of the
Xilinx Virtex-II Pro family;
As device size increases, so does
the amount of available resources
such as embedded multipliers,
processors and configurable logic;
The CLBs (Configurable Logic
Blocks) form the main
programmable fabric of the device;
DCMs (Digital Clock Managers)
solve clock management issues
such as skew, phase shifting and
division;
Larger devices also contain more
user I/O pins and I/O functionality.
Notes:
An FPGA is rather abstract looking and it may not appear obvious how a user design maps to the actual
hardware. Luckily, the software tools can take care of a lot of the complexity of doing this once the user has
defined their design. There is still a considerable amount of work for the engineer however and this is especially
true when pushing the limits of the hardware - at this point the software tools may not do a good enough job
and the engineer must get in and around the nuts and bolts themselves!
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Configurable Logic Blocks 8.9
One Xilinx Virtex-II CLB contains
four slices (Virtex/Spartan series
have two slices per CLB);
Any digital logic design can be
implemented within the slice logic
housed by the CLBs;
Slices are interconnected within
their CLBs and via the switch
matrix that links CLBs together;
The Cin and Cout signals are significant because they are highly
useful for implementing arithmetic functions. Two independent Cin/
Cout columns exist per CLB column;
One slice can implement a 2-bit full adder so one CLB can implement
two independent 4-bit full adders as part of a larger bit-width
calculation with other CLBs as required.
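As a rough software illustration of how 2-bit slice adders chain through Cin/Cout to build wider adders, the sketch below models each slice as a 2-bit adder with carry in and carry out. The function names `slice_add2` and `ripple_add` are hypothetical, and the model ignores timing and the real slice wiring.

```python
def slice_add2(a2: int, b2: int, cin: int) -> tuple:
    """Model one slice: add two 2-bit values plus carry-in.

    Returns (2-bit sum, carry-out).
    """
    total = (a2 & 0b11) + (b2 & 0b11) + (cin & 1)
    return total & 0b11, total >> 2


def ripple_add(a: int, b: int, width: int) -> tuple:
    """Add two unsigned `width`-bit values by chaining 2-bit slice adders.

    `width` must be even. Returns (sum truncated to `width` bits, final carry-out).
    """
    if width % 2 != 0:
        raise ValueError("width must be a multiple of 2 bits")
    carry = 0
    result = 0
    for i in range(0, width, 2):
        s, carry = slice_add2(a >> i, b >> i, carry)  # Cout of one slice feeds Cin of the next
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(0b1011, 0b0110, 4))  # a 4-bit adder built from two slices
    print(ripple_add(200, 100, 8))        # an 8-bit adder built from four slices
```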
Notes:
Once the user has entered their design (via VHDL/Verilog for example), the Synthesis process takes the
design and works out how to implement it on the elements of a specific FPGA. The engineer specifies exactly
which device to target (i.e. manufacturer, family, size, package type, speed grade). The synthesis process is a
complex one that can turn any synthesisable VHDL/Verilog into a form that can be taken to FPGA by further
software tools.
In the case of Xilinx, the Synthesis tool will decide how to perform the digital logic operations of the design using
the slice logic available. The FPGA manufacturer tools then take the design through many more stages in order
to get the design into a form from which a bitstream is produced that can be downloaded to an FPGA to
configure it.
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slices (I) 8.10
The majority of user-design
functionality will be implemented by
the slices contained by the CLBs;
For this reason, the primary
measure of Xilinx FPGA device size
is the number of slices present;
Many interconnection possibilities
exist between slice elements
(connections and many elements
not shown here);
The Look Up Tables (LUTs) implement any 4-input boolean function -
the majority of a user digital logic design will be implemented using the
4-input LUTs to perform the actual logic operations;
LUTs can also be used as Shift-Registers or RAM - discussed later.
Notes:
Xilinx slices are where the actual work that implements the user design happens. The different elements can
be interconnected in different ways as determined by the configuration bitstream.
The number of slices available on a device essentially determines its capacity since this is where it all happens!
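To make the idea of a 4-input LUT concrete, the following Python sketch treats the LUT as a 16-entry truth table stored in a 16-bit integer. The helper names `make_lut4` and `lut4` are invented for this example, and the bit ordering (input 0 as the least significant address bit) is an assumption rather than a statement about any particular tool's convention.

```python
def make_lut4(func) -> int:
    """Build a 16-bit truth table from a Python function of four input bits."""
    table = 0
    for addr in range(16):
        a0, a1, a2, a3 = ((addr >> i) & 1 for i in range(4))
        if func(a0, a1, a2, a3):
            table |= 1 << addr
    return table


def lut4(table: int, a0: int, a1: int, a2: int, a3: int) -> int:
    """Evaluate the LUT: the four inputs form an address into the 16 stored bits."""
    addr = a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)
    return (table >> addr) & 1


if __name__ == "__main__":
    and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
    xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
    print(hex(and4), hex(xor4))
    print(lut4(xor4, 1, 0, 1, 1))
```

Any 4-input boolean function reduces to a different 16-bit table, which is why the same hardware element can implement all of them.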
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slices (II) 8.11
The registers provide the means of
implementing synchronous logic;
Registers are vital when designing
for high clock rates - failure to use
them will not yield high speed
performance;
The multiplexers and CY
components provide some of the
routing possibilities for signals
through the slice (shown in more
detail later);
The Arithmetic Logic AND gate at
the bottom has been included to make implementing multiplication
more efficient.
Notes:
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Slice (top half) 8.12
Notes:
All of the interconnections and components are shown.
The software tools will take care of configuring every required element/connection - the user can also do so
manually if required!
When the FPGA is configured with a bitstream (generated by the software tools), the contents of the LUTs and
the routing between the slice elements is defined - forming the user design. The bitstream will also configure
the connection between slices/CLBs etc.
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Registers and Pipelining 8.13
Diagram: Without Pipelining (Slow Clock), several LUTs sit between registers (D/Q) and form the
Longest/Critical Path; With Pipelining (Fast Clock), each LUT is followed by a register.
Possible FPGA clock rate is limited by the longest path between
registers because the signals must travel further through LUTs/wires;
Using the free slice registers keeps the longest path as short as
possible and hence the possible clock rate as high as possible.
Notes:
This is one of the fundamental design principles of FPGA design and must be understood.
On each clock edge, signals must travel through their data path via routing lines, LUTs, MUXes etc. before
arriving at the next flip-flop. This happens to signals within a design all over the device on every clock edge.
Some signals will have further to travel than others and the longest (time) path between two flip-flops/registers
is known as the critical path. It should be noted that the flip-flops are essentially free because every LUT is
paired with a flip-flop that can register the LUT output as required.
It is this critical path that will determine the maximum clock rate that the FPGA can be clocked at. Remember
that the user can choose the clock rate arbitrarily as required. If the critical path is too long, the design may not
be able to be clocked fast enough to meet the specification of the application!
In this case, the engineer must return to the software tools/their design and try and make the design run faster.
This may be achieved by for example: pipelining, redesign, increasing the effort level of the software tools,
adding/removing design constraints or manually editing the design in order to optimise the hardware and reduce
the length of the critical path!
It should be noted that this is the most difficult part of FPGA design - what to do if a design does not meet timing!
There are many options for the engineer to try and knowing which one(s) to use (and how to use them) can be
a bit of a black art...
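The relationship between the critical path and the maximum clock rate can be illustrated numerically. The sketch below is a simplification under stated assumptions: the delay figures are made up for illustration (they are not Xilinx data sheet values), and register setup and clock-to-output times are ignored.

```python
def max_clock_mhz(path_delays_ns) -> float:
    """Maximum clock rate (MHz) given the delays of all register-to-register paths.

    The slowest path (the critical path) sets the limit.
    """
    critical_path_ns = max(path_delays_ns)
    return 1000.0 / critical_path_ns


if __name__ == "__main__":
    # Hypothetical design: three LUT levels of 2 ns each (including routing).
    unpipelined = [6.0]           # all three LUTs between one pair of registers
    pipelined = [2.0, 2.0, 2.0]   # a register after every LUT

    print(f"Without pipelining: {max_clock_mhz(unpipelined):.1f} MHz")
    print(f"With pipelining:    {max_clock_mhz(pipelined):.1f} MHz")
```

In this toy model the pipelined version can be clocked three times faster, at the cost of two extra clock cycles of latency through the data path.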
Xilinx Virtex-II Block RAM 8.14
Xilinx Virtex-II devices have
dedicated 18 Kb (Kilo-bit) Block
RAMs throughout the device;
One of the largest Virtex-II Pro
(XC2VP125) has 556 Block RAMs
and so 556 * 18 = 10,008 Kb of
Block RAM in total;
Block RAM can be written at device
configuration time or written/read
during operation;
Block RAM can be single or dual
port - i.e. two ports can access the
memory at the same time, giving 2
pieces of data per clock - excellent for
DSP (sample and coefficient for ex.).
Notes:
Engineers specify how they want to use the RAM components from within their VHDL/Verilog code - the
software tools then ensure that the actual hardware is made available to the design.
An example of using Block RAM could be to store the numeric values required to modulate a signal by a sine
wave.
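As a sketch of what such a sine table might contain, the Python below generates 1024 signed 18-bit samples of one sine period, which fills exactly one 18 Kb Block RAM (1024 x 18 bits). The depth, width and function name are choices made for this example; in a real design the values would typically be supplied to the tools as RAM initialisation data.

```python
import math

DEPTH = 1024   # number of table entries (assumed for this example)
WIDTH = 18     # bits per entry, signed


def sine_table(depth: int = DEPTH, width: int = WIDTH) -> list:
    """One full period of a sine wave, quantised to signed `width`-bit integers."""
    amplitude = (1 << (width - 1)) - 1   # largest positive signed value
    return [round(amplitude * math.sin(2 * math.pi * n / depth)) for n in range(depth)]


if __name__ == "__main__":
    table = sine_table()
    print("bits used:", DEPTH * WIDTH, "of", 18 * 1024)
    print("first entries:", table[:4])
    print("peak value:", max(table))
```

Modulating a signal then amounts to stepping an address counter through the table and multiplying each signal sample by the value read out.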
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II Distributed RAM 8.15
A LUT can store 16 bits and can be
used as a 16x1 RAM;
Two LUTs can form one 32x1
single-port RAM or one 16x1 dual-
port RAM - i.e. the same address
produces data from both RAMs;
This flexibility allows several single/
dual port RAM configurations of the
128 bits available within one CLB (4
slices * 2 LUTs * 16 bits = 128);
A Virtex-II Pro with 55,616 slices
therefore has 55,616 * 2 LUTs * 16
bits = 1,738 Kb of Distributed RAM;
The ability to create small
RAMs anywhere on the device
is extremely useful - especially
for DSP purposes.
Notes:
An example of using a small distributed RAM could be a chipping sequence for use in a communications
system. The sequence would be stored where it is needed to chip data as it proceeds through the system.
The ability to form larger single/dual port configurations from the smaller ones is further testament to FPGA
flexibility - distributed RAMs need only be as large as required.
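The memory capacity figures quoted on the Block RAM and Distributed RAM slides can be checked with a few lines of Python. The constants come from the slides; the function names are just labels for this example, and 1 Kb is taken as 1024 bits.

```python
LUTS_PER_SLICE = 2
BITS_PER_LUT = 16
SLICES_PER_CLB = 4
BLOCK_RAM_KBITS = 18


def distributed_ram_bits(slices: int) -> int:
    """Total distributed RAM (in bits) if every LUT in `slices` slices is used as RAM."""
    return slices * LUTS_PER_SLICE * BITS_PER_LUT


def block_ram_kbits(blocks: int) -> int:
    """Total Block RAM (in Kb) for a device with `blocks` 18 Kb Block RAMs."""
    return blocks * BLOCK_RAM_KBITS


if __name__ == "__main__":
    print("bits per CLB:", distributed_ram_bits(SLICES_PER_CLB))
    print("distributed RAM, 55,616 slices (Kb):", distributed_ram_bits(55_616) // 1024)
    print("Block RAM, 556 blocks (Kb):", block_ram_kbits(556))
```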
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Shift Registers 8.16
Xilinx LUTs can implement a 16-bit shift register (called an SRL16)
and when combined with the register available to every LUT, 17
delays are possible in one half of a slice;
Shift registers can be cascaded to form longer delays;
The delay can be tapped at any point using the address lines to create
delay lines of length less than the maximum.
Diagram: LUT used as a shift register (inputs D and CLK, address lines A0-A3, output Q) feeding the
slice register (D, CLK, Q).
Notes:
The diagram opposite shows the SRL16s being cascaded to form a larger delay
line.
Note the flexibility of the Xilinx LUTs - this is the 3rd mode they can operate in,
in addition to LUT/RAM.
Diagram opposite: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
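A behavioural model can help show how the address lines select the delay length. The class below is a simplified Python sketch, not a description of the real primitive's ports: the class name, method names and the convention that address 0 gives a delay of one clock are assumptions made for illustration.

```python
class ShiftRegister16:
    """Behavioural model of a 16-bit addressable shift register."""

    def __init__(self):
        self.bits = [0] * 16   # bits[0] holds the most recently shifted-in value

    def clock(self, d: int) -> None:
        """On each clock edge, shift the new bit in and drop the oldest bit."""
        self.bits = [d & 1] + self.bits[:-1]

    def q(self, addr: int) -> int:
        """Read the tap selected by the 4-bit address (delay of addr + 1 clocks)."""
        return self.bits[addr & 0xF]


if __name__ == "__main__":
    srl = ShiftRegister16()
    srl.clock(1)                 # shift in a single 1 ...
    for _ in range(3):
        srl.clock(0)             # ... followed by three 0s
    # The 1 was shifted in four clocks ago, so it appears at address 3.
    print([srl.q(addr) for addr in range(16)])
```

Registering the selected output in the flip-flop beside the LUT adds one further clock of delay, which is where the figure of 17 delays per half slice comes from.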
Xilinx Virtex-4 DSP48 Slice 8.17
The Xilinx Virtex-4
DSP48 slice offers
custom DSP
functionality;
500 MHz throughput;
However, the Transposed/Systolic FIR
structures map more effectively in
this case;
Summation
feedback is also
available for serial
implementations;
Notes:
The Virtex-4 DSP48 slice caters for two types of full-parallel FIR - Systolic and Transposed. The Systolic
structure allows the highest performance due to maximum pipelining and no high input signal fanout. The
Transposed structure has a fixed, low latency compared to the Systolic (whose latency increases with filter
length) but the input signal fanout can limit performance, especially for large filters. Both architectures can be
entirely implemented within DSP48 slices with no external logic.
Diagrams: XtremeDSP Design Considerations User Guide, http://www.xilinx.com
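To illustrate the Transposed structure mentioned above, the following Python sketch models a transposed FIR filter sample by sample and checks it against a direct convolution sum. It is a behavioural model only: the function names are hypothetical, it does not model DSP48 pipeline registers or latency, and the Systolic form is not shown.

```python
def fir_direct(x, h):
    """Reference FIR output: y[n] = sum over k of h[k] * x[n - k]."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_transposed(x, h):
    """Transposed FIR: every tap multiplies the current input sample,
    and the partial sums move through a chain of registers."""
    regs = [0] * (len(h) - 1)          # adder-chain registers between taps
    out = []
    for sample in x:
        products = [c * sample for c in h]   # the input fans out to all multipliers
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its tap's product plus the next register's value.
        regs = [
            products[k + 1] + (regs[k + 1] if k + 1 < len(regs) else 0)
            for k in range(len(regs))
        ]
    return out


if __name__ == "__main__":
    coeffs = [1, -2, 3]
    samples = [1, 2, 3, 4, 5]
    print(fir_transposed(samples, coeffs))
    print(fir_direct(samples, coeffs))
```

The line that multiplies every coefficient by the same `sample` is the software counterpart of the input signal fanout that the notes identify as the performance limit for large Transposed filters.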
Diagram panels: Full-Parallel Transposed FIR; Full-Parallel Systolic FIR.
Xilinx Virtex-II Embedded Multipliers 8.18
Embedded multipliers are arranged
in columns between CLBs;
Multipliers are 18 x 18 bit and are
associated with BlockRAM for easy
access to data;
Can be combinatorial or pipelined
running at over 300MHz;
Combining embedded multipliers
with LUT implemented accumulators
allows MAC engines to be created
(e.g. for use in filters);
Cascade multipliers to implement
larger width multiplications.
Notes:
Each embedded multiplier is associated with an adjacent BlockRAM and hence these elements share
interconnect. When the multiplier is being used without the associated BlockRAM, the BlockRAM can still be
used but with only 18 bits.
Again, multipliers can be implemented in the main fabric as required using purely slice logic or combining
BlockRAM and slice implemented multiplier blocks. This may be necessary if no embedded multipliers are
available or the design timing requirements are tight.
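The idea of cascading multipliers to form a wider multiplication can be sketched as follows. This is an arithmetic illustration only, assuming unsigned operands split into 17-bit pieces so that each piece fits the positive range of an 18-bit signed multiplier; the function names are hypothetical and no claim is made that the tools partition a multiplication in exactly this way.

```python
CHUNK = 17                      # unsigned bits handled per 18-bit signed multiplier
MASK = (1 << CHUNK) - 1


def mul18(a: int, b: int) -> int:
    """Stand-in for one 18 x 18 bit signed embedded multiplier."""
    limit = 1 << 17
    assert -limit <= a < limit and -limit <= b < limit, "operand exceeds 18-bit signed range"
    return a * b


def mul34_unsigned(a: int, b: int) -> int:
    """34 x 34 bit unsigned multiply built from four 18 x 18 multipliers.

    Each operand is split into a low and a high 17-bit piece; the four
    partial products are shifted into position and summed.
    """
    assert 0 <= a < (1 << 34) and 0 <= b < (1 << 34)
    a_lo, a_hi = a & MASK, a >> CHUNK
    b_lo, b_hi = b & MASK, b >> CHUNK
    return (
        mul18(a_lo, b_lo)
        + ((mul18(a_lo, b_hi) + mul18(a_hi, b_lo)) << CHUNK)
        + (mul18(a_hi, b_hi) << (2 * CHUNK))
    )


if __name__ == "__main__":
    a, b = 12_345_678_901, 9_876_543_210
    print(mul34_unsigned(a, b) == a * b)
```

The shifts and additions of the partial products would be implemented in slice logic, which is why wider multiplications consume fabric as well as embedded multipliers.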
Xilinx Virtex-II Routing 8.19
Xilinx Virtex-II series contains a multitude of routing that connects the
elements of the device together;
The configurable routing between CLBs (via the switch matrices) is
complemented by dedicated routing for clock signals, carry chains etc.
Notes:
Routing signals around the device is usually left to the tools to implement. There is a massive number of
possibilities to implement a design on an FPGA and the software tools may take many hours to actually produce
a bitstream for a reasonable design.
The routing possibilities are described as being hierarchical due to the fact that different routing options are
available depending on how far a signal has to travel. Clearly, keeping signals to as short routing distances as
possible is preferable to ensure high clock rates.
The dedicated clock distribution lines are of special importance because when combined with the DCM (Digital
Clock Management) blocks, they allow for high speed clocks to be fed throughout the device with no skew.
Diagram: Virtex-II Pro Platform FPGA Complete Data Sheet, Xilinx,
http://direct.xilinx.com/bvdocs/publications/ds083.pdf
Xilinx Virtex-II I/O 8.20
FPGAs are capable of interfacing with backplanes, buses and other
systems at a board/system level;
A multitude of current and emerging serial/parallel I/O standards are
supported;
In Virtex-II Pro, up to 24 RocketIO Serial Transceiver blocks are available
operating at full-duplex speeds of up to 3.125 Gb/s each;
Also, in Virtex-II, user I/O pins support many single-ended and
differential signalling standards up to 840 Mbps LVDS (Low-Voltage
Differential Signalling);
Virtex-II Pro X family supports up to 20 channels at 10.3125 Gb/s.
Notes:
Getting signals into and out of FPGAs requires high speed signals to be routed into and out of the device on
some sort of board that houses the overall system and the FPGA(s).
The usual board-level difficulties with signal cross-talk, inductance, resonance etc. still exist but interfacing the
FPGA to the board signals is quite achievable given the number of supported I/O standards:
The Virtex-II Pro devices have dedicated RocketIO blocks to deal with high speed I/O requirements and many more
general Select I/O pins for other interfaces. The specific formats supported by each are given below:
Supported standards from:
http://www.xilinx.com/products/virtex2pro/rocketio.htm
http://www.xilinx.com/products/virtex2pro/selectioultra.htm
Xilinx ASMBL Architecture 8.21
Advanced Silicon Modular Block - basis of Virtex-4;
Column based architecture with focused column types;
Mixing column types in different ratios allows application domains with
differing logic resource requirements to be more accurately targeted;
Individual resource types (e.g.
DSP/memory) can be scaled
independently of the die size;
Current FPGA architectures
scale resource types primarily
only with die size.
Notes:
Trivia: ASMBL was renamed to Advanced Silicon Modular Block from Application Specific Modular Block.
The diagram below further illustrates how logic resources/features can be scaled independently of die size
compared to traditional FPGA architectures.
Xilinx see ASMBL as the next stage in programmable logic evolution.
Diagrams: ASMBL Press Kit, Xilinx, http://www.xilinx.com/company/press/kits/asmbl.htm
Xilinx Virtex-4 Platforms 8.22
Designers can select the most appropriate device according to feature
requirements and cost;
DSP is now a major focus industry-wide!
Notes:
Conclusion 8.23
This module has presented an overview of FPGA technology to give a
high-level understanding of:
What features cutting-edge FPGAs contain and the general trend of larger, faster and more features to
support entire systems being implemented on FPGAs (e.g. I/O Transceivers, DSP blocks);
Why FPGAs provide performance and flexibility advantages over DSP Processors and ASICs due to
infinite reconfigurability, trading area for speed and performing operations in parallel as required;
Why FPGA performance is difficult to measure due to their inherent flexibility;
How the FPGA structure is generally organised hierarchically into CLBs/LABs, slices/LEs and elements
such as LUTs/RAMs/SRL16s, MUXes and flip-flops and how these elements are used/combined to
implement a design;
The memory available on FPGAs;
Dedicated arithmetic hardware and the various configurations available;
The hierarchical routing lines that connect blocks together across the device and provide clock routing;
The complexity of the FPGA design flow and the number of software tools and processes that can be
involved;
The various I/O standards available to allow FPGAs to interface with high-speed signals via board
signals/buses/backplanes etc.
Why flip-flops are free (they exist beside the LUTs anyway) and how they allow high clock rates.
Notes:
