04 FPGA Updated

4/4/2011
EE 811
Advanced Digital System Design
Dr. Arshad Aziz
Basic FPGA Architecture
Technology Timeline
[Figure: timeline from 1945 to 2000 showing when each technology appeared: Transistors, ICs (General), SRAMs & DRAMs, Microprocessors, SPLDs, CPLDs, ASICs, FPGAs]
The Design Warrior’s Guide to FPGAs

Devices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
Major FPGA vendors

SRAM-based FPGAs
Xilinx Inc. – www.xilinx.com
Altera Corp. – www.altera.com
Atmel Corp. – www.atmel.com
Lattice Semiconductor Corp. – www.latticesemi.com
Antifuse and flash-based FPGAs
Actel Corp. – www.actel.com
QuickLogic Corp. – www.quicklogic.com
Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
Instant-on | No | Yes | Yes
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
Size of configuration cell | Large (six transistors) | Very small | Medium-small (two transistors)
Power consumption | Medium | Low | Medium
Rad Hard | No | Yes | Not really
The Programmable Marketplace
Q1 Calendar Year 2005
PLD Segment: Xilinx 51%, Altera 33%, Lattice 7%, Actel 5%, QuickLogic 2%, Other 2%
FPGA Sub-Segment: Xilinx 58%, Altera 31%, All Others 11%
Source: Company reports
Latest information available; computed on a 4-quarter rolling basis
FPGA Families
Vendor | Low-cost | High-performance
Xilinx | Spartan 3, Spartan 3E, Spartan 3L | Virtex 4 LX / SX / FX, Virtex 5 LX
Altera | Cyclone II | Stratix II, Stratix II GX
Xilinx
• Primary products: FPGAs and the associated CAD software
– Programmable Logic Devices
– ISE Alliance and Foundation Series Design Software
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
– UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
– Seiko Epson (Japan)
– TSMC (Taiwan)
Source: [Xilinx Inc.]
Xilinx FPGA Families

• Old families
– XC3000, XC4000, XC5200
– Old 0.5µm, 0.35µm and 0.25µm technology. Not
recommended for modern designs.
• Low-cost family
– Spartan/XL – derived from XC4000
– Spartan-II – derived from Virtex
– Spartan-IIE – derived from Virtex-E
– Spartan-3 (90 nm)
– Spartan-3E (90 nm)
– Spartan-3A (90 nm)
• High-performance families
– Virtex (220 nm)
– Virtex-E, Virtex-EM (180 nm)
– Virtex-II, Virtex-II PRO (130 nm)
– Virtex-4 (90 nm)
– Virtex-5 (65 nm)
General structure of an FPGA

Xilinx FPGA
[Figure: Xilinx FPGA floorplan showing Configurable Logic Blocks, Block RAMs, and I/O Blocks]
Generic FPGA architecture:
[Figure: array of Configurable Logic Blocks (CLBs) joined by routing channels made of wire segments, with Connection Blocks, Switch Blocks, and I/O pads]
Xilinx CLB
[Figure: an array of configurable logic blocks (CLBs); each CLB contains slices, and each slice contains two logic cells]

Xilinx Point of Reference

Xilinx Point of Reference

• A Xilinx CLB has FOUR slices
– Each slice has TWO logic cells
– Each logic cell has ONE 4-input LUT plus other logic (carry and control) plus a flip-flop/latch
• For SLICEL slices, these LUTs can be configured as:
1. LUT
• For SLICEM slices, these LUTs can be configured as:
1. LUT
2. 16 x 1 Distributed RAM (16 words x 1 bit/word)
3. 16-bit Shift Register
CLB Structure of Spartan 3
• Each Virtex-II CLB contains four slices
– Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
– A switch matrix provides access to general routing resources
[Figure: CLB with slices S0 to S3, switch matrix, local routing, SHIFT path, two BUFTs, and CIN/COUT carry connections]
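Simplified view of a Xilinx Logic Cell
[Figure: inputs a, b, c, d feed a 4-input LUT (also usable as a 16x1 RAM or a 16-bit shift register) with output y; a mux with input e feeds a flip-flop with output q; the flip-flop has clock, clock enable, and set/reset inputs]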
Simplified Slice Structure
• Each slice has four outputs
– Two registered outputs, two non-registered outputs
– Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
• Carry logic runs vertically, up only
– Two independent carry chains per CLB
[Figure: Slice 0 with two LUT, Carry, and flip-flop (D, CE, PRE, CLR, Q) paths]
Detailed Slice Structure

• The next few slides
discuss the slice
features
– LUTs
– MUXF5, MUXF6,
MUXF7, MUXF8
(only the F5 and
F6 MUX are shown
in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
SRAM Cell (Pass Transistor)

• An SRAM cell can drive the gate (G) terminal of an
NMOS transistor.
• If the SRAM cell (M) = 1, the signal passes from S to D.
• An SRAM cell can be attached to the select line of a
MUX to control it.
Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
– Also called Function Generators (FGs)
– Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant
[Figure: combinatorial logic with inputs A, B, C, D and output Z]
A B C D Z
0 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 1 1 1
0 1 0 0 1
0 1 0 1 1
. . .
1 1 0 0 0
1 1 0 1 0
1 1 1 0 0
1 1 1 1 1
Look Up Table (LUT)

• The LUT is used to realize any Boolean function.
• Assume the function to be realized is y = (a&b) | !c
• This could be achieved by loading the LUT with the
appropriate output values
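As a minimal sketch of what "loading the LUT" means, the Python below enumerates every input combination of y = (a&b) | !c and stores the result as one configuration bit per address. The helper names (`lut_init`, `lut_read`) and the choice of `a` as the least significant address bit are assumptions made for illustration, not part of any vendor tool.

```python
def lut_init(func, n_inputs):
    """Return the 2**n_inputs configuration bits of a LUT that implements func."""
    bits = []
    for addr in range(2 ** n_inputs):
        # Bit i of the address is the value of input i (input 0 is the LSB).
        inputs = [(addr >> i) & 1 for i in range(n_inputs)]
        bits.append(func(*inputs) & 1)
    return bits


def lut_read(bits, *inputs):
    """Evaluate the LUT: the inputs simply form an address into the stored bits."""
    addr = sum(value << i for i, value in enumerate(inputs))
    return bits[addr]


def y(a, b, c):
    # The function from the slide: y = (a & b) | !c
    return (a & b) | (1 - c)


Y_BITS = lut_init(y, 3)
print(Y_BITS)  # [1, 1, 1, 1, 0, 0, 0, 1]
```

Once the eight bits are stored, evaluating the function is only a memory read, which is why the delay through a LUT does not depend on how complex the function is.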
LUT (Look-Up Table) Functionality
• Look-Up tables are primary elements for logic implementation
• Each LUT can implement any function of 4 inputs
[Figure: a LUT with inputs x1, x2, x3, x4 and output y, shown with two example 16-row truth tables]
5-Input Functions implemented using two LUTs
• One CLB Slice can implement any function of 5 inputs
• Logic function is partitioned between two LUTs
• F5 multiplexer selects LUT
[Figure: two LUTs (ROM/RAM) with address inputs A1 to A4 feed the F5 multiplexer, which is selected by BX]
5-Input Functions implemented using two LUTs
X5 X4 X3 X2 X1 Y
0 0 0 0 0 0
0 0 0 0 1 1
0 0 0 1 0 0
0 0 0 1 1 0
0 0 1 0 0 1
0 0 1 0 1 1
0 0 1 1 0 0
0 0 1 1 1 0
0 1 0 0 0 1
0 1 0 0 1 0
0 1 0 1 0 0
0 1 0 1 1 1
0 1 1 0 0 1
0 1 1 0 1 1
0 1 1 1 0 1
0 1 1 1 1 1
1 0 0 0 0 0
1 0 0 0 1 0
1 0 0 1 0 0
1 0 0 1 1 0
1 0 1 0 0 0
1 0 1 0 1 0
1 0 1 1 0 0
1 0 1 1 1 1
1 1 0 0 0 0
1 1 0 0 1 1
1 1 0 1 0 0
1 1 0 1 1 1
1 1 1 0 0 0
1 1 1 0 1 1
1 1 1 1 0 0
1 1 1 1 1 0
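The partitioning in this truth table can be sketched in Python: the 16 rows with X5 = 0 go into one 4-input LUT, the 16 rows with X5 = 1 go into the other, and X5 drives the F5 multiplexer select. The names `LUT_X5_LOW`, `LUT_X5_HIGH`, and `f5` are illustrative assumptions; the Y values are copied from the table.

```python
# Y column of the truth table, rows ordered X5 X4 X3 X2 X1 = 00000 .. 11111.
Y = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]

LUT_X5_LOW = Y[:16]    # contents of the LUT used when X5 = 0
LUT_X5_HIGH = Y[16:]   # contents of the LUT used when X5 = 1


def f5(x5, x4, x3, x2, x1):
    """Both LUTs see X4..X1; the F5 mux picks one LUT output using X5."""
    addr = (x4 << 3) | (x3 << 2) | (x2 << 1) | x1
    low = LUT_X5_LOW[addr]
    high = LUT_X5_HIGH[addr]
    return high if x5 else low


print(f5(0, 0, 0, 0, 1), f5(1, 0, 1, 1, 1), f5(1, 1, 1, 1, 1))  # 1 1 0
```

This is a Shannon expansion of the function around X5, which is why any 5-input function fits in two 4-input LUTs plus one multiplexer.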
Dedicated Expansion Multiplexers
• MUXF5 combines 2 LUTs to create
– Any 5-input function (LUT5)
– Or selected functions up to 9 inputs
– Or 4x1 multiplexer
• MUXF6 combines 2 slices to form
– Any 6-input function (LUT6)
– Or selected functions up to 19 inputs
– 8x1 multiplexer
• Dedicated muxes are faster and more space efficient
[Figure: a CLB in which each slice pairs two LUTs through a MUXF5, and a MUXF6 combines the two slices]
Connecting Look-Up Tables
• MUXF5 combines LUTs in each slice
• MUXF6 combines slices S0 and S1
• MUXF6 combines slices S2 and S3
• MUXF7 combines the two MUXF6 outputs
• MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
[Figure: CLB with slices S0 to S3, each containing an F5 mux, plus the F6, F7, and F8 muxes]
Programmable Logic Block

• Early devices were based on the concept of a programmable logic block, which comprised
• a 3-input lookup table (LUT),
• a register that could act as a flip-flop or a latch,
• a multiplexer, along with a few other elements.
3-, 4-, 5-, or 6-input LUTs?

• The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
• Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
• The first FPGAs were based on 3-input LUTs.
• FPGA vendors and researchers studied the relative merits of 3-, 4-, 5-, and even 6-input LUTs.
• The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
• In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
• However, current logic synthesis tools prefer uniformity and regularity.
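The doubling described above is simple arithmetic, sketched here in Python: an n-input LUT needs 2^n SRAM cells, and because each cell can independently hold 0 or 1 it can realize 2^(2^n) distinct functions. The function names are illustrative assumptions.

```python
def sram_cells(n_inputs):
    """One configuration cell per input combination."""
    return 2 ** n_inputs


def distinct_functions(n_inputs):
    """Each cell is independently 0 or 1, so 2**cells truth tables exist."""
    return 2 ** sram_cells(n_inputs)


for n in (3, 4, 5, 6):
    print(f"{n}-input LUT: {sram_cells(n)} SRAM cells, "
          f"{distinct_functions(n)} possible functions")
```

Going from 4 to 6 inputs quadruples the storage per LUT (16 to 64 cells), which is the cost side of the trade-off the slide refers to.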
FPGA Function generators

• LUT Example: Implement the function
F = ABD + BC D + A B C
• using:
– 2-input LUTs
– 3-input LUTs
– 4-input LUTs
[Figure: three implementations of F with inputs A, B, C, D, built from 2-input, 3-input, and 4-input LUTs]
Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of sum & carry signals
– Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing running from LSB to MSB]
Fast Carry Logic
• Simple, fast, and complete arithmetic logic
– Dedicated XOR gate for single-level sum completion
– Uses dedicated routing resources
– All synthesis tools can infer carry logic
[Figure: two carry chains in a CLB. The first chain runs from CIN through slices S2 and S3 to COUT, which goes to CIN of S2 of the next CLB. The second chain runs from CIN through slices S0 and S1 to COUT, which goes to S0 of the next CLB]
Accessing Carry Logic

• All major synthesis tools can infer carry
logic for arithmetic functions
• Addition (SUM <= A + B)
• Subtraction (DIFF <= A - B)
• Comparators (if A < B then…)
• Counters (count <= count +1)
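To make the inferred structure concrete, here is a hedged behavioral sketch in Python of one common way such a carry chain computes a sum: per bit, the LUT produces the propagate signal a XOR b, a dedicated carry multiplexer either passes the incoming carry or generates a new one, and the dedicated XOR gate forms the sum bit. The function name and bit-serial loop are assumptions for illustration; real tools infer the hardware from the arithmetic operators listed above.

```python
def carry_chain_add(a, b, width, cin=0):
    """Add two unsigned width-bit numbers the way a LUT + carry chain would."""
    carry = cin
    total = 0
    for i in range(width):          # bit 0 (LSB) is at the bottom of the chain
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        propagate = ai ^ bi         # computed in the LUT
        sum_bit = propagate ^ carry  # dedicated XOR gate
        # Dedicated carry mux: pass the carry if propagating, else a == b,
        # so the carry out is simply that shared bit value.
        carry = carry if propagate else ai
        total |= sum_bit << i
    return total, carry


print(carry_chain_add(5, 3, 4))   # (8, 0)
print(carry_chain_add(15, 1, 4))  # (0, 1)
```

Only the carry signal travels from bit to bit, and it does so on dedicated routing, which is why the chain is much faster than an adder built from general-purpose LUTs and interconnect.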
Flexible Sequential Elements
• Either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
– Can be synchronous or asynchronous
• All controls are shared within a slice
– Control signals can be inverted locally within a slice
[Figure: primitives FDRSE_1 (D, S, CE, R, Q), FDCPE (D, PRE, CE, CLR, Q), and LDCPE (D, PRE, CE, G, CLR, Q)]
Shift Register
• Each LUT can be configured as shift register
– Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
[Figure: the LUT drawn as a chain of flip-flops with IN, CE, and CLK inputs; DEPTH[3:0] selects which stage drives OUT]
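A behavioral sketch of such a shift-register LUT is given below in Python. The class name `SRL16`, its methods, and the convention that depth 0 selects the most recently stored bit are assumptions chosen for illustration; they model the idea on the slide rather than any specific primitive's exact port list.

```python
class SRL16:
    """Behavioral model of a LUT used as a 16-stage addressable shift register."""

    def __init__(self):
        # stages[0] holds the most recently shifted-in bit.
        self.stages = [0] * 16

    def clock(self, d, ce=1):
        """Rising clock edge: shift in d when clock enable is asserted."""
        if ce:
            self.stages = [d & 1] + self.stages[:-1]

    def q(self, depth):
        """Combinational read of the stage selected by the 4-bit depth input."""
        return self.stages[depth & 0xF]


srl = SRL16()
for bit in (1, 0, 0, 0):
    srl.clock(bit)

print(srl.q(3), srl.q(0))  # 1 0
```

Changing `depth` changes the delay without reloading the contents, which is what "dynamically addressable" means here.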
Shift Register
[Figure: two 64-bit data paths. Upper path: Operation A (4 cycles) then Operation B (8 cycles), 12 cycles in total. Lower path: Operation C (3 cycles). Result: 9-cycle imbalance]
• Register-rich FPGA
– Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
Shift Register LUT Example
[Figure: the same two 64-bit data paths. Upper path: Operation A (4 cycles) then Operation B (8 cycles), 12 cycles. Lower path: Operation C (3 cycles) then Operation D - NOP (9 cycles), 12 cycles. Paths are statically balanced]
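The balancing arithmetic in this example can be sketched as follows. The code assumes that the "64" in the figure is the data-path width in bits and that each 16-stage shift-register LUT delays one bit, so the helper names and the LUT-count estimate are illustrative assumptions rather than figures from the slide.

```python
import math


def balancing_delays(path_latencies):
    """Cycles of delay to add to each path so all match the longest one."""
    longest = max(path_latencies.values())
    return {name: longest - cycles for name, cycles in path_latencies.items()}


def srl16_count(delay_cycles, bus_width):
    """Shift-register LUTs needed: one per bit for every 16 cycles of delay."""
    if delay_cycles == 0:
        return 0
    return bus_width * math.ceil(delay_cycles / 16)


delays = balancing_delays({"A then B": 4 + 8, "C": 3})
print(delays)                       # {'A then B': 0, 'C': 9}
print(srl16_count(delays["C"], 64))  # 64
```

Under these assumptions the 9-cycle NOP costs 64 LUTs, whereas building the same delay from flip-flops would take 9 x 64 = 576 registers.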
Distributed RAM
• CLB LUT configurable as Distributed RAM
– An LUT equals 16x1 RAM
– Cascade LUTs to increase RAM size
• Synchronous write
• Asynchronous read
– Can create a synchronous read by using extra flip-flops
– Naturally, distributed RAM read is asynchronous
• Two LUTs can make
– 32 x 1 single-port RAM
– 16 x 2 single-port RAM
– 16 x 1 dual-port RAM
[Figure: one LUT = RAM16X1S (D, WE, WCLK, A0 to A3, O); two LUTs = RAM32X1S (D, WE, WCLK, A0 to A4, O) or RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1) or RAM16X1D (D, WE, WCLK, A0 to A3, SPO, DPRA0 to DPRA3, DPO)]
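The difference between the synchronous write and the asynchronous read can be sketched behaviorally in Python. The class names and the read-before-write ordering of the registered wrapper are assumptions for illustration only.

```python
class Ram16x1S:
    """16 x 1 distributed RAM: writes happen on the clock, reads do not wait."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, addr):
        """Asynchronous read: output follows the address immediately."""
        return self.cells[addr & 0xF]

    def clock(self, addr, d=0, we=0):
        """Synchronous write: stored only on a clock edge with WE asserted."""
        if we:
            self.cells[addr & 0xF] = d & 1


class RegisteredRead:
    """Adds an output flip-flop to make the read synchronous."""

    def __init__(self, ram):
        self.ram = ram
        self.q = 0

    def clock(self, addr, d=0, we=0):
        self.q = self.ram.read(addr)   # capture the value present before the edge
        self.ram.clock(addr, d, we)


ram = Ram16x1S()
ram.clock(5, d=1, we=1)
print(ram.read(5))  # 1, visible without another clock edge

reg = RegisteredRead(ram)
print(reg.q)        # 0, nothing captured yet
reg.clock(5)
print(reg.q)        # 1, available one clock edge after the address is applied
```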
Xilinx Multipurpose LUT

RAM Blocks and Multipliers in Xilinx FPGAs
Embedded Ram Blocks

• Many applications require memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
• Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
• These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.
Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B]
• Most efficient memory implementation
– Dedicated blocks of memory
• Ideal for most memory requirements
– 4 to 104 memory blocks
• 18 kbits = 18,432 bits per block (16 kbits without parity bits)
– Use multiple blocks for larger memories
• Builds both single and true dual-port RAMs
• Synchronous write and read (different from distributed RAM)
Spartan-3 Block RAM Amounts
Block RAM can have various configurations (port aspect ratios)
Configuration | Address range | Data width
16k x 1 | 0 to 16,383 | 1
8k x 2 | 0 to 8,191 | 2
4k x 4 | 0 to 4,095 | 4
2k x (8+1) | 0 to 2,047 | 8+1
1024 x (16+2) | 0 to 1,023 | 16+2
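These configurations all describe the same 16 Kbit of data storage viewed at different widths. A short Python sketch of that relationship follows; the function name is an assumption, and the parity bits (the "+1" and "+2") are left out of the calculation because they do not change the depth.

```python
DATA_BITS = 16 * 1024  # data capacity of one block, excluding parity bits


def aspect_ratio(data_width):
    """Return (depth, address_bits) for a given data width in bits."""
    depth = DATA_BITS // data_width
    address_bits = depth.bit_length() - 1  # exact log2 for power-of-two depths
    return depth, address_bits


for width in (1, 2, 4, 8, 16):
    depth, address_bits = aspect_ratio(width)
    print(f"{depth} x {width}: addresses 0 to {depth - 1}, {address_bits} address bits")
```

Doubling the data width halves the depth and removes one address bit, which is why the highest address drops from 16,383 to 1,023 across the configurations listed.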
Block RAM Port Aspect Ratios
Single-Port Block RAM
Dual-Port Block RAM
Dual-Port Bus Flexibility
[Figure: RAMB4_S16_S8 primitive. Port A in: WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[17:0]; Port A out: DOA[17:0] (1K depth, 18-bit width). Port B in: WEB, ENB, RSTB, CLKB, ADDRB[10:0], DIB[8:0]; Port B out: DOB[8:0] (2K depth, 9-bit width)]
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic
Two Independent Single-Port RAMs
• Added advantage of True Dual-Port
– No wasted RAM bits
• Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs
– Simultaneous independent access to each RAM
• To access the lower RAM
– Tie the MSB address bit to Logic Low
• To access the upper RAM
– Tie the MSB address bit to Logic High
[Figure: RAMB4_S1_S1 primitive. Port A (8K depth, 1-bit width): WEA, ENA, RSTA, CLKA, DIA[0], DOA[0], address = 0, ADDR[12:0]. Port B (8K depth, 1-bit width): WEB, ENB, RSTB, CLKB, DIB[0], DOB[0], address = 1, ADDR[12:0]]
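The address trick can be sketched in Python: each port supplies a 13-bit address, and the 14th (most significant) bit is tied low on one port and high on the other, so the two ports can never touch the same cell. The class and helper names are assumptions for illustration.

```python
class DualPortRam16Kx1:
    """One shared 16K x 1 array reachable through either port."""

    def __init__(self):
        self.cells = [0] * (16 * 1024)

    def write(self, addr, bit):
        self.cells[addr] = bit & 1

    def read(self, addr):
        return self.cells[addr]


MSB = 1 << 13  # the address bit that separates the two 8K halves


def lower(addr):
    """Port A address: MSB tied to logic low."""
    return addr & (MSB - 1)


def upper(addr):
    """Port B address: MSB tied to logic high."""
    return MSB | (addr & (MSB - 1))


ram = DualPortRam16Kx1()
ram.write(lower(5), 1)   # "RAM A", location 5
ram.write(upper(5), 0)   # "RAM B", location 5, a different physical cell
print(ram.read(lower(5)), ram.read(upper(5)))  # 1 0
```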
Embedded Multipliers
• Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
• Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).
18 x 18 Embedded Multiplier
• Fast arithmetic functions
– Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
• Fully combinational
• Optional registers with CE & RST (pipeline)
• Independent from adjacent block RAM
18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
– 2’s complement signed operation
• Multipliers are organized in columns
[Figure: 18 x 18 Multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and Output (36 bits)]
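A small Python sketch of the arithmetic such a block performs: two 18-bit two's complement operands produce a 36-bit two's complement product. The helper names are assumptions; the code models the number format only, not timing or pipelining.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def mult18x18(data_a, data_b):
    """Return the 36-bit output pattern of a signed 18 x 18 multiply."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)


print(mult18x18(3, 5))                         # 15
print(to_signed(mult18x18(0x3FFFF, 2), 36))    # -2, since 0x3FFFF is -1 in 18 bits
```

The most extreme product, (-2^17) x (-2^17) = 2^34, still fits in 36 signed bits, so the output width never overflows.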
Positions of Multipliers
Asynchronous 18-bit Multiplier
18-bit Multiplier with Register
A simple clock tree
[Figure: a clock signal from the outside world enters through a special clock pin and pad and fans out through the clock tree to the flip-flops]
Digital Clock Manager (DCM)
[Figure: a clock signal from the outside world enters through a special clock pin and pad into the clock manager, which produces daughter clocks used to drive internal clock trees or output pins]

Digital Clock Managers (DCM)

• The clock pin is usually connected to a special hard-wired function called a clock manager that generates “daughter clocks”.
• The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
• There might be multiple clock managers supporting only a subset of features (jitter removal, frequency synthesis, …)
DCM: Jitter Removal

• In the real world, clock edges may arrive a little early or a little late.
• The result of this varying delay is a fuzzy clock (jitter).
• The FPGA clock manager can be used to detect and correct for this jitter and provide a “clean” daughter clock signal for use inside the device.
DCM: Frequency Synthesis

• The frequency of the clock signal presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
• The clock manager can be used to generate daughter
clocks with frequencies that are derived by
multiplying or dividing the original signal.
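As a worked example of multiplying and dividing, the Python sketch below computes a daughter clock frequency as f_in x M / D. The parameter names `multiply` and `divide` and the example values are assumptions; real clock managers restrict both to device-specific ranges that are not modeled here.

```python
from fractions import Fraction


def daughter_frequency(f_in_hz, multiply=1, divide=1):
    """Frequency of a daughter clock derived from f_in by a ratio M / D."""
    if multiply < 1 or divide < 1:
        raise ValueError("multiply and divide must be positive integers")
    return Fraction(f_in_hz) * multiply / divide


# A 50 MHz input clock multiplied by 7 and divided by 2.
f_out = daughter_frequency(50_000_000, multiply=7, divide=2)
print(float(f_out) / 1e6, "MHz")  # 175.0 MHz
```

Using an exact fraction avoids rounding error for ratios such as 1/3 that do not have a finite decimal form.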
DCM: Phase Shifting

• Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other.
• Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme)
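Basic I/O Block Structure
[Figure: IOB with three register paths sharing Clock and Set/Reset. Three-state path: flip-flop (D, Q, EC, SR) with Three-State FF Enable, driving the Three-State Control. Output path: flip-flop (D, Q, EC, SR) with Output FF Enable. Input path: Direct Input plus Registered Input through a flip-flop (D, Q, EC, SR) with Input FF Enable]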
IOB Functionality
• IOB provides interface between the package
pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
– advised for high-performance I/O
• Inputs can be delayed
Configurable I/O Impedances

• The signals used to connect devices on today’s circuit boards often have fast edge rates.
• To prevent signals from reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
• In the past, resistors were applied as discrete components (outside the FPGA).
• Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.
Spartan 3 Family
Attributes
FPGA Nomenclature
Spartan-3 FPGA Family Members
2001 – Virtex-II FPGA Family

• Virtex-II FPGA introduced followed by Virtex-II Pro in 2003
– 444 18x18 Multipliers & 18kbit block RAMs introduced
– Gbit Serial I/O Communications & PowerPC Processors introduced
– Complex Floating Point Algorithm Implementation now possible
• Virtex-II / Pro
– 44,000 Logic Slices
– 444 18Kbit BRAMs
– 444 18x18 Multipliers
– 2 PowerPC Processors
– 20 Gbit I/O
– 1164 Max User I/O
Virtex II Pro Floorplan
Up to 16 serial transceivers
• 622 Mbps to 3.125 Gbps
• 1 to 4 PowerPCs
• 4 to 16 multi-gigabit transceivers
• 12 to 216 multipliers
• 3,000 to 50,000 logic cells
• 200k to 4M bits RAM
• 204 to 852 I/Os
[Figure: floorplan showing the PowerPCs and the logic cells]
Virtex-II Pro (Selection)
Embedded Processor Cores

(Hard and Soft)
• The majority of designs make use of microprocessors.
• These appeared as discrete devices on the circuit board.
• Lately, high-end FPGAs have become available that
contain one or more embedded microprocessors
(referred to as microprocessor cores).
• There are two types of cores:
• A hard microprocessor core is implemented as a
dedicated predefined block (two approaches)
• A soft microprocessor core is implemented by
configuring a group of programmable logic blocks to
act as a microprocessor.
Embedded Core (Inside)

• Xilinx and Altera tend to embed one or more microprocessor
cores directly into the main FPGA fabric (PowerPC)
• In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
• The main advantage of this scheme is the inherent speed advantage to be gained from having the processor core in intimate proximity to the FPGA fabric.
Soft Core
• As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
• Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
ADVANTAGE?
1. The main advantage of this scheme is that the user need only implement a core if he/she needs it.
2. Also, the user can instantiate as many cores as they require until they run out of resources!
Virtex Architectures
Built for high-performance applications
Other Families include

• Virtex-II Pro
• Virtex-4
• Virtex-5
Latest family includes
• Virtex-6
Virtex-II Pro Architecture
Contains embedded Processors and Multi-Gigabit Transceivers
• Advanced FPGA Logic – 99k logic cells
• High performance True Dual-port RAM - 8 Mb
• SelectIO™-Ultra Technology - 1164 I/O
• XtremeDSP Functionality - Embedded multipliers
• RocketIO™ and RocketIO X High-speed Serial Transceivers - 622 Mbps to 3.125 Gbps
• PowerPC™ Processors, 400+ MHz Clock Rate - 2
• XCITE Digitally Controlled Impedance - Any I/O
• DCM™ Digital Clock Management - 12
130 nm, 9 layer copper in 300 mm wafer technology


Virtex-4 Family
Advanced Silicon Modular BLock (ASMBL) Architecture
Optimized for logic, Embedded, and Signal Processing
Resource | LX | FX | SX
Logic | 14K–200K LCs | 12K–140K LCs | 23K–55K LCs
Memory | 0.9–6 Mb | 0.6–10 Mb | 2.3–5.7 Mb
DCMs | 4–12 | 4–20 | 4–8
DSP Slices | 32–96 | 32–192 | 128–512
SelectIO | 240–960 | 240–896 | 320–640
RocketIO | N/A | 0–24 Channels | N/A
PowerPC | N/A | 1 or 2 Cores | N/A
Ethernet MAC | N/A | 2 or 4 Cores | N/A
Virtex-4 Architecture
• RocketIO™ Multi-Gigabit Transceivers: 622 Mbps–10.3 Gbps
• Smart RAM: New block RAM/FIFO
• Xesium Clocking Technology: 500 MHz
• Advanced CLBs: 200K Logic Cells
• Tri-Mode Ethernet MAC: 10/100/1000 Mbps
• XtremeDSP™ Technology Slices: 256 18x18 GMACs
• 1 Gbps SelectIO™: ChipSync™ Source synch, XCITE Active Termination
• PowerPC™ 405 with APU Interface: 450 MHz, 680 DMIPS
Virtex-5 Family
Optimized for logic, Embedded, Signal Processing, and High-Speed Connectivity
Virtex™-5 Platforms: LX (Logic), LXT (Logic/Serial), SXT (DSP/Serial), FXT (Emb./Serial)
[Figure: per-platform comparison of Logic, On-chip RAM, DSP Capabilities, Parallel I/Os, Serial I/Os, and PowerPC® Processors]
Virtex-5 Architecture
Features (marked Enhanced or New on the slide):
• 36Kbit Dual-Port Block RAM / FIFO with Integrated ECC
• Most Advanced High-Performance Real 6LUT Logic Fabric
• 550 MHz Clock Management Tile with DCM and PLL
• PCI Express® Endpoint Block
• SelectIO with ChipSync Technology and XCITE DCI
• System Monitor Function with Built-in ADC
• Advanced Configuration Options
• Next Generation PowerPC® Embedded Processor
• 25x18 DSP Slice with Integrated ALU
• RocketIO™ Transceiver Options
– Low-Power GTP: Up to 3.75 Gbps
– High-Performance GTX: Up to 6.5 Gbps
• Tri-Mode 10/100/1000 Mbps Ethernet MACs
The Spartan-3 Family
Built for high volume, low-cost applications
• 18x18 bit Embedded Pipelined Multipliers for efficient DSP
• Configurable 18K Block RAMs + Distributed RAM
• 4 I/O Banks, Support for all I/O Standards including PCI, DDR333, RSDS, mini-LVDS
• Up to eight on-chip Digital Clock Managers to support multiple system clocks
[Figure: Spartan-3 device with its four I/O banks]
Spartan-3 Family
Based upon Virtex-II Architecture – Optimized for Lower Cost
• Smaller process = lower core voltage

– .09 micron versus .15 micron
– Vccint = 1.2V versus 1.5V
• Logic resources
– Only one-half of the slices support RAM or SRL16s (SLICEM)
– Fewer block RAMs and multiplier blocks
• Clock Resources
– Fewer global clock multiplexers and DCM blocks
• I/O Resources
– Fewer pins per package
– No internal 3-state buffers
– Support for different standards
• New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
• Default is LVCMOS, versus LVTTL
SLICEM and SLICEL
• Each Spartan™-3 CLB contains four slices
– Similar to the Virtex™-II
• Slices are grouped in pairs
– Left-hand SLICEM (Memory)
• LUTs can be configured as memory or SRL16
– Right-hand SLICEL (Logic)
• LUT can be used as logic only
[Figure: CLB with switch matrix and fast connects; left-hand SLICEM pair (Slice X0Y0, Slice X0Y1) with SHIFTIN/SHIFTOUT, right-hand SLICEL pair (Slice X1Y0, Slice X1Y1); CIN/COUT carry connections on both columns]
Multiple Domain-optimized Platforms
Spartan-3E Features
• More gates per I/O than Spartan-3
• Removed some I/O standards
– Higher-drive LVCMOS
– GTL, GTLP
– SSTL2_II
– HSTL_II_18, HSTL_I, HSTL_III
– LVDS_EXT, ULVDS
• DDR Cascade
– Internal data is presented on a single clock edge
• 16 BUFGMUXes on left and right sides
– Drive half the chip only
– In addition to eight global clocks
• Pipelined multipliers
• Additional configuration modes
– SPI, BPI
– Multi-Boot mode
Spartan-3A DSP Features

• Increased amount of block memory (BRAM)
– 1512K of S3A1800 vs 648 K of S3E1600
• More XtremeDSP DSP48A slices
– Replaces Embedded multiplier of Spartan-3E
• 3400A – 126 DSP48As
• 1800A – 84 DSP48As
Spartan-3A DSP
Tuning DSP Performance
XtremeDSP DSP48A Slice

• Integrated XtremeDSP Slice
– Application optimized capacity
– Integrated pre-adder optimized for filters
– 250 MHz operation, standard speed grade
– Compatible with Virtex-DSP
• Increased memory capacity and performance
– Also important for embedded processing, complex IP, etc.
DSP48 Comparison
Function | DSP48 | DSP48E | DSP48A | Benefit
Multiplier | 18 x 18 | 25 x 18 | 18 x 18 | Reduces FPGA resource needs for DSP algorithms.
Pre-Adder | No | No | Yes | Reduces the critical path timing in FIR filter applications for better performance. Important in FIR filter construction.
Cascade Inputs | One | Two | One | Enables fast data path chaining of DSP48 blocks for larger filters.
Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters.
Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition and the very valuable rounding of multiplication away from zero.
Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions.
Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables the selection of ALU function on a clock cycle basis. Enables multiple functions to be selected (Add, Subtract, or Compare).
Pattern Detect | No | Yes | No | This feature supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
SIMD ALU Support | No | Yes | No | Enables parallel ALU operations on multiple data sets.
Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed limiting path.
Spartan-3A Device Table
Resource | Spartan-3A XC3S1400A | Spartan-3A DSP XC3SD1800A | Spartan-3A DSP XC3SD3400A
XtremeDSP DSP48A Slices | - | 84 | 126
Dedicated Multipliers | 32 | DSP48As | DSP48As
Block RAM Blocks | 32 | 84 | 126
Block RAM (Kb) | 576 | 1,512 | 2,268
Distributed RAM (Kb) | 176 | 260 | 373
FFs/LUTs | 22,528 | 33,280 | 47,744
Logic Cells | 25,344 | 37,440 | 53,712
DCMs | 8 | 8 | 8
Max Diff I/O Pairs | 227 | 227 | 213
CS484 19x19mm (0.8mm pitch) | - | 309 | 309
*FG676 27x27mm (1.0mm pitch) | 502 | 519 | 469
Latest Families
Architecture Alignment
• Virtex-6 FPGAs: 760K Logic Cell Device
– FIFO Logic
– Tri-mode EMAC
– System Monitor
• Common Resources
– LUT-6 CLB
– BlockRAM
– DSP Slices
– High-performance Clocking
– Parallel I/O
– HSS Transceivers*
– PCIe® Interface
• Spartan-6 FPGAs: 150K Logic Cell Device
– Hardened Memory Controllers
– 3.3 Volt compatible I/O
*Optimized for target application in each family
Enables IP Portability, Protects Design Investments
Addressing the Broad Range of Technical Requirements
[Figure: device families plotted against market size and application market segments (+ 100s more)]
• Spartan-6 LX: Lowest cost logic + DSP
• Spartan-6 LXT: Lowest logic + high-speed serial
• Virtex-6 LXT: High logic density + serial connectivity
• Virtex-6 SXT: DSP + logic + serial connectivity
• Virtex-6 HXT: Ultra high-speed serial connectivity + logic

Designers Eccentrics
• Higher System Performance
– More design margin to simplify designs
– Higher integrated functionality
• Lower System Cost
– Reduce BOM
– Implement design in a smaller device & lower speed-
grade
• Lower Power
– Help meet power budgets
– Eliminate heat sinks & fans
– Prevent thermal runaway
Virtex-6 Family
Virtex® Product & Process Evolution
[Figure: six Virtex generations by process node: Virtex (220-nm), Virtex-E (180-nm), Virtex-II (150-nm), Virtex-II Pro (130-nm), Virtex-4 (90-nm), Virtex-5 (65-nm), Virtex-6 (40-nm)]
Delivering Balanced Performance, Power, and Cost
Strong Focus on Power Reduction
• Static Power Reduction

– Higher distribution of low leakage transistors
• Dynamic Power Reduction
– Reduced capacitance through device shrink
• Reduced Core Voltage Devices Lower Overall Power
– VCCINT = 0.9V option allows power / performance tradeoff
• I/O Power Improvements
– Dynamic termination
• System Monitor
– Allows sophisticated monitoring of temperature and voltage
Up to 50% Power Reduction vs. Previous Generation

Virtex-6 Logic Fabric
• Virtex-6 Configurable Logic Block (CLB)
– Each CLB contains two slices
– Each slice contains four 6-input Lookup Tables (6LUT)
• Slices implement logic functions (slice_l)
• Slices for memories and shift registers (slice_m)
• LUT6 implements
– All functions of up to 6 variables
– Two functions of up to 5 or less variables each
– Shift registers up to 32 stages long
– Memories of 64 bits
• Multiple configurations within a slice
[Figure: a CLB made of two slices, each containing four LUTs]
Power Consumption Benefits
• Shift register mode greatly reduces power consumption over FF implementation
Performance Benefits
• Increased ratio of slice_m – memories available closer to the source or target logic
Cost Benefits
• Can pack logic and memory functions more efficiently
Higher DSP Performance

• Most advanced DSP architecture
– New optional pre-adder for symmetric filters
– 25x18 multiplier
• High resolution filters
• Efficient floating point support
– ALU-like second stage enables mapping of
advanced operations
• Programmable op-code
• SIMD support
• Addition / Subtraction / Logic functions
– Pattern detector
• Lowest power consumption
• Highest DSP slice capacity
– Up to 2K DSP Slices
Virtex®-6 LXT / SXT FPGAs
Spartan-6 Family
Spartan-6
• Next Generation 45nm Spartan Family
– Increased performance & density
– Evolutionary feature enhancements
– Dramatic cost & power reductions
• Two Silicon Platforms

– LX: Cost optimized Logic, Memory
– LXT: LX features plus High-Speed Serial
Connectivity
– More unified & integrated with Virtex
Delivering the Optimal Balance of Cost, Power & Performance
Spartan-6 Logic Evolution
Higher Performance, Increased Utilization
• Modified Virtex 6-input LUT
– 4 additional flip-flops per slice
– Higher utilization for register intensive designs
• Efficient & Capable
– Logic
– Arithmetic functions
– Distributed RAM & shift registers
– Interconnect
• Up to 25% Higher Performance
[Figure: NEW Efficient Design. Spartan-3A Series & Earlier: LUT / FF pair built on a 4LUT, great general-purpose logic. Spartan-6: LUT / dual FF pair built on a 6LUT, 6-input LUT & 2nd flip-flop for higher utilization]
Spartan-6 CLB Logic Slices

SliceM (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes, Distributed RAM / SRL logic
SliceL (25%): LUT6, 8 Registers, Carry Logic, Wide Function Muxes
SliceX (50%): LUT6, Optimized for Logic, 8 Registers
Slice mix chosen for the optimal balance of Cost, Power & Performance
Spartan-6 Lowest Total Power
• Static power reductions

– Process & architectural innovations
• Dynamic power reduction
– Lower node capacitance & architectural innovations
• More hard IP functionality
– Integrated transceivers & other logic reduce power
– Hard IP uses less current & power than soft IP
• Lower IO power
• Low power option -1L reduces power even further
• Fewer supply rails reduce power
Spartan-6 Hard Memory Controller
• New Hard Block Memory Controller

– Up to 4 controllers per device
• Why a Hard Memory Block?
– Very common design component
– Multiple customer benefits
Spartan-6 Hard Block Memory Controller

Customer Requests | Benefits
Higher performance | Up to 800 Mbps
Lower cost | Saves soft logic, smaller die
Lower power | Dedicated logic
Easier designs | Timing closure no longer an issue; configurable MultiPort user interface; CoreGen/MIG wizard & EDK support
Memory Controller
• Only low cost FPGA with a “hard” memory controller
• Guaranteed memory interface performance providing
– Reduced engineering & board design time
– DDR, DDR2, DDR3 & LP DDR support
– Up to 12.8 Gbps bandwidth for each memory controller
• Automatic calibration features
• Multiport structure for user interface
– Six 32-bit programmable ports from fabric
– Controller interface to 4, 8 or 16 bit memory devices
[Figure: Spartan-6 memory interfaces; labels: DRAM (DDR, DDR2, DDR3, LP DDR), SRAM, FLASH, EEPROM]
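A quick arithmetic check of the per-controller bandwidth, as a Python sketch. It assumes the 800 Mbps figure quoted for the hard memory controller is a per-data-pin rate and that peak bandwidth is simply rate times interface width; both are assumptions made for illustration, and the function name is hypothetical.

```python
def peak_bandwidth_gbps(rate_mbps_per_pin, width_bits):
    """Peak bandwidth = per-pin data rate x data bus width, in Gbit/s."""
    return rate_mbps_per_pin * width_bits / 1000.0

# Supported interface widths from the slide: 4, 8 or 16 bits
for width in (4, 8, 16):
    print(width, "bit:", peak_bandwidth_gbps(800, width), "Gbit/s")
```

With a 16-bit interface this gives 12,800 Mbit/s, i.e. 12.8 Gbit/s.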
Integrated DSP Slice

XtremeDSP DSP48A1 Slice
• 250 MHz implementation
– Fast multiplier & 48 bit adder
– ASIC-like performance
• Input and output registers for higher speed
Optimizes FIR filter applications
Better, More BRAM
• More Block RAMs

– 2x higher BRAM to Logic Cell ratio than Spartan-3A platform
• More port flexibility
– 18K can be split into two 9K BRAM blocks that can be independently addressed
[Figure: one 18K BRAM OR two 9K BRAMs]
• Improves buffering, caching & data storage
– Excellent for embedded processing, communication protocols
– Enables DSP blocks to provide more efficient video and surveillance algorithms
• Lower Static Power

Compare to Spartan-3A
Twice the Capabilities, Half the Power, Hard Blocks!
Feature | Extended Spartan-3A (90nm) | Spartan-6 (45nm)
Logic Cells | Up to 55K | Up to 150K
LUT Design | 4-input LUT + FF | 6-input LUT + 2FF
Block RAM (Mbit) | Up to 2 Mbit | Up to 5 Mbit
Transceiver Count / Speed | no | Up to 8 / Up to 3.125 Gbps
Voltage Scaling | No (1.2V only) | Yes (1.2V, 1.0V)
Static Power (typ mW) | 11 mW (smallest density) | Up to 60% less!
Memory Interface | 400 Mbps | DDR3 800 Mbps
Max Differential IO | 640 Mbps | 1050 Mbps
Multipliers/DSP | Up to 126 Multipliers / DSP | Up to 184 DSP48 Blocks
Memory Controllers | no | Up to 4 Hard Blocks
Clock Management | DCM Only | DCM & PLL
PCI Express Endpoint | no | Yes, Gen 1
Security | Device DNA Only | Device DNA & AES
Spartan-6 LX / LXT FPGAs
** All memory controllers support a x16 interface, except in the CS225 package, where only x8 is supported
FPGA Design Flow
Design process (1)

Specification
Design and implement a simple unit that speeds up
encryption with an RC5-like cipher with a
fixed key set on an 8031 microcontroller. Unlike in
experiment 5, this time your unit has to be able
to perform the encryption algorithm by itself,
executing 32 rounds…..
VHDL description (Your VHDL Source Files)

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity RC5_core is
port(
clock, reset, encr_decr: in std_logic;
data_input: in std_logic_vector(31 downto 0);
data_output: out std_logic_vector(31 downto 0);
out_full: in std_logic;
key_input: in std_logic_vector(31 downto 0);
key_read: out std_logic
);
end RC5_core;
Functional simulation
Synthesis
Post-synthesis simulation
Design process (2)

Implementation (Mapping, Placing & Routing)
Timing simulation
Configuration
On chip testing
Design Process control from Active-HDL
Logic Synthesis
VHDL description Circuit netlist

architecture MLU_DATAFLOW of MLU is
signal A1:STD_LOGIC;
signal B1:STD_LOGIC;
signal Y1:STD_LOGIC;
signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
A1<=A when (NEG_A='0') else
not A;
B1<=B when (NEG_B='0') else
not B;
Y<=Y1 when (NEG_Y='0') else
not Y1;
MUX_0<=A1 and B1;
MUX_1<=A1 or B1;
MUX_2<=A1 xor B1;
MUX_3<=A1 xnor B1;
with (L1 & L0) select
Y1<=MUX_0 when "00",
MUX_1 when "01",
MUX_2 when "10",
MUX_3 when others;
end MLU_DATAFLOW;
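For checking the synthesized netlist in simulation, the dataflow above can be mirrored by a small software reference model. This is a minimal Python sketch with all signals modelled as 0/1 integers; the function name `mlu` and the argument order are assumptions, since the entity's port list is not shown here.

```python
def mlu(a, b, neg_a, neg_b, neg_y, l1, l0):
    """Reference model of the MLU dataflow architecture (all values are 0 or 1)."""
    a1 = a ^ neg_a                 # A1 <= A when NEG_A='0' else not A
    b1 = b ^ neg_b                 # B1 <= B when NEG_B='0' else not B
    mux = (a1 & b1,                # MUX_0: and
           a1 | b1,                # MUX_1: or
           a1 ^ b1,                # MUX_2: xor
           1 - (a1 ^ b1))          # MUX_3: xnor
    y1 = mux[(l1 << 1) | l0]       # with (L1 & L0) select
    return y1 ^ neg_y              # Y <= Y1 when NEG_Y='0' else not Y1

print(mlu(1, 1, 0, 0, 0, 0, 0))    # AND of 1 and 1
print(mlu(1, 0, 0, 0, 0, 1, 0))    # XOR of 1 and 0
```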
Synthesis Tools
XST
… and others
Features of synthesis tools
• Interpret RTL code
• Synplify Pro: Produces synthesized circuit netlist in a standard
EDIF (.edf) format
– Can optionally produce .VHM (VHDL code merged into one) file
for post-synthesis simulation
• XST: Produces synthesized circuit netlist in NGC format
• Netlist is composed of gates in the particular Xilinx
implementation library
– http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has
information on libraries
• Give preliminary performance estimates
• Some can display circuit schematics corresponding to EDIF
netlist
Timing report after synthesis

Performance Summary
*******************
Worst slack in design: -0.924
                 Requested   Estimated   Requested   Estimated             Clock      Clock
Starting Clock   Frequency   Frequency   Period      Period      Slack     Type       Group
-------------------------------------------------------------------------------------------------------
exam1|clk        85.0 MHz    78.8 MHz    11.765      12.688      -0.924    inferred   Inferred_clkgroup_0
System           85.0 MHz    86.4 MHz    11.765      11.572      0.193     system     default_clkgroup
===========================================================
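The slack column can be reproduced with simple arithmetic: slack is the requested period minus the estimated period, and a negative value means the constraint is missed. A minimal Python sketch follows; the function names are hypothetical. Note that the printed periods give about -0.923 ns for exam1|clk, while the report shows -0.924, presumably because the tool works from unrounded values.

```python
def period_ns(freq_mhz):
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / freq_mhz

def slack_ns(requested_period_ns, estimated_period_ns):
    """Positive slack: timing met. Negative slack: timing violated."""
    return requested_period_ns - estimated_period_ns

print(round(period_ns(85.0), 3))              # requested period for 85 MHz
print(round(slack_ns(11.765, 12.688), 3))     # exam1|clk: negative, constraint missed
print(round(slack_ns(11.765, 11.572), 3))     # System: positive, constraint met
```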
Implementation
• After synthesis the entire implementation process is performed by FPGA vendor tools
Mapping
[Figure: mapping example with LUT0–LUT5 and FF1, FF2]
Placing
[Figure: FPGA; labels: CLB, SLICES]
Routing
[Figure: FPGA; label: Programmable Connections]
Map report header

Release 7.1.03i Map H.41
Xilinx Mapping Report File for Design 'exam1'
Design Information
------------------
Command Line : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device : xc2s200
Target Package : fg256
Target Speed : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date : Wed Nov 02 11:15:15 2005
Map report
Design Summary
--------------
Number of errors: 0
Number of warnings: 0
Logic Utilization:
Number of Slice Flip Flops: 144 out of 4,704 3%
Number of 4 input LUTs: 173 out of 4,704 3%
Logic Distribution:
Number of occupied Slices: 145 out of 2,352 6%
Number of Slices containing only related logic: 145 out of 145 100%
Number of Slices containing unrelated logic: 0 out of 145 0%
*See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs: 210 out of 4,704 4%
Number used as logic: 173
Number used as a route-thru: 5
Number used as 16x1 RAMs: 32
Number of bonded IOBs: 74 out of 176 42%
Number of GCLKs: 1 out of 4 25%
Number of GCLKIOBs: 1 out of 4 25%
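The utilization figures in the summary can be cross-checked with a few lines of Python. In this report the percentages appear to be truncated rather than rounded (173 of 4,704 is about 3.7% but is printed as 3%); that is an observation from these numbers, not documented tool behavior, and the helper name is hypothetical.

```python
def utilization_percent(used, available):
    """Whole-number utilization, truncated as the figures in this report suggest."""
    return 100 * used // available

print(utilization_percent(144, 4704))   # Slice flip-flops
print(utilization_percent(173, 4704))   # 4-input LUTs used as logic
print(utilization_percent(145, 2352))   # occupied slices
print(utilization_percent(74, 176))     # bonded IOBs

# The total LUT count is the sum of the three usage classes
print(173 + 5 + 32)                     # logic + route-thru + 16x1 RAMs
```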
Place & route report

Timing Score: 0
Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.
--------------------------------------------------------------------------------
Constraint | Requested | Actual | Logic
| | | Levels
--------------------------------------------------------------------------------
TS_clk = PERIOD TIMEGRP "clk" 11.765 ns | 11.765ns | 11.622ns | 13
HIGH 50% | | |
--------------------------------------------------------------------------------
OFFSET = OUT 11.765 ns AFTER COMP "clk" | 11.765ns | 11.491ns | 1
--------------------------------------------------------------------------------
OFFSET = IN 11.765 ns BEFORE COMP "clk" | 11.765ns | 11.442ns | 2
--------------------------------------------------------------------------------
Post layout timing report

Timing summary:
---------------
Timing errors: 0 Score: 0
Constraints cover 42912 paths, 0 nets, and 1038 connections
Design statistics:
Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
Minimum input required time before clock: 11.442ns
Minimum output required time after clock: 11.491ns
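The maximum frequency in the report is just the reciprocal of the minimum period. A minimal Python sketch (function names are hypothetical) that also computes the margin against the 11.765 ns PERIOD constraint from the place & route report:

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a minimum period in ns."""
    return 1000.0 / min_period_ns

def margin_ns(requested_period_ns, actual_period_ns):
    """Time left over after the critical path; positive means the constraint is met."""
    return requested_period_ns - actual_period_ns

print(round(max_frequency_mhz(11.622), 3))    # MHz
print(round(margin_ns(11.765, 11.622), 3))    # ns of margin
```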
Post-place-and-route simulation
• After place-and-route is performed, post-place-and-route simulation can be done
– Now have real timing information!
– Also can do static timing analysis: shows the worst case critical path in circuit
Configuration
• Once a design is implemented, you must create a file that the FPGA can understand
– This file is called a bit stream: a BIT file (.bit extension)
• The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
Configuration of SRAM based FPGAs

System Gates vs. Real Gates

• One common metric used to measure the size of a device in the ASIC world is that of equivalent gates (e-gates)
• Convention used:
• A 2-input NAND function represents one equivalent gate.
• An equivalent gate consists of an arbitrary number of transistors.
• Different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors (difficult to compare capacity/complexity)
• Solution: Assign each function an equivalent gate value and sum all these values.
• How can we establish a basis for comparison between FPGAs and ASICs?
• Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?
FPGAs: System Gates

• System Gates: A 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates.
• Rule of thumb?
• Divide the system gates value by three, so three million FPGA system gates would equate to one million ASIC equivalent gates!!
• However, to make comparisons between two different
implementations on an FPGA (i.e. Floating point adder vs. Fixed
point adder) designers should use the resources available in an
FPGA:
• Number of 4-input LUTs used
• Number of embedded multipliers
• Number of embedded RAM blocks
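The divide-by-three rule of thumb is easy to apply in code. A minimal Python sketch follows; the function names and the FPGA sizes used in the fit check are illustrative assumptions, and the result is only as good as the rule of thumb itself.

```python
def asic_equivalent_gates(system_gates, divisor=3):
    """Rule of thumb: ASIC equivalent gates = FPGA system gates / 3."""
    return system_gates // divisor

def fits(asic_gates, fpga_system_gates, divisor=3):
    """Rough check whether an ASIC design could fit a given FPGA."""
    return asic_gates <= asic_equivalent_gates(fpga_system_gates, divisor)

print(asic_equivalent_gates(3_000_000))   # three million system gates
print(fits(500_000, 1_500_000))           # hypothetical 1.5M system-gate FPGA
print(fits(500_000, 1_000_000))           # hypothetical 1M system-gate FPGA
```

For real sizing decisions, the LUT, multiplier and RAM block counts listed above remain the better basis for comparison.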
State-of-the-Art FPGAs
• 65-90 nm process on 300 mm wafers
• Lower cost per function (LUT + register)
• Smaller and faster transistors: Higher speed
• System speed up to 500 MHz
• Mainly through smart interconnects, clock management,
dedicated circuits, flexible I/O.
• Integrated transceivers running at 10 Gigabits/sec
• More Logic and Better Features:
• >100,000 LUTs & flip-flops
• >200 embedded RAMs, and same number 18 x 18 multipliers
• 1156 pins (balls) with >800 GP I/O
• 50 I/O standards, incl. LVDS with internal termination
• 16 low-skew global clock lines
• Multiple clock management circuits
• On-chip microprocessor(s) and multi-Gbps transceivers
Latest Devices: Capacity & Features

Xilinx Virtex-5
• 65nm process
• Up to 960 I/Os
• >200000 logic cells
• Up to 552 18kb block RAMs (~10Mb RAM)
• 450 DSP slices (18x18 multiplier-accumulator)
• 20 digital clock managers (DCM)
• 24 high-speed serial transceivers (622Mb/s to 11.1Gb/s)
• Up to four PowerPC 405 cores
Altera Stratix-II
• 90nm process
• Up to 1170 I/Os
• 179000 logic elements
• 9.6Mb embedded RAM
• 96 DSP blocks: 380 18x18 multipliers
• 12 PLLs
• Serial I/O up to 1Gb/s
• No hard processor cores
FPGAs Becoming More Attractive
[Figure: FPGA capacity, speed and price by year, 1/91 to 1/99: 21X bigger, 5.5X faster, 50X less expensive. Source: Xilinx]
FPGA Shortcomings
• Circuit Delay
• Delay increases due to programmable switches in the
FPGA routing architecture
• Area
• Configuration cells and programmable resources
incur substantial area penalty
• Power
• Typically not suited for low power applications
[Figure: ASIC vs. FPGA compared on performance, cost and time to market; label: Need to improve]
Conclusion
• FPGAs are the main enabler of Reconfigurable Computing Systems (RCS)
• FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs.
– Advantages: Flexible, programmable
– Disadvantages: Power dissipation, performance w.r.t. ASIC
• Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
• RCS can be realized with several technologies:
– FPGAs: Fine/Medium Grain
– Coarse Grain Reconfigurable Architectures: CGRAs