EE 811
Advanced Digital System Design
Technology Timeline
[Figure: timeline from 1945 to 2000 showing when each technology appeared: transistors, ICs (general), SRAMs & DRAMs, microprocessors, SPLDs, CPLDs, ASICs, FPGAs.]
4/4/2011
Antifuse and flash-based FPGAs
Actel Corp. – www.actel.com
QuickLogic Corp. – www.quicklogic.com
Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
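[Figure: FPGA vendor market-share charts with segments for Xilinx, Lattice, Actel, QuickLogic, and Other. Percentage labels present in the extraction: 58%, 51%, 33%, 31%, 11%, 7%, 5%, 2%, 2%; their assignment to vendors is not recoverable.]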
FPGA Families
Vendor | Low-cost | High-performance
Xilinx | Spartan 3, Spartan 3E, Spartan 3L | Virtex 4 LX / SX / FX, Virtex 5 LX
Altera | Cyclone II | Stratix II, Stratix II GX
Xilinx
• Primary products: FPGAs and the associated CAD software
[Figure labels: Programmable Logic Devices; ISE Alliance and Foundation Series Design Software]
• Main headquarters in San Jose, CA
• Fabless* Semiconductor and Software Company
• UMC (Taiwan) {*Xilinx acquired an equity stake in UMC in 1996}
• Seiko Epson (Japan)
• TSMC (Taiwan)
Source: [Xilinx Inc.]
Xilinx FPGA
[Figure: device floorplan showing Configurable Logic Blocks, Block RAMs, and I/O Blocks, together with the routing fabric: wire segments, switch blocks, routing channels, and I/O pads.]
Xilinx CLB
[Figure: hierarchy of a configurable logic block (CLB): each CLB contains slices (four shown), and each slice contains logic cells (two shown).]
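[Figure: simplified logic cell with a 4-input LUT (inputs a, b, c, d; output y), a mux (with input e), and a flip-flop (output q) with clock, clock enable, and set/reset inputs. The figure also labels carry inputs (CIN) and connections to general routing resources.]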
Look-Up Tables
• Combinatorial logic is stored in Look-Up Tables (LUTs)
– Also called Function Generators (FGs)
– Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant
[Figure: a block of combinatorial logic with inputs A, B, C, D and output Z, implemented by the truth table below.]
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
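The LUT idea can be sketched in a few lines of Python. This is only a behavioral illustration, not vendor code: `make_lut4`, `lut4_read`, and `example_z` are hypothetical names, and `example_z` is an assumed function that merely agrees with the rows visible in the truth table (the elided rows are not known).

```python
def make_lut4(func):
    """Precompute the 16 configuration bits of a 4-input LUT from a Boolean function."""
    table = []
    for addr in range(16):
        a, b, c, d = (addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        table.append(func(a, b, c, d) & 1)
    return table


def lut4_read(table, a, b, c, d):
    """Evaluate the LUT: the four inputs form an address, the output is the stored bit."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


def example_z(a, b, c, d):
    """Assumed example function; it matches the visible truth-table rows only."""
    return ((1 - a) & b) | (c & d)


def xor4(a, b, c, d):
    """A different 4-input function; it needs the same 16 bits of storage."""
    return a ^ b ^ c ^ d


LUT_Z = make_lut4(example_z)
LUT_XOR = make_lut4(xor4)
```

Both tables hold exactly 16 bits and are read with a single lookup, which is why capacity depends on the number of inputs and the delay does not depend on the function.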
[Figure: dedicated multiplexers in a CLB with slices S0 to S3. Each slice has an F5 mux; MUXF6 combines slices S2 and S3 and drives the F6 output.]
Arithmetic logic
– Dedicated XOR gate for single-level sum completion
– Uses dedicated routing resources
– All synthesis tools can infer carry logic
[Figure: CLB with slices S0 to S3 and two carry chains (first and second), each running from CIN to COUT.]
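A rough behavioral sketch of what the carry chain computes, in Python. The function name is hypothetical, and the carry mux selecting between the incoming carry and an operand bit is an assumption based on the usual slice structure; the slide itself only names the dedicated XOR gate.

```python
def carry_chain_add(a, b, width, cin=0):
    """Add two unsigned integers bit by bit the way a slice carry chain would.

    Per bit: the LUT computes the propagate signal p = a_i XOR b_i, the
    dedicated XOR gate forms the sum bit p XOR carry_in, and a carry mux
    (assumed) passes carry_in on when p is 1, or a_i when p is 0.
    """
    total = 0
    carry = cin
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p = a_i ^ b_i              # LUT output (propagate)
        s = p ^ carry              # dedicated XOR gate: single-level sum completion
        carry = carry if p else a_i  # carry mux feeding the next bit (COUT)
        total |= s << i
    return total, carry
```

For example, `carry_chain_add(5, 3, 4)` gives a 4-bit sum of 8 with no carry out, and `carry_chain_add(15, 1, 4)` wraps to 0 with a carry out of 1.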
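Shift Register LUT
• Dynamically addressable delay up to 16 cycles
• For programmable pipeline
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
[Figure: a LUT used as a shift register with inputs D and CE, address DEPTH[3:0], and output Q; flip-flop symbols with D, CE, PRE, CLR, and Q pins; final output OUT.]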
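The dynamically addressable delay can be modelled behaviorally as below. This is a sketch under stated assumptions: the class name `SRL16` and its methods are illustrative, and the convention that `DEPTH = 0` selects a one-cycle delay is assumed rather than taken from the slide.

```python
class SRL16:
    """Behavioral model of a LUT used as a 16-bit shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently shifted-in value

    def clock(self, d, ce=1):
        """One clock edge: shift d in when clock enable is high."""
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, depth):
        """Read the tap selected by DEPTH[3:0]; depth 0 is a 1-cycle delay, 15 a 16-cycle delay."""
        return self.bits[depth & 0xF]


srl = SRL16()
srl.clock(1)            # shift in a single 1
for _ in range(8):
    srl.clock(0)        # followed by eight 0s: the 1 is now nine cycles old
```

Changing `depth` between reads changes the delay without reloading the register, which is what makes the pipeline length programmable.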
Shift Register
• Register-rich FPGA
– Allows for addition of pipeline stages to increase throughput
• Data paths must be balanced to keep desired functionality
[Figure, before balancing: one path (labelled 64) runs through Operation A (4 cycles) and Operation B (8 cycles), 12 cycles in total; the other path (labelled 64) runs through Operation C (3 cycles), leaving a 9-cycle imbalance.]
[Figure, after balancing: Operation D - NOP (9 cycles) follows Operation C, so both paths take 12 cycles. Paths are statically balanced.]
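The balancing arithmetic from the figure can be written out as a small Python sketch. The helper names are hypothetical, and two things are assumed: that the "64" label is the bus width in bits, and that the NOP delay is built from 16-deep shift-register LUTs, one chain per bit.

```python
import math


def balance_delay(path_a_cycles, path_b_cycles):
    """Number of delay cycles to add to the shorter path."""
    return abs(path_a_cycles - path_b_cycles)


def srl16_count(delay_cycles, bus_width, max_depth=16):
    """Shift-register LUTs needed to delay a bus: one chain per bit, cascaded if needed."""
    if delay_cycles == 0:
        return 0
    return bus_width * math.ceil(delay_cycles / max_depth)


long_path = 4 + 8     # Operation A + Operation B
short_path = 3        # Operation C
nop_delay = balance_delay(long_path, short_path)   # Operation D - NOP
luts = srl16_count(nop_delay, bus_width=64)        # assumed 64-bit bus
```

Under these assumptions the 9-cycle NOP fits in a single shift-register LUT per bit, so the whole delay costs 64 LUTs rather than 9 x 64 flip-flops.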
• CLB LUT configurable as Distributed RAM
• Synchronous write
• Asynchronous read
– Can create a synchronous read
– Naturally, distributed RAM read is asynchronous
• Two LUTs can make
– 32 x 1 single-port RAM
– 16 x 2 single-port RAM
– 16 x 1 dual-port RAM
[Figure: a LUT used as RAM (address A0 to A3, WE, WCLK, output O); two LUTs as a RAM16X2S (data inputs D0 and D1); or two LUTs as a dual-port RAM with address A0 to A3 and output SPO, plus read address DPRA0 to DPRA3 and output DPO.]
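A minimal Python model of the 16 x 1 dual-port case, to make the synchronous-write and asynchronous-read behaviour concrete. The class and method names are illustrative only; the port names SPO, DPO, and DPRA follow the figure labels.

```python
class DualPortRam16x1:
    """16 x 1 dual-port distributed RAM: synchronous write, asynchronous reads."""

    def __init__(self):
        self.mem = [0] * 16

    def clock(self, we, a, d):
        """Rising WCLK edge: write d at address a only when WE is high."""
        if we:
            self.mem[a & 0xF] = d & 1

    def spo(self, a):
        """Single-port output: combinational read at the write address A[3:0]."""
        return self.mem[a & 0xF]

    def dpo(self, dpra):
        """Dual-port output: combinational read at the independent address DPRA[3:0]."""
        return self.mem[dpra & 0xF]


ram = DualPortRam16x1()
ram.clock(we=1, a=5, d=1)   # stored on the clock edge
ram.clock(we=0, a=5, d=0)   # ignored because WE is low
```

Reads are plain function calls with no clock, mirroring the asynchronous read; registering the result in a flip-flop would give a synchronous read.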
Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B.]
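Block RAM aspect ratios shown in the figure:
– 16k x 1 (addresses 0 to 16,383)
– 8k x 2 (addresses 0 to 8,191)
– 4k x 4 (addresses 0 to 4,095)
– 2k x (8+1) (addresses 0 to 2047)
– 1024 x (16+2) (addresses 0 to 1023)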
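The aspect ratios above all describe the same memory. The Python sketch below derives the address width and capacity of each configuration; the names are hypothetical, and reading the "+1" and "+2" terms as parity bits is an assumption.

```python
# (depth, data bits, extra bits) for the configurations listed above;
# the extra bits are assumed to be parity
ASPECT_RATIOS = [
    (16 * 1024, 1, 0),
    (8 * 1024, 2, 0),
    (4 * 1024, 4, 0),
    (2 * 1024, 8, 1),
    (1024, 16, 2),
]


def describe(depth, data_bits, extra_bits):
    """Derive address width and capacities for one aspect ratio."""
    return {
        "address_bits": depth.bit_length() - 1,   # depth is a power of two
        "last_address": depth - 1,
        "data_capacity": depth * data_bits,
        "total_capacity": depth * (data_bits + extra_bits),
    }


SUMMARY = [describe(*cfg) for cfg in ASPECT_RATIOS]
```

Every configuration stores 16,384 data bits; the two widest ones reach 18,432 bits once the extra bits are counted.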
[Figure: block RAM Port B with 8K-bit depth and 1-bit width. Port B inputs: WEB, ENB, RSTB, CLKB, ADDRB[12:0], DIB[0]. Port B output: DOB[0]. The figure also carries the address label "1, ADDR[12:0]".]
Embedded Multipliers
• Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
• Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic-based applications).
18 x 18 Embedded Multiplier
• Fast arithmetic functions
– Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
• Fully combinational
• Optional registers with CE & RST (pipeline)
• Independent from adjacent block RAM
18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
– 2’s complement signed operation
• Multipliers are organized in columns
[Figure: Data_A (18 bits) and Data_B (18 bits) feed the 18 x 18 multiplier, which produces a 36-bit output.]
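The signed operation can be illustrated with a short Python model. It is a behavioral sketch with hypothetical function names, not a description of the hardware implementation: operands are taken as 18-bit two's complement patterns and the result is returned as a 36-bit pattern.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mult18x18(data_a, data_b):
    """Multiply two 18-bit two's complement operands; return the 36-bit result pattern."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)
```

For instance, the pattern `0x3FFFF` is -1 in 18 bits, so `to_signed(mult18x18(0x3FFFF, 2), 36)` evaluates to -2. The largest magnitude product, (-2^17) x (-2^17) = 2^34, still fits in the 36-bit signed output.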
Positions of Multipliers
Special clock
pin and pad
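[Figure: simplified IOB. Output path: a flip-flop (D, Q, EC, SR) with an FF enable drives the output. Input path: a direct input, plus a registered input through a flip-flop (D, Q, EC, SR) with an FF enable.]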
IOB Functionality
• IOB provides interface between the package
pins and CLBs
• Each IOB can work as uni- or bi-directional I/O
• Outputs can be forced into High Impedance
• Inputs and outputs can be registered
– advised for high-performance I/O
• Inputs can be delayed
Spartan 3 Family
Attributes
FPGA Nomenclature
• Virtex-II / Pro
– 44,000 Logic Slices
– 444 18Kbits BRAMs
– 444 18x18 Multipliers
– 2 PowerPC Processors
– 20 Gbit I/O
– 1164 Max User I/O
• 1 to 4 PowerPCs
• 4 to 16 multi-gigabit transceivers
• 12 to 216 multipliers
• 3,000 to 50,000 logic cells
• 200k to 4M bits RAM
• 204 to 852 I/Os
Soft Core
• As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
• Soft cores are simpler (more primitive) and slower than their
hard-core counterparts.
ADVANTAGE?
Virtex Architectures
Built for high-performance applications
Basic Architecture 74
XtremeDSP Functionality - Embedded multipliers
XCITE Digitally Controlled Impedance - Any I/O
DCM™ Digital Clock Management - 12
Virtex-4 Family
Advanced Silicon Modular BLock (ASMBL) Architecture
Optimized for logic, Embedded, and Signal Processing
Resource | LX | FX | SX
Logic | 14K–200K LCs | 12K–140K LCs | 23K–55K LCs
Memory | 0.9–6 Mb | 0.6–10 Mb | 2.3–5.7 Mb
SelectIO | 240–960 | 240–896 | 320–640
RocketIO | N/A | 0–24 Channels | N/A
PowerPC | N/A | 1 or 2 Cores | N/A
Ethernet MAC | N/A | 2 or 4 Cores | N/A
Virtex-4 Architecture
• RocketIO™ Multi-Gigabit Transceivers: 622 Mbps–10.3 Gbps
• Smart RAM: New block RAM/FIFO
• Xesium Clocking Technology: 500 MHz
• Advanced CLBs: 200K Logic Cells
• Tri-Mode Ethernet MAC: 10/100/1000 Mbps
• XtremeDSP™ Technology Slices: 256 18x18 GMACs
• 1 Gbps SelectIO™: ChipSync™ Source synch, XCITE Active Termination
• PowerPC™ 405 with APU Interface: 450 MHz, 680 DMIPS
Virtex-5 Family
Optimized for logic, Embedded, Signal Processing, and High-Speed Connectivity
Virtex-5 Architecture
• Enhanced: 36Kbit Dual-Port Block RAM / FIFO with Integrated ECC
• New: Most Advanced High-Performance Real 6LUT Logic Fabric
• 550 MHz Clock Management Tile with DCM and PLL
• PCI Express® Endpoint Block
The Spartan-3 Family
Built for high volume, low-cost applications
[Figure: Spartan-3 device with its I/O banks; the labels Bank 1, Bank 2, and Bank 3 survive in the extraction.]
• 4 I/O Banks, support for all I/O standards including PCI, DDR333, RSDS, mini-LVDS
• Up to eight on-chip Digital Clock Managers to support multiple system clocks
Spartan-3 Family
Based upon Virtex-II Architecture – Optimized for Lower Cost
• Slices are grouped in pairs
– Left-hand SLICEM (Memory)
• LUTs can be configured as memory or SRL16
– Right-hand SLICEL (Logic)
• LUT can be used as logic only
[Figure: CLB with slices X0Y0, X0Y1, and X1Y0 next to a switch matrix, showing fast connects, SHIFTIN and SHIFTOUT, and carry inputs (CIN).]
Spartan-3E Features
• More gates per I/O than Spartan-3
• Removed some I/O standards
– Higher-drive LVCMOS
– GTL, GTLP
– SSTL2_II
– HSTL_II_18, HSTL_I, HSTL_III
– LVDS_EXT, ULVDS
• DDR Cascade
– Internal data is presented on a single clock edge
• 16 BUFGMUXes on left and right sides
– Drive half the chip only
– In addition to eight global clocks
• Pipelined multipliers
• Additional configuration modes
– SPI, BPI
– Multi-Boot mode
Spartan-3A DSP
Tuning DSP Performance
DSP48 Comparison
Function | DSP48 | DSP48E | DSP48A | Benefit
Pre-Adder | No | No | Yes | Reduces the critical path timing in FIR filter applications for better performance. Important in FIR filter construction.
Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters.
Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition, and the very valuable rounding of multiplication away from zero.
Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions.
Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables the selection of ALU function on a clock cycle basis. Enables multiple functions to be selected (Add, Subtract, or Compare).
Pattern Detect | No | Yes | No | This feature supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed limiting path.
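Why a pre-adder matters for FIR filters can be shown with a small Python sketch. It assumes a symmetric (linear-phase) coefficient set, which is the common case the pre-adder targets; the function names are hypothetical and the code models arithmetic only, not the DSP48A block itself.

```python
def fir_direct(coeffs, window):
    """Reference FIR output: one multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_preadder(coeffs, window):
    """Symmetric FIR with a pre-adder: add mirrored samples first, then multiply once."""
    n = len(coeffs)
    assert coeffs == coeffs[::-1], "coefficients must be symmetric"
    acc = 0
    multiplies = 0
    for k in range(n // 2):
        acc += coeffs[k] * (window[k] + window[n - 1 - k])  # pre-add, multiply, accumulate
        multiplies += 1
    if n % 2:  # the middle tap of an odd-length filter has no mirror partner
        acc += coeffs[n // 2] * window[n // 2]
        multiplies += 1
    return acc, multiplies


COEFFS = [1, 2, 3, 2, 1]
WINDOW = [4, 5, 6, 7, 8]
```

With these sample values both versions produce 54, but the pre-adder version needs 3 multiplies instead of 5, so fewer multiplier blocks are required for the same filter.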
Latest Families
Architecture Alignment
Virtex-6 FPGAs Spartan-6 FPGAs
BlockRAM
DSP Slices
High-performance Clocking
[Figure: device sub-families Spartan-6 LXT, Virtex-6 LXT, and Virtex-6 SXT, with the captions "Lowest logic + high-speed serial" and "DSP + logic + serial connectivity".]
Designers Eccentrics
• Higher System Performance
– More design margin to simplify designs
– Higher integrated functionality
• Lower System Cost
– Reduce BOM
– Implement design in a smaller device & lower speed-grade
• Lower Power
– Help meet power budgets
– Eliminate heat sinks & fans
– Prevent thermal runaway
Virtex-6 Family
[Figure: Virtex families by process node, plotted against an axis running from 1st Generation to 6th Generation: Virtex (220-nm), Virtex-E (180-nm), Virtex-II (150-nm), Virtex-II Pro (130-nm), Virtex-4 (90-nm), Virtex-5 (65-nm).]
Spartan-6 Family
Spartan-6
• Next Generation 45nm Spartan Family
– Increased performance & density
– Evolutionary feature enhancements
– Dramatic cost & power reductions
Slice mix chosen for the optimal balance of Cost, Power & Performance
Memory Controller
• Only low cost FPGA with a “hard” memory controller
• Guaranteed memory interface performance providing
– Reduced engineering & board design time
– DDR, DDR2, DDR3 & LP DDR support
– Up to 12.8Gbps bandwidth for each memory controller
• Multiport structure for user interface
– Six 32-bit programmable ports from fabric
– Controller interface to 4, 8 or 16 bit memory devices
[Figure: Spartan-6 surrounded by memory types: SRAM, DRAM, DDR, DDR2, DDR3, LP DDR, FLASH, EEPROM.]
Compare to Spartan-3A
Twice the Capabilities, Half the Power, Hard Blocks!
Feature Extended Spartan-3A (90nm) Spartan-6 (45nm)
** All memory controllers support a x16 interface, except in the CS225 package, where only x8 is supported
[Figure: design flow stages: functional simulation, synthesis, post-synthesis simulation, configuration, on-chip testing.]
library ieee;
use ieee.std_logic_1164.all;

entity RC5_core is
    port(
        clock, reset, encr_decr: in std_logic;
        data_input: in std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full: in std_logic;
        key_input: in std_logic_vector(31 downto 0);
        key_read: out std_logic  -- last port: no trailing semicolon
    );
end RC5_core;
Logic Synthesis
signal A1:STD_LOGIC;
signal B1:STD_LOGIC;
signal Y1:STD_LOGIC;
signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
A1<=A when (NEG_A='0') else
not A;
B1<=B when (NEG_B='0') else
not B;
Y<=Y1 when (NEG_Y='0') else
not Y1;
end MLU_DATAFLOW;
Synthesis Tools
XST
… and others
• Interpret RTL code
• Synplify Pro: Produces synthesized circuit netlist in a standard
EDIF (.edf) format
– Can optionally produce .VHM (VHDL code merged into one) file
for post-synthesis simulation
• XST: Produces synthesized circuit netlist in NGC format
• Netlist is composed of gates in the particular Xilinx
implementation library
– http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has
information on libraries
• Give preliminary performance estimates
• Some can display circuit schematics corresponding to EDIF
netlist
Implementation
Mapping
[Figure: netlist elements LUT0 to LUT5 and flip-flops FF1 and FF2 being mapped onto FPGA logic resources.]
Placing
[Figure: FPGA with CLB slices.]
Routing
[Figure: FPGA with programmable connections.]
4/4/2011
Design Information
------------------
Command Line : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k
4
-cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device : xc2s200
Target Package : fg256
Target Speed : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date : Wed Nov 02 11:15:15 2005
Map report
Design Summary
--------------
Number of errors: 0
Number of warnings: 0
Logic Utilization:
Number of Slice Flip Flops: 144 out of 4,704 3%
Number of 4 input LUTs: 173 out of 4,704 3%
Logic Distribution:
Number of occupied Slices: 145 out of 2,352 6%
Number of Slices containing only related logic: 145 out of 145 100%
Number of Slices containing unrelated logic: 0 out of 145 0%
*See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs: 210 out of 4,704 4%
Number used as logic: 173
Number used as a route-thru: 5
Number used as 16x1 RAMs: 32
Number of bonded IOBs: 74 out of 176 42%
Number of GCLKs: 1 out of 4 25%
Number of GCLKIOBs: 1 out of 4 25%
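The utilization figures in the report can be checked with a few lines of Python. The names below are illustrative, and the use of truncating integer division is an inference from these particular numbers (173 out of 4,704 is reported as 3%), not documented mapper behaviour.

```python
# (used, available) pairs copied from the map report above
RESOURCES = {
    "Slice Flip Flops": (144, 4704),
    "4 input LUTs (logic)": (173, 4704),
    "occupied Slices": (145, 2352),
    "Total 4 input LUTs": (210, 4704),
    "bonded IOBs": (74, 176),
    "GCLKs": (1, 4),
}


def utilization_percent(used, available):
    """Integer percentage, truncated; this reproduces the figures printed in the report."""
    return used * 100 // available


PERCENT = {name: utilization_percent(u, a) for name, (u, a) in RESOURCES.items()}

# The LUT total is the sum of its three uses: logic + route-thru + 16x1 RAMs
TOTAL_LUTS = 173 + 5 + 32
```

The sum confirms the report's own bookkeeping: 173 LUTs as logic, 5 as route-thrus, and 32 as 16x1 RAMs add up to the 210 total.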
--------------------------------------------------------------------------------
Constraint | Requested | Actual | Logic
| | | Levels
--------------------------------------------------------------------------------
TS_clk = PERIOD TIMEGRP "clk" 11.765 ns | 11.765ns | 11.622ns | 13
HIGH 50% | | |
--------------------------------------------------------------------------------
OFFSET = OUT 11.765 ns AFTER COMP "clk" | 11.765ns | 11.491ns | 1
--------------------------------------------------------------------------------
OFFSET = IN 11.765 ns BEFORE COMP "clk" | 11.765ns | 11.442ns | 2
--------------------------------------------------------------------------------
Design statistics:
Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
Minimum input required time before clock: 11.442ns
Minimum output required time after clock: 11.491ns
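The relationship between the period figures in this report can be verified with a short Python sketch (helper names are hypothetical): maximum frequency is the reciprocal of the minimum period, and slack is the requested period minus the achieved one.

```python
def max_frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def slack_ns(requested_ns, actual_ns):
    """Positive slack means the constraint is met with margin to spare."""
    return requested_ns - actual_ns


FMAX = round(max_frequency_mhz(11.622), 3)     # achieved minimum period
TARGET = round(max_frequency_mhz(11.765), 3)   # requested PERIOD constraint
SLACK = round(slack_ns(11.765, 11.622), 3)
```

The achieved period of 11.622 ns corresponds to the reported 86.044 MHz, slightly above the roughly 85 MHz implied by the 11.765 ns constraint, leaving 0.143 ns of slack.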
Post-place-and-route simulation
• After place-and-route is performed, can do post-place-and-route simulation
– Now have real timing information!
– Also can do static timing analysis: shows the
worst case critical path in circuit
Configuration
• Once a design is implemented, you must create
a file that the FPGA can understand
– This file is called a bitstream: a BIT file (.bit extension)
State-of-the-Art FPGAs
• 65-90 nm process on 300 mm wafers
• Lower cost per function (LUT + register)
• Smaller and faster transistors: Higher speed
• System speed up to 500 MHz
• Mainly through smart interconnects, clock management,
dedicated circuits, flexible I/O.
• Integrated transceivers running at 10 Gigabits/sec
• More Logic and Better Features:
• >100,000 LUTs & flip-flops
• >200 embedded RAMs, and the same number of 18 x 18 multipliers
• 1156 pins (balls) with >800 GP I/O
• 50 I/O standards, incl. LVDS with internal termination
• 16 low-skew global clock lines
• Multiple clock management circuits
• On-chip microprocessor(s) and multi-Gbps transceivers
• 24 high-speed serial transceivers (622Mb/s to 11.1Gb/s) versus serial I/O up to 1Gb/s
• Up to four PowerPC 405 cores versus no hard processor cores
• Capacity: 21 X Bigger
• Speed: 5.5 X Faster
• Price: 50 X Less Expensive
FPGA Shortcomings
• Circuit Delay
• Delay increases due to programmable switches in the
FPGA routing architecture
• Area
• Configuration cells and programmable resources
incur substantial area penalty
• Power
• Typically not suited for low power applications
Performance Cost Time to market
Conclusion