
Programmable Logic Devices

Tutorial 8
Michal Kubíček
Department of Radio Electronics, FEEC BUT Brno
Created with the support of the OP VVV project Moderní a otevřené studium techniky, CZ.02.2.69/0.0/0.0/16_015/0002430.

FPGAs in detail
❑ Logic cells
❑ FPGA architecture
❑ Memories in FPGA

page 2 kubicek@vutbr.cz
Logic cells
LUT / MUX technology

Logic cells

FPGA versus CPLD


❑ The same basic structure: configurable logic cells connected with a
programmable interconnect structure, complemented with configurable IO cells.

FPGA CPLD

Logic cells

FPGA versus CPLD


So what is the (main) difference?
CPLD cell: coarse grain architecture
FPGA cell: fine grain architecture
[Figure: in both cells, Inputs feed a Function generator followed by a Flip-Flop]

Logic cells

Logic cell
composed of combinatorial and sequential block

CPLD

Logic cells

Logic cell
composed of combinatorial and sequential block

FPGA

Logic cells

Logic cell
composed of combinatorial and sequential block
❑ FPGA architecture is medium-grained: each cell performs a relatively simple logic function.
❑ This leads to better area utilization and a higher maximum frequency, but places higher demands on the programmable interconnect.
❑ Granularity: a CPLD is considered coarse-grained, while an ASIC is considered fine-grained.

Combinatorial function generator

Logic cells

MAP process
RTL Schematic Technology Schematic

Logic cells

MAP result: each LUT has its content (a defined logic function)
O = ((I0 * I1 * !I2 * !I3) + (!I0 * !I1 * I2 * I3) + (I0 * I1 * I2 * I3) + (!I0 * !I1 * !I2 * !I3));
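
The LUT content for the equation above can be written out as a 16-bit truth table. The following is a minimal sketch, not the tool's actual output: the entity name `lut4_example` and the assumption that I3 is the most significant address bit are illustrative only. The four product terms correspond to input values 3, 12, 15 and 0, so bits 3, 12, 15 and 0 of the table are set, which gives X"9009".

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical entity: a 4-input LUT modelled as a constant 16 x 1b table
entity lut4_example is
  port (
    I : in  std_logic_vector(3 downto 0);  -- I(3) = I3 ... I(0) = I0 (assumed ordering)
    O : out std_logic
  );
end lut4_example;

architecture rtl of lut4_example is
  -- Bit k holds the output for the input combination with binary value k.
  -- Set bits: 0 (all inputs 0), 3 (I0 = I1 = 1), 12 (I2 = I3 = 1), 15 (all inputs 1)
  constant INIT : std_logic_vector(15 downto 0) := X"9009";
begin
  -- The inputs act as an address into the table
  O <= INIT(to_integer(unsigned(I)));
end rtl;
```

The logic function is changed only by changing the constant; the structure (an addressable table) stays the same, which is what makes a LUT a universal function generator.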
Logic cells

Combinatorial portion of the logic cell


[Figure: FPGA logic cell with Inputs, a Function generator and a Flip-Flop]

Two variants:
• based on multiplexers (MUX based)
• based on look-up table (LUT based)

Logic cells
"Y = 1 when C = 1, or when A = 1 and B = 1"

MUX based LUT based
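
The two variants can be compared on the example function Y = C or (A and B). The sketch below is illustrative only: the entity name `y_func`, the two separate outputs and the address ordering C, B, A are assumptions, and a synthesizer is free to map both descriptions to the same hardware.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical entity describing the same function in two styles
entity y_func is
  port (
    A, B, C : in  std_logic;
    Y_mux   : out std_logic;  -- multiplexer-style description
    Y_lut   : out std_logic   -- look-up-table-style description
  );
end y_func;

architecture rtl of y_func is
  -- Truth table addressed by (C, B, A): Y = 1 for addresses 3, 4, 5, 6, 7
  constant INIT : std_logic_vector(7 downto 0) := "11111000";
  signal and_ab : std_logic;
  signal addr   : std_logic_vector(2 downto 0);
begin
  -- MUX based: A selects between B and constant '0' (gives A and B),
  -- then C selects between that result and constant '1'
  and_ab <= B when A = '1' else '0';
  Y_mux  <= '1' when C = '1' else and_ab;

  -- LUT based: the inputs form an address into a stored truth table
  addr  <= C & B & A;
  Y_lut <= INIT(to_integer(unsigned(addr)));
end rtl;
```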

Logic cells

Combinatorial portion of the logic cell


❑ The MUX structure is better suited for implementing data control and switching logic.
❑ The LUT structure is MUCH better for implementing arithmetic functions. Moreover, it is easier for automatic synthesis tools (Computer Aided Design; CAD) to handle.

The LUT architecture has dominated since the early 1990s, and virtually no other FPGAs are available today (which does not mean that MUX-based FPGAs will not appear again sometime in the future).

Logic cells

Optimum LUT size (number of inputs)


LUT size depends directly on the number of its inputs (an n-input LUT needs 2^n memory cells):
• 3 inputs => 8 memory cells
• 4 inputs => 16 memory cells (Spartan-3, Virtex-II and Virtex-4)
• 5 inputs => 32 memory cells
• 6 inputs => 64 memory cells (Spartan-6, Virtex-5,6,7)
• 10 inputs => 1024 memory cells
More inputs ➔ larger address decoder (AND stage) ➔ slower propagation (delay)
Large LUTs are inefficient for implementing simple logic functions (with few inputs).
Some of the first FPGAs featured a heterogeneous structure composed of both 3-input and 4-input LUTs, with the aim of better utilizing the silicon area. However, this structure was not well suited to the rather simple implementation tools (synthesizers, mappers) of the day.

Logic cells

Optimum LUT size (number of inputs)


So small LUTs are better because they are faster and more efficient, right?
No, not really: for more complex logic functions (with more inputs) it is necessary to chain several simple LUTs ➔ the programmable interconnect must be used ➔ the propagation delay rises significantly!
A compromise is needed; the choice is up to the FPGA manufacturers.
CPLD: each logic cell can implement a complex logic function ➔ a CPLD is efficient for a small number of complex functions. When a function is so complex that it cannot fit into a single cell, several cells must be chained. Each cell by itself already has a relatively large propagation delay ➔ propagation through several cells severely degrades system performance (FMAX).
FPGA: even relatively simple logic functions must use several logic cells. BUT there is a huge number of Flip-Flops available in an FPGA ➔ it is easy and "cheap" to use intensive pipelining ➔ logic cells are used efficiently while system performance is not affected.
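
The pipelining argument can be illustrated with a 16-input parity function, which cannot fit into a single 4-input LUT. This is a minimal sketch under stated assumptions: the entity name `parity16_pipe`, the 4-input LUT target and the two-stage split are chosen for illustration, and the actual mapping is decided by the implementation tools.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical entity: 16-input parity split into two registered stages
entity parity16_pipe is
  port (
    clk    : in  std_logic;
    d      : in  std_logic_vector(15 downto 0);
    parity : out std_logic  -- valid two clock cycles after d is applied
  );
end parity16_pipe;

architecture rtl of parity16_pipe is
  signal stage1 : std_logic_vector(3 downto 0) := (others => '0');
begin
  pipe_proc : process (clk) begin
    if rising_edge(clk) then
      -- Stage 1: four independent 4-input XORs, each sized for one 4-input LUT,
      -- each followed by the flip-flop of the same logic cell
      for i in 0 to 3 loop
        stage1(i) <= d(4*i) xor d(4*i+1) xor d(4*i+2) xor d(4*i+3);
      end loop;
      -- Stage 2: one more 4-input XOR combines the registered partial results
      parity <= stage1(0) xor stage1(1) xor stage1(2) xor stage1(3);
    end if;
  end process pipe_proc;
end rtl;
```

Between any two registers there is only one LUT level, so the clock period is limited by a single LUT plus routing delay rather than by two chained LUTs; the cost is one extra clock cycle of latency.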

Logic cells

Optimum LUT size (number of inputs)


Spartan-3: 4-input LUT

Logic cells

Optimum LUT size (number of inputs)


Virtex-5: 6-input LUT

Logic cells

Optimum LUT size (number of inputs)


Virtex-7U: 6-input LUT / 2x5-input LUT

Logic cells

Optimum LUT size (number of inputs)


Today, several small LUTs are usually grouped into one logic cell instead of using a single large LUT ➔ a more efficient structure (which places higher demands on the implementation tools).

Adaptive Logic Module (ALM)


Stratix-10 (Intel)

Logic cells

Optimum LUT size (number of inputs)


Today, several small LUTs are usually grouped into one logic cell instead of using a single large LUT.

Adaptive Logic Module (ALM)

Stratix-10 (Intel)

Real implementation: one pseudo 6-input LUT is composed of four 4-input LUTs

Logic cells

Optimum LUT size (number of inputs)


Today, several small LUTs are usually grouped into one logic cell instead of using a single large LUT.

Adaptive Logic Module (ALM)

Stratix-10 (Intel)

The result: a very flexible (efficient) structure

Alternative use of a LUT
LUT = only a logic function generator?

LUT: other functions

Alternative LUT functions


Each 4-input LUT is composed of 16 SRAM cells = 16 registers (flip-flops).
A small modification of the LUT (internal interconnection of the registers) enables alternative uses of these registers (most FPGAs support this today).
Not every LUT in the FPGA is capable of all the alternative functions (this depends on the particular FPGA).

A LUT4 can operate as:
• Comb: combinatorial function of four input variables and one output variable.
• RAM: 16b RAM memory ("distributed RAM").
• Shift REG: up to 16b dynamic-length shift register; the input signals are used to adjust the shift register length.
LUT = RAM/ROM

LUT: other functions

LUT as a RAM/ROM
Distributed RAM/ROM
Small and fast memories
[Figure: several 16 x 1b LUT memories with common Address and Data lines combined into a 16 x 4b (64b) memory]

Theoretical maximum of distributed RAM in some FPGAs:
Spartan-3 xc3s500: 73 kb
Virtex-7 UltraScale+: 46 Mb

LUT: other functions

LUT as a RAM/ROM
Distributed RAM/ROM
Modes (Virtex-5)

LUT: other functions

LUT as a RAM/ROM
Distributed RAM/ROM
Modes (Virtex-5)

How to use the distributed
RAM in your design?

LUT: other functions

How to use the distributed RAM


Several options exist; the same applies to other primitive components.
❑ Inference: the synthesizer/mapper extracts the required functionality from generic HDL code. Best portability and flexibility, but it does not work for all primitive components.
❑ Instantiation: direct use of primitive components (templates and descriptions can be found in Language Templates and the Architecture Libraries Guide). Not portable and hard to code, but fully controllable.
❑ IP Core / Wizard: easy to use, optimized implementation, but not portable, less flexible, and not available for all primitive components.
❑ Special macros: for example Xilinx Parameterized Macros (XPM); HDL code (like instantiation) but easier to use, with assured optimization and worse portability (like IP Core / Wizard).

LUT: other functions

Design – inference
Distributed RAM/ROM – using a VHDL code

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
---------------------------------------------------------
ENTITY RAM_64x8 IS
PORT (
clk : IN STD_LOGIC;
WR : IN STD_LOGIC;
ADDR : IN STD_LOGIC_VECTOR (5 DOWNTO 0);
D_in : IN STD_LOGIC_VECTOR (7 DOWNTO 0);
D_out : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
D_out_reg : OUT STD_LOGIC_VECTOR (7 DOWNTO 0));
END RAM_64x8;

LUT: other functions

Design – inference
Distributed RAM/ROM – using a VHDL code

-- memory declaration and initialization
TYPE RamType IS ARRAY (0 TO 63) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
SIGNAL ram_1 : RamType := (X"78", X"76", X"30", OTHERS => X"00");

LUT: other functions

Design – inference
Distributed RAM/ROM – using a VHDL code

-- data write
RAM_write_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
        IF WR = '1' THEN
            ram_1(to_integer(unsigned(ADDR))) <= D_in;
        END IF;
    END IF;
END PROCESS RAM_write_proc;

LUT: other functions

Design – inference
Distributed RAM/ROM – using a VHDL code

-- asynchronous read
D_out <= ram_1(to_integer(unsigned(ADDR)));

-- synchronous read
RAM_read_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
        D_out_reg <= ram_1(to_integer(unsigned(ADDR)));
    END IF;
END PROCESS RAM_read_proc;

LUT: other functions

Design – inference
Distributed RAM/ROM or Block RAM/ROM – using a VHDL code

-- distributed RAM
attribute RAM_STYLE : string;
attribute RAM_STYLE of ram_1: signal is "DISTRIBUTED";

-- block RAM (use instead of the specification above, not together with it)
attribute RAM_STYLE of ram_1: signal is "BLOCK";

Complete list of attribute values (XST, 6 and 7 series FPGAs):
{auto|block|distributed|pipe_distributed|block_power1|block_power2}

LUT: other functions

Design – instantiation
Distributed RAM/ROM – VHDL instantiation (Language templates)

RAM16X1D_1_inst : RAM16X1D
generic map (
INIT => X"0000")
port map (
DPO => DPO, -- Read-only 1-bit data output for DPRA
SPO => SPO, -- R/W 1-bit data output for A0-A3
A0 => A0, -- R/W address[0] input bit
A1 => A1, -- R/W address[1] input bit
A2 => A2, -- R/W address[2] input bit
A3 => A3, -- R/W address[3] input bit
D => D, -- Write 1-bit data input
DPRA0 => DPRA0, -- Read-only address[0] input bit
DPRA1 => DPRA1, -- Read-only address[1] input bit
DPRA2 => DPRA2, -- Read-only address[2] input bit
DPRA3 => DPRA3, -- Read-only address[3] input bit
WCLK => WCLK, -- Write clock input
WE => WE -- Write enable input
);

LUT: other functions

Design – IP Core / Wizard


Distributed RAM/ROM – IP Core Generator (Xilinx ISE)

LUT: other functions

Design – IP Core / Wizard


Distributed RAM/ROM – IP Core Generator (Xilinx Vivado)

LUT: other functions

Design – XPM (Language Templates)

LUT = shift register

LUT: other functions

LUT as a shift register


A very efficient structure for delaying data signals.
For the shift register to be correctly extracted (inferred), the code must not describe any RESET, SET or LOAD function. A CLOCK ENABLE function is allowed. It is possible to use the address input ➔ a shift register with dynamically adjustable length.

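[Figure: a chain of registers clocked by clk shifts D_in towards D_out; a MUX controlled by adr selects the tap that drives output Q]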

LUT: other functions

LUT as a shift register

Artix-7 LUT:

LUT: other functions

LUT as a shift register – inference


TYPE t_shreg IS ARRAY (15 DOWNTO 0) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
SIGNAL shreg : t_shreg := (X"33", X"22", X"11", OTHERS => X"00");
...
-- the shift register functionality
-- No reset allowed (both synchronous and asynchronous)
shreg_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
        IF ce = '1' THEN
            shreg <= shreg(shreg'HIGH-1 DOWNTO 0) & data_i;
        END IF;
    END IF;
END PROCESS shreg_proc;

LUT: other functions

LUT as a shift register – inference


...
-- Shift register output tap selection (dynamic length functionality)
out_select_proc : PROCESS (shreg, addr) BEGIN
CASE addr IS
WHEN "000" => data_o <= shreg(0);
WHEN "001" => data_o <= shreg(1);
...
END CASE;
END PROCESS out_select_proc;

-- equivalent description (to the previous process)
data_o <= shreg(to_integer(unsigned(addr)));
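
For reference, the fragments above can be assembled into one self-contained design. The sketch below is a simplified 1-bit-wide variant with a 4-bit address; the entity name `dyn_shreg` and the port widths are assumptions made for illustration, and whether a given tool maps it to a LUT-based shift register depends on the tool and the target device.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical entity: 16-stage, 1-bit shift register with selectable output tap
entity dyn_shreg is
  port (
    clk    : in  std_logic;
    ce     : in  std_logic;                     -- clock enable (allowed)
    addr   : in  std_logic_vector(3 downto 0);  -- tap selection: delay = addr + 1
    data_i : in  std_logic;
    data_o : out std_logic
  );
end dyn_shreg;

architecture rtl of dyn_shreg is
  signal shreg : std_logic_vector(15 downto 0) := (others => '0');
begin
  -- No reset, set or load is described, only shifting with clock enable
  shreg_proc : process (clk) begin
    if rising_edge(clk) then
      if ce = '1' then
        shreg <= shreg(14 downto 0) & data_i;
      end if;
    end if;
  end process shreg_proc;

  -- Dynamic length: the address selects which stage drives the output
  data_o <= shreg(to_integer(unsigned(addr)));
end rtl;
```

With `addr` = "0000" the output follows the input with a delay of one enabled clock cycle; with `addr` = "1111" the delay is sixteen enabled clock cycles.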

Sequential portion of a logic cell
Register (Flip Flop)

Logic cell

Sequential portion of a logic cell


❑ Today always a D-type register
❑ Parameterized (several settings)
❑ Can be set to LATCH or FLIP-FLOP function; modern FPGAs support only the FLIP-FLOP function.
❑ There is no saving (in terms of HW resources) when using the LATCH functionality instead of the FLIP-FLOP.
❑ The reset can be asynchronous, but this is not recommended for Xilinx FPGAs.

FPGA architecture
Grouping of basic cells

FPGA architecture

SLICE (Xilinx FPGAs)


The logic cells are grouped into bigger units. Xilinx uses the term SLICE for them. The slice size varies and depends on the FPGA architecture.

SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
SLICE = 4 x LUT + 4 x Flip-Flop (Virtex-5)
SLICE = 4 x LUT + 8 x Flip-Flop (Virtex-7)

FPGA architecture

SLICE
SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
The CLOCK SIGNAL is common for the whole slice.
The registers feature a CE (clock enable) input, which is usually also common for the whole slice.
The registers have a set/reset input. Only a single variant of the set/reset function is supported by the registers (either set or reset, either synchronous or asynchronous). Any additional set/reset function (if needed) must be emulated using general-purpose logic (LUTs).
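
A register description that uses only the features listed above (one clock, one clock enable, one synchronous reset) could look like the following minimal sketch. The entity name `dff_ce_sr` and the active-high polarity of `rst` and `ce` are assumptions made for illustration.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical entity: D flip-flop with clock enable and synchronous reset
entity dff_ce_sr is
  port (
    clk : in  std_logic;
    rst : in  std_logic;  -- synchronous reset, active high (assumed polarity)
    ce  : in  std_logic;  -- clock enable, active high (assumed polarity)
    d   : in  std_logic;
    q   : out std_logic
  );
end dff_ce_sr;

architecture rtl of dff_ce_sr is
begin
  reg_proc : process (clk) begin
    if rising_edge(clk) then
      if rst = '1' then
        -- single reset variant: synchronous, reset to '0'
        q <= '0';
      elsif ce = '1' then
        q <= d;
      end if;
    end if;
  end process reg_proc;
end rtl;
```

Describing a second, different set/reset condition on the same register (for example an additional synchronous set) would typically be implemented in the LUT in front of the register rather than in the register itself.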

FPGA architecture

SLICE
A very important part of the logic cells is the so-called CARRY LOGIC, a dedicated high-speed interconnect between neighboring slices. It is often used to implement arithmetic functions, hence the name.

It is a set of multiplexers that enable fast interconnection (chaining) of neighboring slices, so that the general-purpose interconnect need not be used. The general-purpose interconnect has significantly higher latency.
Implementation tools are able to utilize this structure without any user intervention.
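
Because the tools use the carry logic automatically, an adder is normally written with the plain `+` operator. The following is a minimal sketch; the entity name `adder16`, the 16-bit operand width and the registered output are assumptions made for illustration, and the mapping onto the carry chain is up to the implementation tools.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical entity: registered 16-bit unsigned adder with carry out
entity adder16 is
  port (
    clk : in  std_logic;
    a   : in  std_logic_vector(15 downto 0);
    b   : in  std_logic_vector(15 downto 0);
    sum : out std_logic_vector(16 downto 0)  -- bit 16 is the carry out
  );
end adder16;

architecture rtl of adder16 is
begin
  add_proc : process (clk) begin
    if rising_edge(clk) then
      -- Operands are extended by one bit so that the carry is not lost
      sum <= std_logic_vector(resize(unsigned(a), 17) + resize(unsigned(b), 17));
    end if;
  end process add_proc;
end rtl;
```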

FPGA architecture

SLICE
Modern FPGA SLICEs feature additional components that improve their utilization: an XOR gate (the most demanding logic function), signal switches (to enable independent use of the LUT and the register), switches for LUT function selection, etc.

Spartan-3: ½ SLICE

FPGA architecture

Configurable Logic Block


A higher hierarchical unit in Xilinx FPGAs is called a Configurable Logic Block (CLB).
Each CLB contains several SLICEs; some slices have limited functionality.

The number of SLICEs per CLB differs between FPGA families (architectures):
Spartan-3:
CLB = 4 x SLICE
SLICE = 2 x (LUT + Flip-Flop)
Virtex-6:
CLB = 2 x SLICE
SLICE = 4 x (LUT + Flip-Flop)

[Figure: CLB = 4 x SLICE (Spartan-3)]

FPGA architecture

Configurable Logic Block


Virtex-5: each CLB contains 2 fully featured SLICEs and 2 simplified SLICEs
FPGA architecture

Xilinx FPGA: Clock regions, TILEs, Super Logic Regions

Memories in FPGA

Block RAM memories

FPGA: block RAM

Block RAM memories

BRAM
Properties:
• True Dual Port
• Works at FPGA core clock frequency
• Synchronous, optional output registers
• Native support of ECC

Usage
• Fast memories (RAMs, CACHEs)
• Core of a FIFO memory
• Frame/packet buffers
• ROM memories

Block RAM memories

BRAM 36 kb
Available modes:

1 x 36k        2 x 18k
• 32K x 1      • 16K x 1
• 16K x 2      • 8K x 2
• 8K x 4       • 4K x 4
• 4K x 9       • 2K x 9
• 2K x 18      • 1K x 18
• 1K x 36      • 512 x 36
• 512 x 72

The mode can be different for each port.

Block RAM memories

Write and read process (write first mode)

Block RAM memories

Write and read process (read first mode)

Block RAM memories

Write and read process (no change mode)
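
The three write modes differ only in what appears on the data output during a write. The sketch below shows one common way of describing them for a single-port 1K x 18 memory. It is an illustration under stated assumptions, not a vendor template: the entity name `bram_1kx18`, the `MODE` generic and its string values are hypothetical, and whether a tool infers a block RAM in the intended mode should be checked in the synthesis report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical entity: single-port synchronous RAM with selectable write mode
entity bram_1kx18 is
  generic (
    MODE : string := "READ_FIRST"  -- "WRITE_FIRST", "READ_FIRST" or "NO_CHANGE"
  );
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  std_logic_vector(9 downto 0);
    din  : in  std_logic_vector(17 downto 0);
    dout : out std_logic_vector(17 downto 0)
  );
end bram_1kx18;

architecture rtl of bram_1kx18 is
  type ram_t is array (0 to 1023) of std_logic_vector(17 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- Write first: during a write, the newly written data appears on the output
  g_wf : if MODE = "WRITE_FIRST" generate
    process (clk) begin
      if rising_edge(clk) then
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
          dout <= din;
        else
          dout <= ram(to_integer(unsigned(addr)));
        end if;
      end if;
    end process;
  end generate g_wf;

  -- Read first: during a write, the previously stored data appears on the output
  g_rf : if MODE = "READ_FIRST" generate
    process (clk) begin
      if rising_edge(clk) then
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
        end if;
        -- ram is a signal, so this still reads the old content
        dout <= ram(to_integer(unsigned(addr)));
      end if;
    end process;
  end generate g_rf;

  -- No change: during a write, the output keeps its previous value
  g_nc : if MODE = "NO_CHANGE" generate
    process (clk) begin
      if rising_edge(clk) then
        if we = '1' then
          ram(to_integer(unsigned(addr))) <= din;
        else
          dout <= ram(to_integer(unsigned(addr)));
        end if;
      end if;
    end process;
  end generate g_nc;
end rtl;
```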

Block RAM memories

BRAM as a core of a FIFO memory

Block RAM memories

BRAM as a core of a FIFO memory

Block RAM memories

BRAM as a core of a FIFO memory

Block RAM memories

BRAM – microsequencer (FSM)


Block RAM memories

Spartan-3, Virtex-II,4: BRAM 18k


Spartan-6, Virtex-5,6,7: BRAM 36k
Spartan-3 xc3s200 (15 USD): 12 x 18 kb = 216 kb (27 kB)
Spartan-6 xc6slx4 (12 USD): 4 x 36 kb = 144 kb
Spartan-6 xc6slx150 (165 USD): 134 x 36 kb = 4.8 Mb
Virtex-7 xc7vx1140t (16 800 USD): 1880 x 36 kb = 67.7 Mb
Kintex-7 UltraScale xcku115 (5 500 USD): 75.9 Mb
Virtex-7 UltraScale xcvu440 (55 000 USD): 88.6 Mb (11 MB)

RAM memories

Xilinx 7-series UltraScale+: UltraRAM


Larger block memories 4Kx72 (288 Kb = 36 kB)
Up to 432 blocks on a single chip
Virtex-7 UltraScale+ (VU13P)
Distributed RAM 48 Mb (6 MB)
BlockRAM 95 Mb (12 MB)
UltraRAM 360 Mb (45 MB)

RAM memories

Xilinx: DRAM memory (HBM) integration into FPGA

RAM memories

Xilinx: DRAM memory (HBM) integration into FPGA


Integration of HBM chips directly
into the FPGA package: a silicon
interposer is used to connect FPGA
to the HBM.

RAM memories

Intel: DRAM memory (HBM) integration into FPGA


Aggregated throughput up to 512 GB/s

For comparison: 10 x DDR4 DIMM has a peak throughput of 256 GB/s

Thank you for your attention!
