Tutorial 08 LUT FPGA Architecture RAM

Programmable Logic Devices
Tutorial 8
Michal Kubíček
Department of Radio Electronics, FEEC BUT Brno
Created with the support of the OP VVV project Moderní a otevřené studium techniky (Modern and Open Study of Technology), CZ.02.2.69/0.0/0.0/16_015/0002430.
Tutorial 8
FPGAs in detail
❑ Logic cells
❑ FPGA architecture
❑ Memories in FPGA
kubicek@vutbr.cz
Logic cells
LUT / MUX technology
Logic cells
FPGA versus CPLD

❑ The same basic structure: configurable logic cells connected with a
programmable interconnect structure, complemented with configurable IO cells.
[Figure: FPGA and CPLD block diagrams]
So what is the (main) difference?
CPLD cell: coarse grain architecture
FPGA cell: fine grain architecture
[Figure: each cell consists of inputs, a function generator and a flip-flop]
Logic cells
Logic cell
A logic cell is composed of a combinatorial block and a sequential block.
[Figure: CPLD logic cell]
[Figure: FPGA logic cell]
Logic cells
Logic cell
❑ The FPGA architecture is medium grained; each cell
performs a relatively simple logic function.
❑ This leads to better area/utilization efficiency and a
higher maximum frequency, but puts higher
requirements on the programmable interconnect.
❑ Granularity: a CPLD is considered coarse
grained, while an ASIC is considered fine grained.
Combinatorial function generator
Logic cells
MAP process
[Figure: RTL Schematic and Technology Schematic]
Logic cells
MAP result: each LUT has its content (a defined logic function)
O = ((I0 * I1 * !I2 * !I3) + (!I0 * !I1 * I2 * I3) + (I0 * I1 * I2 * I3) + (!I0 * !I1 * !I2 * !I3));
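The LUT content behind such an equation can be sketched as a 16-bit constant addressed by the four inputs. The following is a minimal behavioral sketch, not the mapper's actual output; the entity name lut4_example, the port names and the constant LUT_INIT are assumptions made for this illustration. With I0 as the least significant address bit, the function above is '1' for the input combinations 0000, 0011, 1100 and 1111, which gives the content X"9009".

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity name, used only for this illustration
ENTITY lut4_example IS
  PORT (
    I : IN  STD_LOGIC_VECTOR (3 DOWNTO 0);  -- I(3) = I3 ... I(0) = I0
    O : OUT STD_LOGIC);
END lut4_example;

ARCHITECTURE Behavioral OF lut4_example IS
  -- LUT content: bit n holds the output value for input combination n.
  -- Ones at addresses 0 ("0000"), 3 ("0011"), 12 ("1100") and 15 ("1111").
  CONSTANT LUT_INIT : STD_LOGIC_VECTOR (15 DOWNTO 0) := X"9009";
BEGIN
  -- the inputs act as the address of a 16 x 1b memory
  O <= LUT_INIT(to_integer(unsigned(I)));
END Behavioral;
```

The same function can also be read as (I0 xnor I1) and (I2 xnor I3); the LUT does not care how the function is written, only which of the 16 cells hold a '1'.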
Logic cells
Combinatorial portion of the logic cell

[Figure: FPGA logic cell: inputs, function generator, flip-flop]
Two variants:
• based on multiplexers (MUX based)
• based on look-up table (LUT based)
Logic cells
"Y = 1 when C = 1 or when A and B = 1"
[Figure: MUX based and LUT based implementation of the function]
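The two variants can be compared in a short behavioral sketch of the function Y = C or (A and B). The entity name, the two output names and the constant LUT_INIT are assumptions for this example; the real cell structures are fixed in silicon and only configured by the tools.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity: the same function described in both styles
ENTITY y_func IS
  PORT (
    A, B, C : IN  STD_LOGIC;
    Y_mux   : OUT STD_LOGIC;
    Y_lut   : OUT STD_LOGIC);
END y_func;

ARCHITECTURE Behavioral OF y_func IS
  -- LUT content for address C & B & A:
  -- '1' at address 3 ("011") and at addresses 4 to 7 (C = '1')
  CONSTANT LUT_INIT : STD_LOGIC_VECTOR (7 DOWNTO 0) := X"F8";
  SIGNAL addr : STD_LOGIC_VECTOR (2 DOWNTO 0);
  SIGNAL ab   : STD_LOGIC;
BEGIN
  -- MUX based: two chained 2:1 multiplexers
  ab    <= A   WHEN B = '1' ELSE '0';   -- A and B
  Y_mux <= '1' WHEN C = '1' ELSE ab;    -- C or (A and B)

  -- LUT based: the inputs address a small memory
  addr  <= C & B & A;
  Y_lut <= LUT_INIT(to_integer(unsigned(addr)));
END Behavioral;
```

Both outputs are identical for all eight input combinations; the difference is only in how the function is built.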
Logic cells
Combinatorial portion of the logic cell

❑ The MUX structure is better suited for implementing data control and switching
logic.
❑ The LUT structure is MUCH better for implementing arithmetic functions.
Moreover, it is easier to handle for automatic synthesis tools (Computer
Aided Design; CAD).
The LUT architecture has dominated since the early 1990s; virtually no other FPGAs
are available today (which does not mean that MUX based FPGAs will not
appear again at some point in the future).
Logic cells
Optimum LUT size (number of inputs)

The LUT size depends directly on the number of its inputs:
• 3 inputs => 8 memory cells
• 4 inputs => 16 memory cells (Spartan-3, Virtex-II and Virtex-4)
• 6 inputs => 64 memory cells (Spartan-6, Virtex-5, Virtex-6, Virtex-7)
More inputs ➔ larger address decoder (AND stage) ➔ slower propagation (longer delay)
Large LUTs are inefficient for implementing simple logic functions (with few inputs).
Some of the first FPGAs featured a heterogeneous structure composed of both 3-input and 4-input
LUTs with the aim of better utilizing the silicon area. However, this structure was not well suited to
the rather simple implementation tools (synthesizers, mappers) of the day.
Logic cells

So small LUTs are better because they are faster and more efficient, right?
No, not really: for more complex logic functions (with more inputs) it is necessary to
chain several simple LUTs ➔ the programmable interconnect must be used ➔ the propagation
delay rises significantly!
A compromise is needed; the choice is up to the FPGA manufacturers.
CPLD: each logic cell can implement a complex logic function ➔ a CPLD is efficient for a small
number of complex functions. When a function is so complex that it cannot fit into a single
cell, several cells must be chained. Each cell by itself already has a relatively large propagation
delay ➔ propagation through several cells severely degrades system performance (FMAX).
FPGA: even relatively simple logic functions must use several logic cells. BUT there is a huge
number of flip-flops available in an FPGA ➔ it is easy and "cheap" to use intensive pipelining ➔
logic cells are used efficiently while system performance is not affected.
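The pipelining idea can be illustrated with a small sketch: a sum of four values is split into two register stages, so that each clock period covers only one adder level. The entity name add4_pipelined and all port and signal names are assumptions for this example; it is one possible arrangement, not a prescribed one.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical example: a + b + c + d with two pipeline stages
ENTITY add4_pipelined IS
  PORT (
    clk        : IN  STD_LOGIC;
    a, b, c, d : IN  STD_LOGIC_VECTOR (7 DOWNTO 0);
    sum        : OUT STD_LOGIC_VECTOR (9 DOWNTO 0));
END add4_pipelined;

ARCHITECTURE Behavioral OF add4_pipelined IS
  SIGNAL ab_reg  : UNSIGNED (8 DOWNTO 0) := (OTHERS => '0');
  SIGNAL cd_reg  : UNSIGNED (8 DOWNTO 0) := (OTHERS => '0');
  SIGNAL sum_reg : UNSIGNED (9 DOWNTO 0) := (OTHERS => '0');
BEGIN
  pipe_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      -- stage 1: two partial sums, each stored in its own registers
      ab_reg  <= resize(unsigned(a), 9) + resize(unsigned(b), 9);
      cd_reg  <= resize(unsigned(c), 9) + resize(unsigned(d), 9);
      -- stage 2: final sum of the registered partial sums
      sum_reg <= resize(ab_reg, 10) + resize(cd_reg, 10);
    END IF;
  END PROCESS pipe_proc;

  sum <= STD_LOGIC_VECTOR(sum_reg);
END Behavioral;
```

The result appears two clock cycles after the inputs, but a new set of inputs can be accepted every cycle; the extra registers come from flip-flops that are already present in the logic cells.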
Logic cells

Spartan-3: 4-input LUT
Virtex-5: 6-input LUT
Virtex-7U: 6-input LUT / 2x5-input LUT
Today, several small LUTs are usually grouped into a logic cell instead of a single large LUT ➔
a more efficient structure (which puts higher requirements on the implementation tools).
Adaptive Logic Module (ALM)

Stratix-10 (Intel)
Real implementation: one pseudo 6-input LUT is composed of four 4-input LUTs.
The result: a very flexible (efficient) structure.
Alternative use of a LUT
LUT = only a logic function generator?
LUT: other functions
Alternative LUT functions

Each 4-input LUT is composed of 16 SRAM cells = 16 registers (flip-flops).
With a small modification of the LUT (internal interconnection of the registers) it is possible to
enable alternative usage of these registers (most FPGAs enable this today).
Not every LUT in the FPGA is capable of all the alternative functions (this depends on the particular FPGA).
A LUT4 can be used as:
• Comb: combinatorial function of four input variables and one output variable.
• RAM: 16b RAM memory ("distributed RAM").
• Shift REG: up to 16b dynamic length shift register; the input signals are used to adjust the shift register length.
LUT = RAM/ROM
LUT as a RAM/ROM
Distributed RAM/ROM
Small and fast memories
[Figure: a single LUT used as a 16 x 1b memory (Address, Data); several 16 x 1b LUTs sharing the address form a 16 x 4b (64b) memory]
Theoretical maximum of distributed RAM in some FPGAs:
• Spartan-3 xc3s500: 73 kb
• Virtex-7 UltraScale+: 46 Mb
LUT as a RAM/ROM
Distributed RAM/ROM
Modes (Virtex-5)
How to use the distributed RAM in your design?
How to use the distributed RAM

Several options exist; the same applies to other primitive components.
❑ Inference: the synthesizer/mapper is able to extract the required functionality from generic HDL
code. Best portability and flexibility, but it does not work for all primitive components.
❑ Instantiation: direct use of primitive components (templates and descriptions can be found in
Language Templates and Architecture Libraries Guide). Not portable, hard to code, but
fully controllable.
❑ IP Core / Wizard: easy to use, optimized implementation, but not portable, less flexible, and not
available for all primitive components.
❑ Special macros: for example Xilinx Parameterized Macros (XPM); HDL code (as with instantiation)
but easier to use, with assured optimization and worse portability (as with IP Core / Wizard).
Design – inference
Distributed RAM/ROM – using a VHDL code
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
---------------------------------------------------------
ENTITY RAM_64x8 IS
  PORT (
    clk       : IN  STD_LOGIC;
    WR        : IN  STD_LOGIC;
    ADDR      : IN  STD_LOGIC_VECTOR (5 DOWNTO 0);
    D_in      : IN  STD_LOGIC_VECTOR (7 DOWNTO 0);
    D_out     : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
    D_out_reg : OUT STD_LOGIC_VECTOR (7 DOWNTO 0));
END RAM_64x8;
---------------------------------------------------------
ARCHITECTURE Behavioral OF RAM_64x8 IS
  -- memory declaration and initialization
  TYPE RamType IS ARRAY (0 TO 63) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
  SIGNAL ram_1 : RamType := (X"78", X"76", X"30", OTHERS => X"00");
BEGIN
  -- data write (synchronous, enabled by WR)
  RAM_write_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      IF WR = '1' THEN
        ram_1(to_integer(unsigned(ADDR))) <= D_in;
      END IF;
    END IF;
  END PROCESS RAM_write_proc;
  -- asynchronous read
  D_out <= ram_1(to_integer(unsigned(ADDR)));
  -- synchronous read
  RAM_read_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      D_out_reg <= ram_1(to_integer(unsigned(ADDR)));
    END IF;
  END PROCESS RAM_read_proc;
END Behavioral;
Distributed RAM/ROM or Block RAM/ROM – using a VHDL code
attribute RAM_STYLE : string;
-- distributed RAM
attribute RAM_STYLE of ram_1: signal is "DISTRIBUTED";
-- block RAM (alternative; use only one of the two specifications)
attribute RAM_STYLE of ram_1: signal is "BLOCK";
Complete list of attribute values (XST, 6 and 7 series FPGAs):
{auto|block|distributed|pipe_distributed|block_power1|block_power2}
Design – instantiation
Distributed RAM/ROM – VHDL instantiation (Language templates)
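-- requires: library UNISIM; use UNISIM.vcomponents.all;
RAM16X1D_1_inst : RAM16X1D
generic map (
  INIT => X"0000")
port map (
  DPO   => DPO,   -- Read-only 1-bit data output for DPRA
  SPO   => SPO,   -- R/W 1-bit data output for A0-A3
  A0    => A0,    -- R/W address[0] input bit
  A1    => A1,    -- R/W address[1] input bit
  A2    => A2,    -- R/W address[2] input bit
  A3    => A3,    -- R/W address[3] input bit
  D     => D,     -- Write 1-bit data input
  DPRA0 => DPRA0, -- Read-only address[0] input bit
  DPRA1 => DPRA1, -- Read-only address[1] input bit
  DPRA2 => DPRA2, -- Read-only address[2] input bit
  DPRA3 => DPRA3, -- Read-only address[3] input bit
  WCLK  => WCLK,  -- Write clock input
  WE    => WE     -- Write enable input
);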
Design – IP Core / Wizard

Distributed RAM/ROM – IP Core Generator (Xilinx ISE)
Distributed RAM/ROM – IP Core Generator (Xilinx Vivado)
Design – XPM (Language Templates)
LUT = shift register
LUT as a shift register

A very efficient structure for delaying data signals.
For the shift register to be correctly extracted (inferred), the code must not describe any RESET,
SET or LOAD function. A CLOCK ENABLE function is allowed. It is possible to use an address input ➔
a shift register with dynamically adjustable length.
[Figure: shift register chain (D_in, clk) with an output MUX selected by adr and driving D_out]
LUT as a shift register
Artix-7 LUT:
LUT as a shift register – inference

TYPE t_shreg IS ARRAY (15 DOWNTO 0) OF
  STD_LOGIC_VECTOR(7 DOWNTO 0);
SIGNAL shreg : t_shreg := (X"33", X"22", X"11",
  OTHERS => X"00");
...
-- the shift register functionality
-- No reset allowed (both synchronous and asynchronous)
shreg_proc : PROCESS (clk) BEGIN
  IF rising_edge(clk) THEN
    IF ce = '1' THEN
      shreg <= shreg(shreg'HIGH-1 DOWNTO 0) & data_i;
    END IF;
  END IF;
END PROCESS shreg_proc;
LUT as a shift register – inference

...
-- Shift register output tap selection (dynamic length functionality)
out_select_proc : PROCESS (shreg, addr) BEGIN
CASE addr IS
WHEN "000" => data_o <= shreg(0);
WHEN "001" => data_o <= shreg(1);
...
END CASE;
END PROCESS out_select_proc;
-- equivalent description (to the previous process)
data_o <= shreg(to_integer(unsigned(addr)));
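For reference, the fragments above can be put together into one self-contained unit. This is a minimal sketch for a 1-bit wide, 16-stage register; the entity name shreg_dyn and the 4-bit width of addr are assumptions for this example, and whether a particular synthesizer maps it into a single LUT depends on the tool and the FPGA family.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity: 16-stage shift register with a selectable output tap
ENTITY shreg_dyn IS
  PORT (
    clk    : IN  STD_LOGIC;
    ce     : IN  STD_LOGIC;
    addr   : IN  STD_LOGIC_VECTOR (3 DOWNTO 0);
    data_i : IN  STD_LOGIC;
    data_o : OUT STD_LOGIC);
END shreg_dyn;

ARCHITECTURE Behavioral OF shreg_dyn IS
  SIGNAL shreg : STD_LOGIC_VECTOR (15 DOWNTO 0) := (OTHERS => '0');
BEGIN
  -- shifting only; no reset, set or load is described
  shreg_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      IF ce = '1' THEN
        shreg <= shreg(shreg'HIGH-1 DOWNTO 0) & data_i;
      END IF;
    END IF;
  END PROCESS shreg_proc;

  -- output tap selection: the delay is addr + 1 enabled clock cycles
  data_o <= shreg(to_integer(unsigned(addr)));
END Behavioral;
```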
Sequential portion of a logic cell
Register (Flip Flop)
Logic cell
Sequential portion of a logic cell

❑ Today always a D-type register
❑ Parameterized (several settings)
❑ Can be set to LATCH or FLIP-FLOP
function; modern FPGAs support only
the FLIP-FLOP function.
❑ There is no saving (in terms of HW
resources) when using the LATCH
functionality instead of the FLIP-FLOP.
❑ The reset can be asynchronous, but this
is not recommended for Xilinx FPGAs.
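A register description that follows these points might look like the sketch below: a D-type flip-flop with a clock enable and a synchronous reset. The entity and port names are assumptions for this example.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- hypothetical entity: D flip-flop with clock enable and synchronous reset
ENTITY dff_ce_srst IS
  PORT (
    clk  : IN  STD_LOGIC;
    ce   : IN  STD_LOGIC;
    srst : IN  STD_LOGIC;
    d    : IN  STD_LOGIC;
    q    : OUT STD_LOGIC);
END dff_ce_srst;

ARCHITECTURE Behavioral OF dff_ce_srst IS
  SIGNAL q_reg : STD_LOGIC := '0';
BEGIN
  dff_proc : PROCESS (clk) BEGIN
    -- everything happens on the clock edge, including the reset
    IF rising_edge(clk) THEN
      IF srst = '1' THEN
        q_reg <= '0';
      ELSIF ce = '1' THEN
        q_reg <= d;
      END IF;
    END IF;
  END PROCESS dff_proc;

  q <= q_reg;
END Behavioral;
```

Because only clk is in the sensitivity list and the reset is tested inside the rising_edge branch, the reset is synchronous.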
FPGA architecture
Grouping of basic cells
FPGA architecture
SLICE (Xilinx FPGAs)

The logic cells are grouped into bigger units. Xilinx uses the term SLICE for them. The
slice size varies and depends on the FPGA architecture.
SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
SLICE = 4 x LUT + 4 x Flip-Flop (Virtex-5)
SLICE = 4 x LUT + 8 x Flip-Flop (Virtex-7)
FPGA architecture
SLICE
SLICE = 2 x LUT + 2 x Flip-Flop (Spartan-3)
The CLOCK SIGNAL is common for the whole slice.
The registers feature a CE (clock enable) input, which is usually also common for the
whole slice.
The registers have a set/reset input. Only a single variant of the set/reset function is
supported by the registers (either set or reset, either synchronous or asynchronous). Any
additional set/reset function (if needed) must be emulated using general purpose logic
(LUTs).
FPGA architecture
SLICE
A very important part of the logic cells is the so-called CARRY LOGIC, which is a dedicated high-
speed interconnect of neighboring slices. It is often used to implement arithmetic functions,
thus the name.
It is a set of multiplexers that enable fast
interconnect (chaining) of neighboring slices so that
the general purpose interconnect need not be
used. The general purpose interconnect has
significantly higher latency.
Implementation tools are able to utilize this
structure without any user intervention.
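A plain adder is a typical description that the tools map onto the carry logic on their own. The sketch below assumes a 16-bit unsigned adder with a carry-out bit; the entity name adder16 is hypothetical, and the actual mapping depends on the synthesizer and the device.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity: 16-bit adder, result one bit wider (carry out)
ENTITY adder16 IS
  PORT (
    a, b : IN  STD_LOGIC_VECTOR (15 DOWNTO 0);
    sum  : OUT STD_LOGIC_VECTOR (16 DOWNTO 0));
END adder16;

ARCHITECTURE Behavioral OF adder16 IS
BEGIN
  -- the "+" operator is all that is written; no carry chain is described
  sum <= STD_LOGIC_VECTOR(resize(unsigned(a), 17) + resize(unsigned(b), 17));
END Behavioral;
```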
FPGA architecture
SLICE
Modern FPGA SLICEs feature additional components that improve their utilization: an XOR gate
(the most demanding logic function), signal switches (to enable independent utilization of the
LUT and the register), switches for LUT function selection, etc.
Spartan-3: ½ SLICE
FPGA architecture
Configurable Logic Block

A higher hierarchical unit in Xilinx FPGAs is called the Configurable Logic Block (CLB).
There are several SLICEs in each CLB; some slices have limited functionality.
The number of SLICEs per CLB can be different for each FPGA family (architecture):
Spartan-3:
CLB = 4 x SLICE
SLICE = 2 x (LUT + Flip-Flop)
Virtex-6:
CLB = 2 x SLICE
SLICE = 4 x (LUT + Flip-Flop)
[Figure: CLB = 4 x SLICE (Spartan-3)]
FPGA architecture
Configurable Logic Block

Virtex-5: each CLB contains 2 fully featured SLICEs and 2 simplified SLICEs
FPGA architecture
Xilinx FPGA: Clock regions, TILEs, Super Logic Regions
Memories in FPGA
Block RAM memories
FPGA: block RAM
Block RAM memories
BRAM
Properties:
• True Dual Port
• Works at FPGA core clock frequency
• Synchronous, optional output registers
• Native support of ECC
Usage
• Fast memories (RAMs, CACHEs)
• Core of a FIFO memory
• Frame/packet buffers
• ROM memories
Block RAM memories
BRAM 36 kb
Available modes:
1 x 36k      2 x 18k
32K x 1      16K x 1
16K x 2      8K x 2
8K x 4       4K x 4
4K x 9       2K x 9
2K x 18      1K x 18
1K x 36      512 x 36
512 x 72
The mode can be different for each port.
Block RAM memories
Write and read process (write first mode)
Block RAM memories
Write and read process (read first mode)
Block RAM memories
Write and read process (no change mode)
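The difference between the three modes is what the data output shows during a write. The following behavioral sketch puts the three behaviors side by side on one memory array so they can be compared in simulation; the entity name bram_modes and the three output names are assumptions for this example. In a real design only one of the outputs would be kept, since a block RAM port is configured for a single mode.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity: comparison of the three write modes (simulation sketch)
ENTITY bram_modes IS
  PORT (
    clk           : IN  STD_LOGIC;
    we            : IN  STD_LOGIC;
    addr          : IN  STD_LOGIC_VECTOR (9 DOWNTO 0);
    d_in          : IN  STD_LOGIC_VECTOR (7 DOWNTO 0);
    d_write_first : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
    d_read_first  : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
    d_no_change   : OUT STD_LOGIC_VECTOR (7 DOWNTO 0));
END bram_modes;

ARCHITECTURE Behavioral OF bram_modes IS
  TYPE RamType IS ARRAY (0 TO 1023) OF STD_LOGIC_VECTOR (7 DOWNTO 0);
  SIGNAL ram : RamType := (OTHERS => (OTHERS => '0'));
BEGIN
  ram_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      -- read first: the output always shows the OLD content of the cell
      d_read_first <= ram(to_integer(unsigned(addr)));
      IF we = '1' THEN
        ram(to_integer(unsigned(addr))) <= d_in;
        -- write first: the output shows the data being written
        d_write_first <= d_in;
        -- no change: the output keeps its value during a write
      ELSE
        d_write_first <= ram(to_integer(unsigned(addr)));
        d_no_change   <= ram(to_integer(unsigned(addr)));
      END IF;
    END IF;
  END PROCESS ram_proc;
END Behavioral;
```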
Block RAM memories
BRAM as a core of a FIFO memory
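A simple single-clock FIFO built around an inferred memory can be sketched as follows. This is a minimal illustration, not the structure of any vendor FIFO core; the entity name fifo_sync, the 1K x 8 organization and all port names are assumptions for this example.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- hypothetical entity: synchronous FIFO, 1024 x 8b, single clock
ENTITY fifo_sync IS
  PORT (
    clk   : IN  STD_LOGIC;
    wr_en : IN  STD_LOGIC;
    rd_en : IN  STD_LOGIC;
    d_in  : IN  STD_LOGIC_VECTOR (7 DOWNTO 0);
    d_out : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
    empty : OUT STD_LOGIC;
    full  : OUT STD_LOGIC);
END fifo_sync;

ARCHITECTURE Behavioral OF fifo_sync IS
  TYPE RamType IS ARRAY (0 TO 1023) OF STD_LOGIC_VECTOR (7 DOWNTO 0);
  SIGNAL ram : RamType := (OTHERS => (OTHERS => '0'));
  -- pointers are one bit wider than the address: the extra MSB
  -- distinguishes "full" from "empty" when the addresses are equal
  SIGNAL wr_ptr  : UNSIGNED (10 DOWNTO 0) := (OTHERS => '0');
  SIGNAL rd_ptr  : UNSIGNED (10 DOWNTO 0) := (OTHERS => '0');
  SIGNAL empty_i : STD_LOGIC;
  SIGNAL full_i  : STD_LOGIC;
BEGIN
  empty_i <= '1' WHEN wr_ptr = rd_ptr ELSE '0';
  full_i  <= '1' WHEN (wr_ptr(10) /= rd_ptr(10)) AND
                      (wr_ptr(9 DOWNTO 0) = rd_ptr(9 DOWNTO 0)) ELSE '0';

  fifo_proc : PROCESS (clk) BEGIN
    IF rising_edge(clk) THEN
      -- write side: store data and advance the write pointer
      IF wr_en = '1' AND full_i = '0' THEN
        ram(to_integer(wr_ptr(9 DOWNTO 0))) <= d_in;
        wr_ptr <= wr_ptr + 1;
      END IF;
      -- read side: synchronous read, as required by a block RAM
      IF rd_en = '1' AND empty_i = '0' THEN
        d_out  <= ram(to_integer(rd_ptr(9 DOWNTO 0)));
        rd_ptr <= rd_ptr + 1;
      END IF;
    END IF;
  END PROCESS fifo_proc;

  empty <= empty_i;
  full  <= full_i;
END Behavioral;
```

The pointers rely on their initial values instead of a reset; writes to a full FIFO and reads from an empty one are ignored.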
Block RAM memories
BRAM – microsequencer (FSM)

Block RAM memories
Spartan-3, Virtex-II, Virtex-4: BRAM 18k
Spartan-6, Virtex-5, Virtex-6, Virtex-7: BRAM 36k
Spartan-3 xc3s200 (15 USD): 12 x 18 kb = 216 kb (27 kB)
Spartan-6 xc6slx4 (12 USD): 4 x 36 kb = 144 kb
Spartan-6 xc6slx150 (165 USD): 134 x 36 kb = 4.8 Mb
Virtex-7 xc7vx1140t (16 800 USD): 1880 x 36 kb = 67.7 Mb
Kintex-7 UltraScale xcku115 (5 500 USD): 75.9 Mb
Virtex-7 UltraScale xcvu440 (55 000 USD): 88.6 Mb (11 MB)
RAM memories
Xilinx 7-series UltraScale+: UltraRAM

Larger block memories 4Kx72 (288 Kb = 36 kB)
Up to 432 blocks on a single chip
Virtex-7 UltraScale+ (VU13P)
Distributed RAM 48 Mb (6 MB)
BlockRAM 95 Mb (12 MB)
UltraRAM 360 Mb (45 MB)
RAM memories
Xilinx: DRAM memory (HBM) integration into FPGA
Integration of HBM chips directly
into the FPGA package: a silicon
interposer is used to connect the FPGA
to the HBM.
RAM memories
Intel: DRAM memory (HBM) integration into FPGA

Aggregate throughput of up to 512 GB/s
For comparison: 10 x DDR4 DIMM have a peak throughput of 256 GB/s
Thank you for your attention!
