
Reconfigurable Computing

AEL ZG 554 / ES ZG 554 / MEL ZG 554


Session 1
Pawan Sharma
ps@pilani.bits-pilani.ac.in
BITS Pilani, Pilani Campus
23/07/2022
Today’s Lecture

Course Overview
• Module: Introduction to Reconfigurable Computing
– General Purpose Computing [T1. Sec 1]
– Domain and Application specific processors [T1. Sec 2 & 3]
– Reconfigurable Computing [T1. Sec 4]
– Fields of Application [T1. Sec 5.1 to 5.4]

BITS Pilani, Pilani Campus
Course Overview

Hardwired Technology

• ASIC based, or a set of individual components forming a board-level solution
• Designed specifically to perform a given computation
• Very fast and efficient when executing the exact computation it was designed for
• Cannot be altered after fabrication
• Any change forces a redesign and refabrication
• Very expensive if already deployed in a large number of systems
• The same holds for board-level solutions: they cannot be replaced in the event of a change or upgrade in the application

Software Programmed Processors

• General purpose processors
• Far more flexible than an ASIC
• Execute a set of instructions to perform a computation
• Change the instructions to change functionality without changing hardware
• Poor performance compared to an ASIC
– The processor must read each instruction from memory, decode its meaning, and only then execute it
– This results in a high execution overhead for each individual operation
– New operations must be built out of existing instructions, as the ISA is determined at fabrication



New Trend in Computation

• Reconfigurable Computing aims at filling this gap between hardware
and software by blending the benefits of both.
• Achieves potentially much higher performance than software, while
maintaining a higher level of flexibility than hardware.
• Emerged as an important organizational structure for implementing
computations.
• It combines the post-fabrication programmability or temporal
computational style of software programmed processors with the
spatial computational style most commonly employed in hardware
designs.
• The result changes traditional “hardware” and “software”
boundaries, providing an opportunity for greater computational
capacity and density within a programmable media.



Reconfigurable Devices

• Reconfigurable devices, including field-programmable gate
arrays (FPGAs), contain an array of computational
elements whose functionality is determined through
multiple programmable configuration bits.
• These elements, sometimes known as logic blocks, are
connected using a set of programmable routing resources.
• In this way, custom digital circuits can be mapped to the
reconfigurable hardware by computing the logic functions
of the circuit within the logic blocks, and using the
configurable routing to connect the blocks together to
form the necessary circuit.



$y = Ax^2 + Bx + C$

Temporal Implementation

• A small number of more general compute
resources are reused in time, allowing the
computation to be implemented compactly
• Generalized – can perform many functions
well
• Sequential – inherently constrained even
with multiple data paths
• Fixed logic – data sizes, number of
computational units, etc. cannot be
changed

Spatially Configurable Implementation

• Each operator exists at a different point in space, allowing the computation to
exploit parallelism to achieve high throughput and low computational latencies
• Parallelism customized to meet design objectives
• Logic specialization to perform a specific function
• Hardware-level adaptation of functionality to meet changing problem
requirements

With these two options, we implicitly connect spatial processing with hardware computation and temporal processing with software.
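
As a rough illustration of the two styles, the sketch below evaluates y = Ax^2 + Bx + C twice in Python: once by stepping a single shared unit through one operation at a time, and once by grouping the operators that could fire together if each had its own hardware. The function names and the step and level counts are illustrative assumptions, not figures taken from the slide.

```python
def temporal_eval(a, b, c, x):
    """Temporal style: one shared compute unit, one operation per step."""
    steps = 0
    t1 = x * x
    steps += 1
    t2 = a * t1
    steps += 1
    t3 = b * x
    steps += 1
    t4 = t2 + t3
    steps += 1
    y = t4 + c
    steps += 1
    return y, steps


def spatial_eval(a, b, c, x):
    """Spatial style: every operator has its own hardware, so operators
    whose inputs are ready are grouped into the same level."""
    # Level 1: both multipliers that only need the inputs work side by side.
    t1, t3 = x * x, b * x
    # Level 2: a multiplier and an adder work side by side.
    t2, t4 = a * t1, t3 + c
    # Level 3: the final adder combines the two partial results.
    y = t2 + t4
    levels = 3
    return y, levels


y_temporal, steps = temporal_eval(2, 3, 4, 5)
y_spatial, levels = spatial_eval(2, 3, 4, 5)
```

Both functions return the same value; only the number of sequential steps differs (five operations one after another versus three levels of operators).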



Introduction to FPGAs

• Field-Programmable Gate Arrays
– Literally, an array of logic gates that can be programmed with
new functionality in the field.
• Target Applications
– Image/video processing
– Cryptographic ciphers
– Military and aerospace applications
• What are the advantages of FPGA technology?
– Algorithmic agility / upload
– Cost efficiency
– Resource efficiency
– Throughput



• FPGAs can be customised to solve any problem after
device fabrication
• Exploit a large degree of spatially customized
computation in order to perform their computation
• Reconfigurable devices have the obvious benefit of
spatial parallelism, allowing them to perform more
operations per cycle.
• FPGAs contain an array of computational elements
• Functionality is determined through multiple
programmable configuration bits.
LUT based logic element

[Figure: 4-LUT logic element with inputs I1 to I4, carry logic (Cout), a D flip-flop (DFF), and output OUT]

• Each LUT operates on four one-bit inputs
• Output is one data bit
• Can perform any Boolean function of
four inputs
• $2^{2^4} = 65536$ functions (4096 patterns)
• The basic logic element can be more
complex (multiplier, ALU, etc.)
• Contains some sort of programmable
interconnect
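
To make the idea of configuration bits concrete, here is a minimal Python sketch of a 4-input LUT. It assumes the 16 configuration bits are simply the truth table, indexed by the four inputs; the class name `Lut4`, the helper `program_lut4`, and the bit ordering (I1 as the least significant address bit) are hypothetical choices for illustration and do not describe any particular vendor's device.

```python
class Lut4:
    """A 4-input look-up table: 16 configuration bits hold the truth table."""

    def __init__(self, config_bits):
        if not 0 <= config_bits < 2 ** 16:
            raise ValueError("a 4-LUT has exactly 16 configuration bits")
        self.config_bits = config_bits

    def evaluate(self, i1, i2, i3, i4):
        # The four one-bit inputs form an address into the truth table.
        address = (i4 << 3) | (i3 << 2) | (i2 << 1) | i1
        return (self.config_bits >> address) & 1


def program_lut4(func):
    """Derive the configuration bits for any Boolean function of four inputs."""
    bits = 0
    for address in range(16):
        i1 = address & 1
        i2 = (address >> 1) & 1
        i3 = (address >> 2) & 1
        i4 = (address >> 3) & 1
        if func(i1, i2, i3, i4):
            bits |= 1 << address
    return Lut4(bits)


and4 = program_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = program_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
number_of_functions = 2 ** (2 ** 4)  # one function per 16-bit configuration
```

The same hardware implements AND, XOR, or any other four-input function; only the stored bits change.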



• These programmable elements, known as logic
blocks, are connected using a set of routing
resources that are also programmable.
• Custom digital circuits can be mapped to the
reconfigurable hardware by computing the logic
functions of the circuit within the logic blocks
• Using the configurable routing to connect the blocks
together to form the necessary circuit
• Machines based on these FPGAs have achieved
impressive performance, often achieving 100x the
performance of processor alternatives and 10x to 100x
the performance per unit of silicon area



Classify Reconfigurable Systems

• Current reconfigurable computing systems can be
classified by three main design decisions:
• Granularity of programmable hardware
– Low-level components with traditional ASIC design flow?
– More complex base units like multipliers, ALUs, etc.?
• Proximity of the CPU to the programmable hardware
– On the chip? On the bus? On the board? On the network?
• Capacity
– How many equivalent ASIC gates?
– How to allocate resources? Set ratios of memory to
computation to interconnect?

Higher level diagram of FPGA

• FPGAs are composed of the following:
– Configurable Logic Blocks (CLBs)
– Programmable interconnect
– Input/Output Buffers (IOBs)
– Other stuff (clock trees, timers, memory, multipliers,
processors, etc.)
• CLBs contain a number of Look-Up Tables (LUTs) and
some sequential storage.
– LUTs are individually configured as logic gates, or can be
combined into n bit wide arithmetic functions.
– Architecture Specific



• Major players in the FPGA industry:
– Chipmakers – device families
• Xilinx – Spartan, Spartan-II, Spartan-3, Virtex, Virtex-II
• Actel – eX, MX, SX, Axcelerator, ProASIC
• Intel (formerly Altera) – ACEX, FLEX, APEX, Cyclone, Mercury, Stratix
• Atmel – AT6000, AT40K
– Software developers – CAD tools
• Synopsys – FPGA Compiler
• Mentor Graphics – HDL Designer, ModelSim
• Synplicity – Synplify, Synplify Pro

Text and Reference Books

T1 Christophe Bobda, Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications, Springer, 2007.

T2 Scott Hauck, André DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, The Morgan Kaufmann Series in Systems on Silicon, 2007.

R1 Wayne Wolf, FPGA-Based System Design, Pearson Education, 2004.

R2 Samir Palnitkar, Verilog HDL, Prentice Hall, 2003.

R3 R. Vaidyanathan, Jerry Trahan, Dynamic Reconfiguration: Architectures and Algorithms, Kluwer Academic, 2003.

R4 Xilinx, Altera and Microsemi architecture reference manuals

R5 Giovanni De Micheli, Synthesis and Optimization of Digital Circuits, Tata McGraw-Hill, 2003.

R6 R. Druyer, L. Torres, P. Benoit, P. V. Bonzom and P. Le-Quere, "A survey on security features in modern FPGAs," 2015 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Bremen, 2015, pp. 1-8. doi: 10.1109/ReCoSoC.2015.7238102
https://www.altera.com/en_US/pdfs/literature/wp/wp-01111-anti-tamper.pdf
http://www.xilinx.com/support/documentation/white_papers/wp365_Solving_Security_Concerns.pdf
http://www.microsemi.com/document-portal/doc_view/132850-secure-architecture-in-microsemi-fpgas-and-soc-fpgas-an-overview

Evaluation Components

• Two lab-based assignments (EC-1, online): 10% + 15%
• Mid-semester Examination (EC-2, open book): 30%
• Comprehensive Examination (EC-3, open book): 45%

Lab Details
• Use of Xilinx Vivado software
• Software and hardware to be accessed from central
remote lab facilities, using Internet
• Need high speed internet connection to access, preferably
>4 Mbps
• If available, you may also use your own copy of the software
and hardware, e.g., the free ISE WebPACK tool from
Xilinx (does not support hardware interface)
• Assignments to be uploaded on course page within
timeline
• Lab reference material would be available on course page



Pre-requisites

• Students should have a basic knowledge of digital logic design,
including basic concepts of logic gates, decoders, multiplexers, flip-flops,
and memory, binary number systems, and simple logic
optimization techniques such as Karnaugh maps (K-maps).
• Knowledge of hardware description languages, such as Verilog or
VHDL, is also helpful.
• Basic knowledge of graph theory and computer programming is useful (not
essential)
• In sum, this course is appropriate for most readers with a
background in electrical engineering, computer science, or
computer engineering.



Computing Paradigms

• The Von Neumann Computer
• Pipelining
• Domain specific processors (DSP)
• Application Specific Integrated Circuits (ASIC)
• Application specific instruction set processors (ASIP)
• Reconfigurable Processors (FPGA)



The Von Neumann Computer

A computer could have a simple structure, capable of executing any
kind of program, given a properly programmed control unit,
without the need for hardware modification
• Simplicity in programming
• Follows sequential way of human thinking

Program execution
• Instruction Fetch (IF): The next instruction to be executed is
fetched from the memory
• Decode (D): The instruction is decoded to determine the
operation
• Read operand (R): The operands are read from the memory
• Execute (EX): The required operation is executed on the ALU
• Write result (W): The result of the operation is written back to
the memory
• Each instruction executes as the cycle (IF, D, R, EX, W)
• In each of those five phases, only the part of the hardware
involved in that phase is activated. The rest remains idle
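
The five phases can be sketched as a tiny interpreter loop in Python. This is a toy model under stated assumptions: the instruction format `(op, src_a, src_b, dest)`, the three-operation ALU, and the memory layout (instructions and data in one list) are all hypothetical and chosen only to mirror the IF, D, R, EX, W sequence.

```python
import operator

# Assumed ALU operations for this toy machine.
ALU_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run(memory, program_length):
    """Execute the instructions in memory[0:program_length].

    Returns the number of phases spent, counting five per instruction.
    """
    phases = 0
    for pc in range(program_length):
        instruction = memory[pc]               # IF: fetch from memory
        op, src_a, src_b, dest = instruction   # D: decode the fields
        a, b = memory[src_a], memory[src_b]    # R: read operands from memory
        result = ALU_OPS[op](a, b)             # EX: execute on the ALU
        memory[dest] = result                  # W: write result to memory
        phases += 5
    return phases


# Instructions and data share one memory, as in the von Neumann model.
memory = [
    ("mul", 3, 4, 6),  # memory[6] = memory[3] * memory[4]
    ("add", 6, 5, 6),  # memory[6] = memory[6] + memory[5]
    ("sub", 6, 3, 6),  # memory[6] = memory[6] - memory[3]
    2, 5, 7, 0,        # data words at addresses 3 to 6
]
phases = run(memory, 3)
```

Three instructions cost fifteen phases here, and at any moment only one line of the loop body (one part of the machine) is doing work.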



Advantages:
• Flexibility: any well coded program can be executed.
• Control Unit gets data and instruction in the same way from one memory. It simplifies
design and development of the Control Unit.
• Data from memory and from devices are accessed in the same way.
• Memory organization is in the hands of programmers.

Disadvantages
• Speed efficiency: Not efficient, due to the sequential program execution (temporal
resource sharing).
• Resource efficiency: Only one part of the hardware resources is required for the
execution of an instruction. The rest remains idle.
• The single bus is a bottleneck. Only one item (instruction or data) can be accessed at a time.
• Memory access: Memories are about 1000 times slower than the processor
• Instructions stored in the same memory as the data can be accidentally overwritten by an
error in a program.
• These drawbacks are compensated for using high clock speeds, pipelining, caches, instruction
pre-fetching, etc.



Pipelining
Sequential execution
• tcycle = cycle execution time
• One instruction needs 5 clock cycles =
5*tcycle
• 3 instructions executed in 15*tcycle


Pipelining
• One instruction needs 5 clock cycles =
5*tcycle
• No improvement in the time per instruction, BUT
an increase in throughput
• 3 instructions need 7*tcycle in the ideal case
due to overlapped execution of instructions
(pipelining)
• 9*tcycle considering pipelining hazards as in
Harvard arch

Even with pipelining and other improvements like caches, the execution remains sequential (Amdahl’s constraints).
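
The cycle counts on this slide follow from a simple counting argument, sketched below in Python. The function names are hypothetical, and `stall_cycles=2` is an assumption chosen only to reproduce the 9*tcycle figure; the slide does not say where the hazards occur.

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Each instruction finishes all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5, stall_cycles=0):
    """The first instruction fills the pipeline; each later one completes
    one cycle after its predecessor, plus any stall cycles from hazards."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


sequential = sequential_cycles(3)                    # 3 * 5
ideal = pipelined_cycles(3)                          # 5 + 2
with_hazards = pipelined_cycles(3, stall_cycles=2)   # 5 + 2 + 2 (assumed stalls)
```

Note that the latency of a single instruction stays at five cycles in every case; only the time to finish a group of instructions shrinks.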



Domain specific processors
• Data path is tailored for a class of algorithms
• Overcome the drawback of the von Neumann computer – reduced
memory access
• DSP (Digital Signal Processors)
• To speed-up computation of repetitive, numerically intensive tasks in
signal processing areas such as telecommunication, multimedia,
automobile, image processing, etc
• Signal processing applications are usually multiply accumulate (MAC)
dominated.
• Datapath optimized to execute one or many MACs in only one cycle.
• Instruction fetching and decoding overhead is removed due to
availability of specialized hardware and instructions
• Memory access is limited by directly processing the input dataflow



Application Specific Instruction Set Processor (ASIP)

• An ASIP is a processor that can be specialized to a particular application domain.
• A compromise between the two extremes, ASIC and GPP
• ASIPs and ASICs are designed for a very specific application, but ASIPs are
programmable and therefore almost as flexible as a GPP
• Contains static logic defining a minimum ISA and configurable logic to design new
instructions
• ASIPs are well suited to integration in embedded systems or SoCs. The goal is to optimize
area and power, and the design effort is more at the system level than at the hardware level
• The instruction set of the application is directly implemented in hardware.
• DSPs may also fall under this domain, but parallelism is more pronounced in ASIPs
• Extend the main processor datapath and act as accelerators, offloading performance- or
power-critical operations from the main processor
• Act as “activable” accelerators
• Examples: ASIPs for image processing, network processors, cryptographic processors,
Cisco Quantum Flow Processor
ASIP
Implementation on a VN computer:

    if (a < b) then
    {
        d = a + b;
        c = a * b;
    }
    else
    {
        d = a + 1;
        c = b - 1;
    }

• At least 3 instructions in the program
• run-time >= 3 * tinstruction = 3 × 5 × tcycle = 15 * tcycle

ASIP implementation:

• The complete execution is done in parallel in one clock cycle
• run-time = tmax = delay of the longest path from input to output

• The VN processor can compete with this ASIP
only if 15 * tcycle < tmax, i.e. tcycle <
tmax/15. The VN must be at least 15 times
faster than the ASIP to be competitive.
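
One way to picture the single-cycle ASIP implementation is a datapath that computes every candidate result side by side and lets the comparator pick the outputs. The Python sketch below models that idea and the break-even condition from the slide; the function names are hypothetical, and treating the selection as two multiplexers is an assumption about how such a circuit might be built, not a description of a specific ASIP.

```python
def asip_datapath(a, b):
    """Model of a combinational datapath for the if/else example."""
    # All four candidate results are computed by separate operators.
    sum_ab = a + b
    product_ab = a * b
    a_plus_one = a + 1
    b_minus_one = b - 1
    # The comparator output drives both multiplexers.
    select = a < b
    d = sum_ab if select else a_plus_one
    c = product_ab if select else b_minus_one
    return d, c


def vn_is_competitive(t_cycle, t_max, cycles_needed=15):
    """The VN processor wins only if 15 * tcycle < tmax."""
    return cycles_needed * t_cycle < t_max


taken_branch = asip_datapath(2, 3)
else_branch = asip_datapath(3, 2)
```

For example, with tmax = 20 time units the VN processor is competitive at tcycle = 1 but not at tcycle = 2.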



Application Specific Integrated Circuit (ASIC)

• Optimize the complete circuit for a given function


• Optimization is done by implementing the inherent parallel
structure on a chip
• The data path is optimized for only one application.
• Instruction fetching and decoding overhead is removed
• Instruction set implemented in hardware
• Memory access is limited by directly processing the input data flow
• Exploitation of parallel computation
• ASIC examples: baseband processor in mobile phones, chipsets in
PCs, MPEG encoder/decoder, DSP functions



When to use RC?

RC devices enable the design of digital circuits without fabricating a device
• Therefore, RC can be used anytime a digital circuit is needed
• Examples: ASIC prototyping, ASIC replacement,
replacing/accelerating microprocessors
• But, when should RC be used instead of alternative technologies?

[Figure: Implementation possibilities along a performance axis: Microprocessor, RC (FPGA, CPLD, etc.), ASIC]

Why not use an ASIC for everything?



When to use RC?

1. When it provides the cheapest solution
• Depends on:
• NRE cost - Non-recurring engineering cost: cost involved with
designing the application
• Unit cost - cost of manufacturing/purchasing a single system
• Volume - number of units
• Total cost = NRE + unit cost * volume
• RC is typically more cost effective for low volume applications
• RC: low NRE, high unit cost
• ASIC: very high NRE, low unit cost
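
The cost formula above can be turned into a quick break-even calculation. In the Python sketch below, every number is a made-up placeholder chosen only to show the shape of the trade-off (low NRE and high unit cost for RC, the reverse for ASIC); the function names are likewise hypothetical.

```python
def total_cost(nre, unit_cost, volume):
    """Total cost = NRE + unit cost * volume."""
    return nre + unit_cost * volume


def break_even_volume(rc_nre, rc_unit, asic_nre, asic_unit):
    """Volume at which both options cost the same.

    Assumes the RC unit cost is higher than the ASIC unit cost.
    """
    return (asic_nre - rc_nre) / (rc_unit - asic_unit)


# Hypothetical figures, for illustration only.
RC_NRE, RC_UNIT = 50_000, 100
ASIC_NRE, ASIC_UNIT = 2_000_000, 10

low_volume_rc = total_cost(RC_NRE, RC_UNIT, 1_000)
low_volume_asic = total_cost(ASIC_NRE, ASIC_UNIT, 1_000)
high_volume_rc = total_cost(RC_NRE, RC_UNIT, 100_000)
high_volume_asic = total_cost(ASIC_NRE, ASIC_UNIT, 100_000)
crossover = break_even_volume(RC_NRE, RC_UNIT, ASIC_NRE, ASIC_UNIT)
```

With these placeholder figures RC is cheaper at 1,000 units and the ASIC is cheaper at 100,000 units, with the crossover a little under 22,000 units.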



2. When time to market is critical
– Huge effect on total revenue
3. When the circuit may have to be modified
– Can’t change an ASIC (fixed hardware)
– Can change the circuit implemented in an FPGA
Uses
– When standards change
• Codec changes after devices are fabricated
– Allows addition of new features to existing devices
– Fault tolerance/recovery
– “Partial reconfiguration” allows a virtual device of arbitrary size, analogous to
virtual memory
Without RC
– Anything that may have to be reconfigured is implemented in software
• Performance loss



Reconfigurable Computing
• Reconfigurable computing (RC) is the study of architectures that can
adapt (after fabrication) to a specific application or application domain
– Involves architecture, tools, CAD, design automation, algorithms,
languages, etc.
• Reconfigurable computing can be defined as the study of computations
involving reconfigurable devices.
• The spatial structure of the device is modified so as to use the best
computing approach to speed up that application
• For a new application, the device structure is modified again to
match the new application



• Alternatively, RC is a way of implementing circuits without fabricating a device
• Essentially allows circuits to be implemented as “software”. Circuits are no longer
synonymous with hardware
• RC devices are programmable by downloading bits, just like microprocessors
• Difference is that microprocessor bits specify instructions, whereas RC bits
specify circuit structures
[Figure: Microprocessor binaries (e.g. 001010010) are bits loaded into program memory and run by the processor; FPGA binaries (bitfile, e.g. 001010010) are bits loaded into logic blocks, switch matrices, memories, etc. to form a circuit]



Some Fields of Application

• Rapid prototyping
• Post-fabrication customization
• Multi-modal computing tasks
• Adaptive computing systems
• Fault tolerance
• High performance parallel computing



Rapid prototyping
• Testing hardware in real conditions
before fabrication
• Software simulation
– Relatively inexpensive
– Slow
– Accuracy
• Hardware emulation
– Hardware testing under real
operation conditions
– Fast
– Accurate
– Allows several iterations

[Figures: APTIX System Explorer; ITALTEL FLEXBENCH]



In-System customization
• Time to market advantage
o Ship the first version of a product
o Remote upgrading with new product
versions
o Remote repairing
• Mars rover vehicle (Mars Pathfinder, landed
4th July 1997)



Multi-modal computation
Systems that handle many different
types of inputs. A control unit
handles them in a time-multiplexed manner.
• Mobile phones
• Built-in digital camera
• Games
• Internet, navigation system
• Emergency diagnostics
• Different standard protocols
• Monitoring
• Entertainment

[Figure: video service request, phone service, configuration]



Adaptive computing systems

• Computing systems that are able to
adapt their behavior and structure to
changing operating and
environmental conditions, time-varying
optimization objectives, and
physical constraints like changing
protocols, new standards, or
dynamically changing operation
conditions of technical systems.
• Dynamic adaptation to environment
and threats for extended mission
capabilities



Fault Tolerance

• FPGAs are sensitive to SEUs (single event upsets) and SETs (single event
transients) caused by electromagnetic noise and radiation. Because the
configuration memory of the chip can be affected, an upset can result in a
permanent error.
• Particularly in space applications, cosmic rays can hit silicon surfaces, causing
high-density electron-hole pairs which may lead to transient errors
• Requires duplication or triplication of resources for combinational
logic and parity checks for on-chip caches
• Triple Modular Redundancy (TMR) with a voter circuit is a common
approach. Three identical hardware modules perform their
operations in parallel and their outputs are voted on.
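
A bitwise majority voter is the core of TMR and is easy to sketch in Python. The `tmr` wrapper and its `fault_mask` parameter are hypothetical test scaffolding that flips bits in one replica's output to imitate an upset; real designs place the voter in hardware.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(module, inputs, fault_mask=0):
    """Run three copies of a module and vote on their outputs.

    fault_mask flips bits in the first copy's output to imitate an upset.
    """
    outputs = [module(inputs) for _ in range(3)]
    outputs[0] ^= fault_mask
    return majority_vote(*outputs)


def increment(value):
    return value + 1


clean_result = tmr(increment, 41)
masked_result = tmr(increment, 41, fault_mask=0b100)
```

A fault in a single replica is outvoted by the other two, so both calls return the same value; faults in two replicas at the same bit position would not be masked.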



GPU vs FPGA

• Raw compute power: Xilinx research shows that the Tesla P40 (40 INT8
TOP/s) and the UltraScale+ XCVU13P FPGA (38.3 INT8 TOP/s) have almost
the same compute power. The XCVU13P's large amount of on-chip cache
memory reduces memory bottlenecks, and FPGAs are flexible in
supporting the full range of data type precisions, e.g., INT8, FP32,
binary and any other custom data type
• Efficiency and power: an image classification project showed that the Arria
10 FPGA performs almost 10 times better in power consumption. Xilinx
showed that the Xilinx Virtex UltraScale+ performs almost four times
better than the NVIDIA Tesla V100 in general purpose compute efficiency.



Analog Domain

• FPAA – Field Programmable Analog Array
• Consists of Configurable Analog Blocks (CABs)
• CABs contain operational amplifiers, programmable capacitor
arrays (PCAs), or programmable resistor arrays (PRAs) for
continuous time circuits, or configurable switches for switched
capacitor circuits
• This means they can operate in one of two modes: continuous
time and discrete time.
• Example: AN221E04 device by Anadigm



Some papers on FPGA Applications

• FPGAs in Industrial Control Applications, Eric Monmasson et al.
• Recent Trends in FPGA Architectures and Applications, Philip
H.W. Leong
• Why Compete When You Can Work Together: FPGA-ASIC
Integration for Persistent RNNs, Eriko Nurvitadhi et al.
