
UNIT 3

BY

Prof. K. R. Saraf
Qn. Compare the architectures and capabilities of ASIC, PDSP, GPP,
FPGA, and memory.
Programmable Array Logic
• One of the first successful families of PLDs was programmable array
logic (PAL) components
• These components were an evolution of earlier PLDs, but were
simpler to use in many applications.
• A simple representative component in the family is the PAL16L8,
whose circuit is shown in Figure below.
• The component has 10 pins that are inputs, 2 pins that are outputs,
and 6 pins that are both inputs and outputs. This gives a total of 16
inputs and 8 outputs (hence the name “16L8”).
• The symbol at each input in Figure represents a gate that is a
combination of a buffer and an inverter.
• Thus, the vertical signals carry all of the input signals and their
negations.
• The area in the dashed box is the programmable AND array of the
PAL.
• Each horizontal signal in the array represents a p-term of the inputs,
as suggested by the AND-gate symbol at the end of the line.
• A p-term, or product term, is the logical AND of a number of signals.
• In the unprogrammed state, there is a wire called a fusible link, or
fuse, at each intersection of a vertical and horizontal signal wire,
connecting those signal wires.
• The PAL component can be programmed by blowing some of the
fuses to break their connections, and leaving other fuses intact.
• This is done by a special programming instrument before the
component is inserted into the final system.
• We draw an X at the intersection of a vertical and a horizontal signal
to represent an intact fuse. An intersection without an X means
that the intersecting signals are not connected.
• So, for example, the horizontal signal numbered 0 has connections
to the vertical signals numbered 24 and 31, which are the signals I8
and …
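The fuse-map idea above can be sketched in software. This is a minimal illustrative model, not the actual PAL16L8: it uses a tiny two-input array, and the fuse-row encoding is invented for the example.

```python
# A sketch of PAL-style programmable logic. Each row of 'fuses' marks
# which input literals remain connected (True = intact fuse); a p-term
# is the AND of its connected literals, and the active-low output is
# the AND-OR-INVERT of the p-terms.

def pterm(fuses, literals):
    """AND together every literal whose fuse is intact."""
    return all(lit for fuse, lit in zip(fuses, literals) if fuse)

def pal_output(fuse_rows, inputs):
    """AND-OR-INVERT: OR the p-terms, then invert (active-low output)."""
    # The vertical wires carry each input followed by its negation.
    literals = []
    for x in inputs:
        literals += [x, not x]
    return not any(pterm(row, literals) for row in fuse_rows)

# Two p-terms over inputs (a, b): a AND ~b, and ~a AND b (an XOR),
# so the active-low output implements XNOR.
rows = [
    [True, False, False, True],   # a AND ~b
    [False, True, True, False],   # ~a AND b
]
print(pal_output(rows, (True, True)))   # True  (inputs equal)
print(pal_output(rows, (True, False)))  # False (inputs differ)
```

Blowing a fuse corresponds to setting an entry to `False`, disconnecting that literal from the p-term.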
• Some of the p-terms are connected to the enable control signals for
the inverting tristate output drivers. Others are connected to the 7-
input OR gates. So, for each output, we can form the AND-OR-
INVERT function of the inputs, with up to 7 p-terms involved.
• The figure gives, as an example, the function of input I10 implemented
by output O1, together with its output-enable condition.

COMPLEX PLDS
• A further evolution of PLDs, tracking advances in integrated circuit
technology, led to the development of so-called complex
programmable logic devices (CPLDs).

• We can think of a CPLD as incorporating multiple PAL structures, all
interconnected by a programmable network of wires, as shown in
Figure below. (This gives a general idea of CPLD organization. The
actual organization varies between components provided by
different manufacturers.)
• Each of the PAL structures consists of an AND array and a number of
embedded macro cells (M/Cs in the figure).
Figure: - The internal organization of a CPLD
• The macro cells contain OR gates, multiplexers and flip-flops,
allowing choice among combinational or registered connections to
other elements within the component, with or without logical
negation, choice of initialization for flip-flops, and so on.
• They are essentially expanded forms of the simple macro cell shown
in Figure, but without the direct connections to external pins.
• Instead, the external pins are connected to an I/O block, which
allows selection among macro cell outputs to drive each pin.
• The network interconnecting the PAL structures allows each PAL to
use feed-back connections from other PALs as well as inputs from
external pins.
• As well as providing more circuit resources than simple PLDs,
modern CPLDs are typically programmed differently.
• Rather than using EPROM-like technology, they use SRAM cells to
store configuration bits that control connections in the AND-OR
arrays and the select inputs of multiplexers.
• Configuration data is stored in nonvolatile flash RAM within the
CPLD chip, and is transferred into the SRAM when power is applied.
• Separate pins are provided on the chip for writing to the flash RAM,
even while the chip is connected in the final system.
• Thus, designs using CPLDs can be upgraded by reprogramming the
configuration information.
• Manufacturers provide a range of CPLDs, varying in the number of
internal PAL structures and input/output pins.
• A large CPLD may contain the equivalent of tens of thousands of
gates and hundreds of flip-flops, allowing for implementation of
quite complex circuits.
• Whereas it might be feasible to manually determine the
programming for a simple PLD, it would be quite intractable to do so
for a CPLD.
• Hence, we would use CAD tools to synthesize a design from an HDL
model and to map the design to the resources provided by a CPLD.
Field-Programmable Gate Arrays (FPGAs)
• As we saw in the last section, manufacturers were able to provide
larger programmable implementation fabrics by replicating the basic
PAL structure on a chip.
• However, there is a limit to how far this structure can be expanded.
For large designs, mapping the circuit onto CPLD resources becomes
very difficult and results in inefficient use of the resources provided
by the chip.
• For this reason, manufacturers turned to an alternative programmable
circuit structure, based on smaller programmable cells to implement
logic and storage functions, combined with an interconnection
network whose connections could be programmed.
• They named such structures field-programmable gate arrays
(FPGAs), since they could be thought of as arrays of gates whose
interconnection could be programmed “in the field,” away from the
factory where the chips were made.
• Given the relative complexity of the components, it was not
expected that designers would implement circuits for FPGAs
manually. Instead, manufacturers provided CAD tools to allow
designs expressed in an HDL to be synthesized, mapped, placed and
routed automatically, though with designer intervention if necessary.


• Since their introduction, FPGAs have grown in capacity and
performance, and are now one of the main implementation fabrics
for designs, particularly where product volumes do not warrant
custom integrated circuits.
• Most FPGAs available today are organized along the lines shown in
Figure 1 below.
• They include an array of logic blocks that can be programmed to
implement simple combinational or sequential logic functions;
input/output (I/O) blocks that can be programmed to be registered
or nonregistered, as well as implementing various specifications for
voltage levels, loading and timing; embedded RAM blocks; and a
programmable interconnection network.
• The more recent FPGAs also include special circuits for clock
generation and distribution.

• The specific organization, as well as the names used for the blocks,
varies between manufacturers and FPGA families.
Figure 1: - The internal organization of an FPGA consisting of logic blocks
(LB), input/output blocks (IO), embedded RAM blocks (RAM) and
programmable interconnections (shown in gray).
• In many FPGA components, the basic elements within logic blocks
are small 1-bit-wide asynchronous RAMs called lookup tables
(LUTs).
• The LUT address inputs are connected to the inputs of the logic
block.
• The content of an LUT determines the values of a Boolean function
of the inputs, in much the same way as we discussed previously.
• By programming the LUT content differently, we can implement any
Boolean function of the inputs.
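The LUT behaviour described above can be sketched as a small table addressed by the inputs. The bit ordering of the address below is an assumption for illustration; real devices differ by vendor and family.

```python
# A sketch of a 4-input LUT: a 16-entry truth table whose address is
# formed from the inputs, so reading the table evaluates the function.

def make_lut(func):
    """'Program' a LUT by tabulating an arbitrary 4-input Boolean function."""
    table = []
    for addr in range(16):
        bits = [(addr >> k) & 1 for k in range(4)]  # i0..i3
        table.append(func(*bits))
    return table

def lut_read(table, i0, i1, i2, i3):
    """Evaluate the LUT: the inputs simply address the stored table."""
    return table[i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)]

# Program the LUT as a 4-input majority function.
maj = make_lut(lambda a, b, c, d: int(a + b + c + d >= 3))
print(lut_read(maj, 1, 1, 1, 0))  # 1 (three of four inputs high)
print(lut_read(maj, 1, 0, 0, 1))  # 0
```

Reprogramming the table contents, not the wiring, is what changes the implemented function.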
• The logic blocks also contain one or more flip-flops and various
multiplexers and other logic for selecting data sources and for
connecting data to adjacent logic blocks.
• As an illustration, Figure 2 below shows the circuit for a slice within a
logic block of a Xilinx Spartan-II FPGA. The logic block contains two
such slices, together with a small amount of additional logic.
• Each slice consists of two 4-input LUTs, each of which can be
programmed to implement any function of the four inputs.
• The carry and control logic consists of circuitry to combine the LUT
outputs, an XOR gate and an AND gate for implementing adders and
multipliers, as well as multiplexers that can be used to implement a
fast carry chain.
• Additional components, not shown in the figure, allow programming
for various signals to be negated.
• A number of the connections within the control and carry logic are
governed by the programming of the FPGA.
• The logic block contains SRAM cells for these programming bits.
• In contrast to LUT-based logic blocks, which can implement relatively
complex functions, some FPGAs have more fine-grained logic blocks.
• For example, the logic block of Actel ProASIC3 FPGAs contains just
enough gates, multiplexers and switches to implement
combinational functions of three inputs, or a flip-flop with set or
reset.
• Since each logic block is smaller and simpler, CAD software that maps
a design into the FPGA resources may find it easier to perform its
task without leaving parts of logic blocks unused.
• However, a given design will require more logic blocks, and
consequently denser interconnection between them.
• This may make the place and route software’s task more difficult.
Figure 2: - The circuit of a slice of a Xilinx Spartan-II FPGA logic block
Multi-context FPGA
• Dynamically-programmable gate arrays (DPGAs) provide more cost-
effective implementations than conventional FPGAs, where
hardware resources are dedicated to a single context.
• A DPGA can be sequentially configured as different processors in
real time, and efficiently reuses the limited hardware resources in
time.
• One typical DPGA architecture is the multi-context architecture.
• Multi-context FPGAs (MC-FPGAs) have multiple memory bits per
configuration bit, forming configuration planes for fast switching
between contexts.
• However, the additional memory planes cause significant overhead
in area and power consumption.
• Figure 1 below shows the overall structure of an MC-FPGA. Each cell
consists of a programmable logic block and a programmable switch
block.
• Figure 2 below shows the structure of a conventional multi-context
switch.
• The switch has multiple memory bits for multi-contexts, and its
contexts are selected from the memory bits according to a context
ID.
• In the conventional approach, each switch requires n bits to store n
contexts.
• Most previous works on DPGAs reduce the overhead using device-
level solutions.
• That is, compact memory devices such as DRAM and FeRAM were
used to store configuration data.
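The conventional multi-context switch described above, with n memory bits selected by a broadcast context ID, can be sketched as a simple behavioural model (the list-backed class below is an illustration, not any vendor's design):

```python
# Sketch of one multi-context configuration bit: n memory bits store n
# contexts, and the broadcast context ID selects which one drives the
# switch, so a context switch needs no reprogramming.

class MultiContextBit:
    def __init__(self, n_contexts):
        self.planes = [0] * n_contexts  # one memory bit per context

    def program(self, context_id, value):
        self.planes[context_id] = value

    def read(self, context_id):
        # Switching contexts = just changing the broadcast ID.
        return self.planes[context_id]

bit = MultiContextBit(4)
bit.program(0, 1)   # switch closed in context 0
bit.program(2, 1)   # switch closed in context 2
print([bit.read(c) for c in range(4)])  # [1, 0, 1, 0]
```

The area and power overhead the text mentions corresponds to the `planes` list: n storage cells where a conventional FPGA needs one.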
Programmable digital signal processors (PDSPs)
• Programmable digital signal processors (PDSPs) are general-purpose
microprocessors designed specifically for digital signal processing
(DSP) applications.

• They contain special instructions and special architecture support so
as to execute computation-intensive DSP algorithms more efficiently.
• PDSPs are designed mainly for embedded DSP applications.
• As such, the user may never realize the existence of a PDSP in an
information appliance.
• Important applications of PDSPs include modems, hard drive
controllers, cellular phone data pumps, set-top boxes, etc.
• The categorization of PDSPs falls between the general-purpose
microprocessor and the custom-designed, dedicated chip set.
• The former have the advantage of ease of programming and
development.
• However, they often suffer from disappointing performance for DSP
applications due to overheads incurred in both the architecture and
the instruction set.
• Dedicated chip sets, on the other hand, lack the flexibility of
programming.
• The time-to-market delay due to chip development may be longer
than the time needed to code programs for programmable devices.
Common Features of PDSPs
• 1. Harvard Architecture
• A key feature of PDSPs is the adoption of a Harvard memory
architecture that contains separate program and data memory so as
to allow simultaneous instruction fetch and data access.
• This is different from the conventional von Neumann architecture,
where program and data are stored in the same memory space.
• 2. Dedicated Address Generator
• The address generator allows rapid access of data with complex data
arrangement without interfering with the pipelined execution of the
main ALUs (arithmetic and logic units).
Common Features of PDSPs
• This is useful for situations such as two-dimensional (2D) digital
filtering and motion estimation.
• Some address generators may include bit-reversal address
calculation to support efficient implementation of the FFT, and
circular buffer addressing for the implementation of infinite impulse
response (IIR) digital filters.
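The bit-reversal address calculation mentioned above can be sketched in software; the function below generates the address sequence an FFT-oriented address generator would produce in hardware:

```python
# Sketch of bit-reversed addressing for an N-point FFT (N a power of two):
# the address generator reverses the low log2(N) bits of each index.

def bit_reverse(index, n_bits):
    """Reverse the low n_bits of index, as FFT data reordering requires."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result

# For an 8-point FFT the generator visits the data in this order:
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A dedicated address generator produces this sequence in parallel with the ALU's butterfly computations, which is why the text notes it does not interfere with the pipeline.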

• 3. High-Bandwidth Memory and I/O Controller
• To meet the intensive input and output demands of most signal
processing applications, several PDSPs have built-in multiple DMA
channels and dedicated DMA buses to handle data I/O
without interfering with CPU operations.
• To maximize data I/O efficiency, some modern PDSPs even include
dedicated video and audio codecs (coders/decoders) as well as high-
speed serial/parallel communication ports.

• 4. Data Parallelism
• A number of important DSP applications exhibit a high degree of data
parallelism that can be exploited to accelerate the computation.
• As a result, several parallel processing schemes, such as SIMD
(Single Instruction Multiple Data) and MIMD (Multiple Instruction
Multiple Data) architectures, have been incorporated into PDSPs.
• For example, many multimedia-enhanced instruction sets in general-
purpose microprocessors (e.g. MMX) employed subword parallelism
to speed up the execution.
• It is basically a SIMD approach. A number of PDSPs also facilitate
MIMD implementation by providing multiple interprocessor
communication links.
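Subword parallelism can be illustrated in ordinary integer code. The sketch below adds four packed 8-bit lanes with one 32-bit addition plus masking, the same effect an MMX-style packed-add instruction achieves in hardware; the masking scheme here is a standard software (SWAR) idiom, shown only for illustration.

```python
# Sketch of subword parallelism: four 8-bit lanes packed in one 32-bit
# word are added together, with masking preventing carries from
# rippling between lanes.

def packed_add8(x, y):
    """Add four 8-bit lanes of x and y with wraparound, no cross-lane carry."""
    # Add the low 7 bits of each lane, then fold each lane's top bit in
    # with XOR so a carry can never spill into the next lane.
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
    return low ^ ((x ^ y) & 0x80808080)

a = 0x01FF1080  # lanes: 0x01, 0xFF, 0x10, 0x80
b = 0x01020304  # lanes: 0x01, 0x02, 0x03, 0x04
print(hex(packed_add8(a, b)))  # 0x2011384  (lanes 0x02, 0x01, 0x13, 0x84)
```

Note the second lane: 0xFF + 0x02 wraps to 0x01 within its own lane instead of carrying into the first lane, which is exactly the SIMD semantics.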
Applications of PDSP
• Communications systems
• Multimedia: audio signal processing, image/video processing,
printing, SAR image processing, biometric information processing
• Control and data acquisition
• DSP applications of hardware-programmable PDSPs
VLIW
• Very long instruction word, or VLIW, refers to a processor architecture
designed to take advantage of instruction-level parallelism.
• An instruction of a VLIW processor consists of multiple independent
operations grouped together.
• There are multiple independent functional units in the VLIW processor
architecture.
• Each operation in the instruction is aligned to a functional unit.
• All functional units share the use of a common large register file.
• This type of processor architecture is intended to allow higher
performance without the inherent complexity of some other
approaches.
Different Approaches in VLIW
• Pipelining: - Breaking up instructions into sub-steps so that
instructions can be executed partially at the same time
• Superscalar architectures: - Dispatching individual instructions to be
executed completely independently in different parts of the
processor
• Out-of-order execution: - Executing instructions in an order different
from the program
• Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously.
• The overlap among instructions is called instruction-level
parallelism.
• Ordinary programs are typically written under a sequential
execution model where instructions execute one after the other and
in the order specified by the programmer.
• The goal of compiler and processor designers implementing ILP is to
identify and take advantage of as much ILP as possible.
• What is ILP?
• Consider the following program:
• op1 e = a + b
• op2 f = c + d
• op3 m = e * f
• Operation 3 depends on the results of operations 1 and 2, so it
cannot be calculated until both of them are completed.
• However, operations 1 and 2 do not depend on any other operation,
so they can be calculated simultaneously.
• If we assume that each operation can be completed in one unit of
time, then these three instructions can be completed in a total of
two units of time, giving an ILP of 3/2.
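The ILP calculation above can be reproduced with a tiny scheduler that assigns each operation the earliest cycle after its operands are ready, assuming unit latency for every operation (a simplified sketch of what a static VLIW scheduler computes):

```python
# Sketch of unit-latency list scheduling for the op1/op2/op3 example:
# each operation finishes one cycle after all of its operands are ready.

def schedule(ops):
    """ops: {name: list of names it depends on}; returns finish cycles.
    Assumes ops are listed in program order, so dependencies come first."""
    finish = {}
    for name, deps in ops.items():
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + 1
    return finish

prog = {"op1": [], "op2": [], "op3": ["op1", "op2"]}
cycles = schedule(prog)
span = max(cycles.values())
print(cycles)               # op1 and op2 in cycle 1, op3 in cycle 2
print(len(prog) / span)     # 1.5  (ILP = 3 operations / 2 cycles)
```

op1 and op2 land in the same cycle because neither depends on the other, which is precisely the parallelism a VLIW compiler packs into one instruction word.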

• VLIW Compiler
• The compiler is responsible for static scheduling of instructions in a
VLIW processor.
• The compiler finds out which operations can be executed in parallel in
the program.
• It groups together these operations into a single instruction, which is
the very long instruction word.
• The compiler ensures that an operation is not issued before its operands
are ready.
• VLIW Instruction
• One VLIW instruction word encodes multiple operations, which
allows them to be initiated in a single clock cycle.
• The operands and the operation to be performed by the various
functional units are specified in the instruction itself.
• One instruction encodes at least one operation for each execution
unit of the device.
• So the length of the instruction increases with the number of execution
units.
• To accommodate these operation fields, VLIW instructions are
usually at least 64 bits wide, and on some architectures are much
wider, up to 1024 bits.
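The one-slot-per-unit structure described above can be sketched as below. The slot names and the tuple encoding are invented for illustration, not any real architecture's instruction format:

```python
# Sketch of a VLIW instruction word: one operation slot per functional
# unit, all issued together; empty slots become NOPs, which is the
# 'unfilled opcode' waste the disadvantages section mentions.

SLOTS = ["alu0", "alu1", "mul", "load_store"]  # hypothetical units
NOP = ("nop",)

def pack(ops):
    """Build one long instruction from ops: {slot name -> operation tuple}."""
    return tuple(ops.get(slot, NOP) for slot in SLOTS)

# Pack the two independent additions from the ILP example into one word.
word = pack({"alu0": ("add", "e", "a", "b"),
             "alu1": ("add", "f", "c", "d")})
print(word)
# (('add', 'e', 'a', 'b'), ('add', 'f', 'c', 'd'), ('nop',), ('nop',))
```

The word's length grows with `len(SLOTS)`, matching the point above that instruction width scales with the number of execution units.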
Block Diagram
Figure: - Conceptual instruction execution
Working
• Long instruction words are fetched from the memory.
• A common multi-ported register file is used for fetching the operands
and storing the results.
• Parallel random access to the register file is possible through the
read/write crossbar.
• Execution in the functional units is carried out concurrently with the
load/store operation of data between RAM and the register file.
• There are one or multiple register files for FX and FP data.
• The design relies on the compiler to find parallelism and schedule
dependency-free program code.
Advantages of VLIW
• Dependencies are determined by the compiler and used to schedule
according to functional unit latencies.
• Functional units are assigned by the compiler and correspond to the
position within the instruction packet.
• Reduces hardware complexity.
• Tasks such as decoding, data dependency detection, instruction
issue, etc. become simple.
• Ensures a potentially higher clock rate.
• Ensures low power consumption.

Disadvantages of VLIW
• Higher complexity of the compiler.
• Compatibility across implementations: compiler optimization needs
to consider technology-dependent parameters such as latencies and
load-use time of the cache.
• Unscheduled events (e.g. cache misses) stall the entire processor.
• Code density: in the case of unfilled opcodes in a VLIW, memory space
and instruction bandwidth are wasted, i.e. low slot utilization.
• Code expansion: causes high power consumption.

Applications of VLIW
• VLIW architecture is suitable for digital signal processing
applications.
• Processing of media data, such as compression/decompression of
image and speech data.
Qn. Write a short note on RALU
• The RALU (Reconfigurable ALU) detects an instruction which can be
dynamically optimized and then changes the connection matrix
between mini computing units.
• Finally it executes the instruction.
• The multiplication and addition operations are detected as the
instructions that can be dynamically optimized.
• If the instruction to be executed is "multiplication", Mini Computing
(MC) units are dynamically connected to create a circuit equivalent to
a "Ripple Carry Array Multiplier".
• If the instruction to be executed is "addition", the MC units are
connected to create a circuit equivalent to a "Ripple Carry Adder".
• The figure below depicts the components of the RALU.

• The input signals to the RALU are processed by the "Input Feeder"
before feeding them to the Data Preselector (DP) units.
• If the operation is addition, the individual bits of the two numbers to
be added are fed directly to the DP units. If the operation is
multiplication, then a logical AND is performed between individual
bits of the first number and individual bits of the second number,
and the result is fed to the DP units.
• This is because, in binary, bit-by-bit multiplication is equivalent to the
logical AND operation. Further, in standard long multiplication, we
multiply the multiplicand by each digit of the multiplier and then add
up all the properly shifted results. The DP units are used to select one
of the outputs from the "Input Feeder" based on the instruction and
feed the MC units.
• The MC units behave as a "Full Adder" with the special capability to
change the connections between units.
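The reduction of multiplication to AND gates plus shifted additions, which motivates the RALU's two circuit configurations, can be sketched as follows (a behavioural illustration, not the RALU's actual gate-level structure):

```python
# Sketch of why an array multiplier is built from ANDs and adders:
# each partial product is the multiplicand ANDed with one multiplier
# bit, shifted to its digit position, then everything is summed.

def multiply_via_and(a, b, n_bits=4):
    """Long multiplication using only bitwise AND and addition."""
    result = 0
    for i in range(n_bits):
        bit = (b >> i) & 1
        partial = a if bit else 0   # AND of every bit of a with one bit of b
        result += partial << i      # the adder chain sums shifted partials
    return result

print(multiply_via_and(6, 5))  # 30
```

In the RALU, the ANDs happen in the Input Feeder and the shifted additions are carried out by the MC units wired as a ripple-carry array; for plain addition the same MC units are rewired as a single ripple-carry adder.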
• The Data Feeder (DF) changes the connections between MC units
based on the processor instruction. Essentially this switches
between the "Ripple Carry Array Multiplier" and the "Ripple Carry
Adder". The Simple Data Feeder (SDF) functionality is equivalent to
the DF, but it does not contain some of the control signals, since this
is the final stage of the RALU's connection matrix.
Qn. Limitations of current FPGAs
• Disadvantages of FPGA:
• The programming of an FPGA requires knowledge of VHDL/Verilog
programming languages as well as digital system fundamentals.
The programming is not as simple as the C programming used in
processor-based hardware. Moreover, engineers need to learn the
use of simulation tools.
• The power consumption is higher, and programmers do not have any
control over power optimization in an FPGA. There are no such issues
in an ASIC.
• Once any particular FPGA is selected and used in the design,
programmers need to make use of the resources available on the FPGA
IC. This will limit the design size and features. To avoid such a
situation, an appropriate FPGA needs to be chosen at the beginning
itself.
• FPGAs are better for prototyping and low-quantity production. When
the quantity of FPGAs to be manufactured increases, the cost per
product also increases. This is not the case with an ASIC
implementation.
Matrix concepts
• Matrix is designed to maintain flexibility in instruction control.
• Matrix is based on a uniform array of primitive elements and
interconnect which can serve instruction control and data functions.
• The key to providing this flexibility is a multilevel configuration
scheme which allows the device to control the way it delivers
configuration information.
• Matrix architecture
• The Matrix microarchitecture is based around an array of identical 8-bit
primitive datapath elements overlaid with a configurable network.
Matrix – Basic Function Unit
1) 256×8 memory: functions as a single 256-byte, dual-ported memory or
as a 128×8-bit register file. In register file mode the memory supports two
reads and one write operation on each cycle.
2) 8-bit ALU: a set of arithmetic and logic functions
3) Control logic: composed of
1) a local pattern matcher for generating local control from the
ALU output
2) a reduction network for generating local control
3) a 20-input, 8-output NOR block which can serve as half of a PLA
• MATRIX operation
• Matrix operation is pipelined at the BFU level, with a pipeline register
at each BFU input port.
• Pipeline stages include:
I. Memory read
II. ALU operation
III. Memory write and local interconnect traversal (the two operations
proceed in parallel)
BFU roles:
- I-store (instruction store)
- Data memory
- ALU function
Matrix network
• A collection of 8-bit buses
• Dynamically switched network connections
1. Nearest-neighbour connections: connections between a BFU and two
grid squares
2. Length-four bypass connections: each BFU supports level-two
connections, which allow corner turns, local fanout, medium-distance
interconnect, data shifting and retiming
3. Global lines: every row and column supports four interconnect lines
which span the entire row or column.
MATRIX example
• Finite Impulse Response filter
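The FIR filter is a natural MATRIX example because each tap is a multiply feeding an accumulating add, one operation per BFU. As a plain software reference for the computation being mapped:

```python
# Reference FIR computation: y[n] = sum over k of coeffs[k] * x[n - k].
# On MATRIX, each coefficient multiply and each accumulate would be
# assigned to a BFU, with the interconnect carrying the running sum.

def fir(samples, coeffs):
    """Compute the FIR output over the indices where all taps are valid."""
    taps = len(coeffs)
    out = []
    for n in range(taps - 1, len(samples)):
        acc = 0
        for k in range(taps):
            acc += coeffs[k] * samples[n - k]  # one tap: multiply-accumulate
        out.append(acc)
    return out

print(fir([1, 2, 3, 4], [1, 1]))  # [3, 5, 7]  (two-tap moving sum)
```

The inner multiply-accumulate loop is exactly the dataflow that gets spatially unrolled across the BFU array.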
Dynamically Programmable Gate Arrays
with Input Registers
• We must hold the value on the output and tie up switches and wires
between the producer and the consumer until such time as the final
consumer has used the value.
• Switches and wires are forced to sit idle holding values for much longer
than the time needed to transport them.
• The alternative is to move the value registers to the inputs of the
computational elements.
• These input registers allow us to store values which need to traverse LUT
evaluation levels in memories, rather than having them consume active
resources during the period of time in which they are being retimed.
Input Registers
• This means having four flip-flops on the input of each 4-LUT rather
than one flip-flop on the output.
• This modification allows us to move the data from the producer to the
consumer in the minimum transit time, a time independent of
when the consumer will actually use the data.
• Conceptually, the key idea here is that signal transport and retiming
are two different functions:
• Spatial transport: moves data in space; routes data from source to
destination
• Temporal transport (retiming): moves data in time; makes data
available at some later time, when it is actually required
TSFPGA
• TSFPGA was developed jointly by Derrick Chen and André DeHon.
Derrick worked out the VLSI implementation and layout issues, while
André developed the architecture and mapping tools.
• Why TSFPGA?
• If all retiming can be done in input registers, only a single wire is
strictly needed to successfully route the task.
• Extends the temporal range on the inputs without the linear increase
in input retiming size
• The trick we employ here is to have each logical input load its value
from the active interconnect at just the right time
Why TSFPGA?
• If we broadcast the current time step, each input can simply load its
value when its programmed load time matches the current time
step.
• Architecture of TSFPGA
Building elements:
• The basic TSFPGA building block is the subarray tile, which contains
a collection of LUTs and a central switching crossbar.
• Array elements
• Crossbar
• Switching elements
• Array Element
• Figure: - TSFPGA Array Composition
TSFPGA Array Composition
• The TSFPGA array element is made up of a number of LUTs which
share the same crossbar outputs and input.

• The LUT input values are stored in time-switched input registers.

• The inputs to the array element are run to all LUT input registers.
When the current time step matches the programmed load time, the
input register is enabled to load the value on the array-element
input.
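The time-matched loading described above can be sketched as a behavioural model; the class below is an illustration, with the programmed load-time comparison standing in for the broadcast time-step match:

```python
# Sketch of a TSFPGA time-switched input register: the register loads
# from the shared wire only when the broadcast time step equals its
# programmed load time, then holds the value for the consuming LUT.

class TimedInputRegister:
    def __init__(self, load_time):
        self.load_time = load_time  # set at configuration time
        self.value = None

    def clock(self, wire_value, current_step):
        if current_step == self.load_time:
            self.value = wire_value  # capture from the active interconnect

# Two LUT inputs share one wire but load on different time steps,
# so a single wire serves multiple logical connections.
reg_a, reg_b = TimedInputRegister(0), TimedInputRegister(2)
wire = ["x", "y", "z"]  # value on the shared wire at steps 0, 1, 2
for step, v in enumerate(wire):
    reg_a.clock(v, step)
    reg_b.clock(v, step)
print(reg_a.value, reg_b.value)  # x z
```

This is the sense in which retiming moves into memories: each register holds its value until its LUT needs it, instead of tying up wires and switches.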
Crossbar
• Each crossbar input is selected from a collection of subarray
network inputs and subarray LUT outputs via a pre-crossbar
multiplexor.
• Subarray inputs are registered prior to the pre-crossbar multiplexor,
and outputs are registered immediately after the crossbar, either on
the LUT inputs or before traversing network wires.
• This pipelining makes the LUT evaluation and crossbar traversal a
single pipeline stage.
• Each registered crossbar output is routed in several directions to
provide connections to other subarrays or chip I/O.
• The single subarray crossbar performs all major switching roles:
• output crossbar: routing data from LUT outputs to destinations or
intermediate switching crossbars
• routing crossbar: routing data through the network between source
and destination subarrays
• input crossbar: receiving data from the network and routing it to
the appropriate destination LUT input
Intra-Subarray Switching
• Communication within the subarray is simple and takes one clock
cycle per LUT evaluation and interconnect.
• Once a LUT has all of its inputs loaded, the LUT output can be
selected as an input to the crossbar, and the LUT's consumers within
the subarray may be selected as crossbar outputs.
• A number of subarray outputs are run to each subarray in the same
row and column.
