
Savitribai Phule PUNE UNIVERSITY

Reconfigurable Computing (504203)

Module I Notes

INTRODUCTION
Computing architectures can be categorized into three main groups according to their degree of flexibility:
i. The general purpose computing group that is based on the Von Neumann (VN) computing paradigm;
ii. Domain-specific processors, tailored for a class of applications that share a common set of
characteristics;
iii. Application-specific processors tailored for only one application.

1. General Purpose Computing


General-purpose computers are based on the Von Neumann (VN) architecture, according to which a computer can have a
simple, fixed structure, able to execute any kind of computation, given a properly programmed control, without the need
for hardware modification.
The general structure of a VN machine is shown in the figure below:

It consists of:

A memory for storing program and data. Harvard architectures contain two parallel accessible memories for
storing program and data separately.
A control unit (also called control path) featuring a program counter that holds the address of the next instruction
to be executed.
An arithmetic and logic unit (also called data path) in which instructions are executed.
A program is coded as a set of instructions to be executed sequentially, instruction after instruction. At each step of the
program execution, the next instruction is fetched from the memory at the address specified in the program counter and
decoded. The required operands are then collected from the memory before the instruction is executed. After execution,
the result is written back into the memory.
In this process, the control path is in charge of setting all signals necessary to read from and write to the memory, and to
allow the data path to perform the right computation. The data path is controlled by the control path, which interprets the
instructions and sets the data path's signals accordingly to execute the desired operation.
In general, the execution of an instruction on a VN computer can be done in five cycles: Instruction Read (IR), in which
an instruction is fetched from the memory; Decoding (D), in which the meaning of the instruction is determined and the
operands are localized; Read Operands (R), in which the operands are read from the memory; Execute (EX), in which the
instruction is executed with the read operands; and Write Result (W), in which the result of the execution is stored back to the
memory.
In each of those five cycles, only the part of the hardware involved in the computation is activated. The rest remains
idle.
For example, if the IR cycle is to be performed, the program counter is activated to provide the address of the instruction,
the memory is addressed, and the instruction register, which stores the instruction before decoding, is also activated.
Apart from those three units (program counter, memory, and instruction register), all other units remain idle.
Fortunately, the structure of instructions allows several of them to occupy the idle part of the processor, thus increasing
the computation throughput.
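The five-phase cycle can be mimicked with a toy software model. The memory layout, the ADD/HALT encoding, and the function names below are invented purely for illustration; no real instruction set is implied.

```python
# Toy model of the five-phase Von Neumann instruction cycle (IR, D, R, EX, W).
# Memory holds both the program (at low addresses) and the data, as in a
# classical VN machine; the encoding here is hypothetical.

MEMORY = {0: ("ADD", 10, 11, 12),  # instruction: mem[12] = mem[10] + mem[11]
          1: ("HALT",),
          10: 4, 11: 5, 12: 0}

def run(memory):
    pc = 0                                 # program counter
    while True:
        instr = memory[pc]                 # IR: fetch the instruction at PC
        op = instr[0]                      # D:  decode, localize operands
        if op == "HALT":
            return memory
        _, src1, src2, dst = instr
        a, b = memory[src1], memory[src2]  # R:  read operands from memory
        result = a + b                     # EX: execute (only ADD modeled)
        memory[dst] = result               # W:  write the result back
        pc += 1                            # advance to the next instruction

run(MEMORY)
print(MEMORY[12])  # 9
```

In each loop iteration only one conceptual unit is "busy" at a time, which is exactly the idleness that pipelining exploits.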
All rights reserved - Not to be copied and Sold


Instruction Level Parallelism


Pipelining is a transparent way to optimize the hardware utilization as well as the performance of programs. Because the
execution of one instruction cycle affects only a part of the hardware, idle parts could be activated by having many
different cycles executing together.
For one instruction, it is not possible to have many cycles being executed together. For instance, any attempt to perform
the execute cycle (EX) together with the reading of an operand (R) for the same instruction will not work, because the
data needed for the EX should first be provided by R cycle.
Nevertheless, the two cycles EX and D can be performed in parallel for two different instructions. Once the data have
been collected for the first instruction, its execution can start while the data are being collected for the second
instruction. This overlapping in the execution of instructions is called pipelining or instruction level parallelism (ILP),
and it is aimed at increasing the throughput in the execution of instructions as well as the resource utilization. It should
be mentioned that ILP does not reduce the execution latency of a single instruction, but increases the throughput of a set
of instructions.

If tcycle is the time needed to execute one cycle, then the execution of one instruction requires 5 tcycle. If
three instructions have to be executed, then the time needed to perform those three instructions without
pipelining is 15 tcycle, as illustrated in the figure. Using pipelining, the ideal time needed to perform those three
instructions, when no hazards have to be dealt with, is 7 tcycle. In reality, hazards must be taken into account, which
increases the overall computation time to 9 tcycle.
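The cycle counts above follow from a simple formula, sketched here in Python (the function name and the stall-cycle parameter are ours, not from the notes):

```python
def execution_time(n_instructions, n_stages=5, stall_cycles=0):
    """Cycle counts for a simple in-order pipeline.

    Without pipelining, every instruction occupies all n_stages cycles
    back to back. With ideal pipelining, one instruction completes per
    cycle once the pipeline is full; hazard stalls add extra cycles.
    """
    sequential = n_instructions * n_stages
    pipelined = n_stages + (n_instructions - 1) + stall_cycles
    return sequential, pipelined

# Three instructions on a five-stage pipeline, as in the text:
print(execution_time(3))                  # (15, 7) -- ideal, no hazards
print(execution_time(3, stall_cycles=2))  # (15, 9) -- with two stall cycles
```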
The main advantage of the VN computing paradigm is its flexibility, because it can be used to program almost all
existing algorithms. However, each algorithm can be implemented on a VN computer only if it is coded according to
the VN rules. We say in this case that "the algorithm must adapt itself to the hardware".
Disadvantage:
Because all algorithms must be sequentially programmed to run on a VN computer, many algorithms
cannot be executed at their best potential performance. Algorithms that perform the same set of inherently
parallel operations on a huge set of data are not good candidates for implementation on a VN machine.
If the class of algorithms to be executed is known in advance, then the processor can be modified to better match
the computation paradigm of that class of applications. In this case, the data path will be tailored to always execute
the same set of operations, thus making the memory access for instruction fetching as well as the instruction
decoding redundant.
Also because of the temporal use of the same hardware for a wide variety of applications, VN computation is often
characterized as temporal computation.


2. Domain-Specific Processors
A domain-specific processor is a processor tailored for a class of algorithms.
The data path is tailored for an optimal execution of a common set of operations that mostly characterizes the
algorithms in the given class.
Also, memory access is reduced as much as possible.

Digital Signal Processors (DSPs) are among the most widely used domain-specific processors.
A DSP is a specialized processor used to speed up the computation of repetitive, numerically intensive tasks in signal
processing areas such as telecommunications, multimedia, automotive, radar, sonar, seismic, and image processing.
The most often cited feature of DSPs is their ability to perform one or more multiply-accumulate (MAC)
operations in a single cycle. Usually, MAC operations have to be performed on a huge set of data. In a MAC
operation, data are first multiplied and then added to an accumulated value.
A normal VN computer would perform a MAC in 10 steps: the first instruction (multiply) would be fetched, then
decoded, then the operands would be read and multiplied, and the result stored back; the next instruction
(accumulate) would then be read and decoded, the result stored in the previous step would be read again and added to the accumulated
value, and the result would be stored back.
DSPs avoid those steps by using specialized hardware that directly performs the addition after multiplication without
having to access the memory.
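The MAC pattern described above is simply an accumulating product loop, sketched here in Python (the function name is ours; in a DSP, each iteration's multiply-and-add maps to a single hardware cycle):

```python
def mac(coeffs, samples):
    """Multiply-accumulate loop: the core of FIR filters and dot products.

    A DSP performs each `acc += c * x` in a single cycle with dedicated
    hardware; a plain VN machine needs separate multiply and accumulate
    instructions with memory traffic in between, as described in the text.
    """
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x   # one MAC operation per data pair
    return acc

print(mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```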
Because many DSP algorithms involve performing repetitive computations, most DSP processors provide special
support for efficient looping. Often a special loop or repeat instruction is provided, which allows a loop
implementation without expending any instruction cycles for updating and testing the loop counter or branching
back to the top of the loop.
DSPs are also customized for data of a given width according to the application domain. For example, if a DSP is
to be used for image processing, then pixels have to be processed. If the pixels are represented in the Red Green Blue
(RGB) system, where each colour is represented by a byte, then an image-processing DSP will not need more than an
8-bit data path. Obviously, such a DSP cannot be reused for applications requiring 32-bit
computation.
Advantage and Disadvantage: This specialization increases the performance of the processor and improves the device utilization.
However, the flexibility is reduced, because the DSP can no longer be used to implement applications other than
those for which it was optimally designed.

3. Application-Specific Processors
Although DSPs incorporate a degree of application-specific features such as MAC and data width optimization, they
still follow the VN approach and, therefore, remain sequential machines. Their performance is limited.
If a processor is to be used for only one application, which is known and fixed in advance, then the processing unit
can be designed and optimized for that particular application. In this case, we say that the hardware adapts itself to the
application.
For example, in multimedia processing, processors are usually designed to perform the compression of video frames
according to a video compression standard. Such processors cannot be used for anything other than compression. Even
in compression, the standard must exactly match the one implemented in the processor.
A processor designed for only one application is called an Application-Specific Processor (ASIP).

In an ASIP, the instruction cycles (IR, D, R, EX, W) are eliminated.
The instruction set of the application is directly implemented in hardware.
Input data stream into the processor through its inputs, the processor performs the required computation, and the
results can be collected at the outputs of the processor.
ASIPs are usually implemented as single chips called Application-Specific Integrated Circuits (ASICs).

Example: If Algorithm 1 has to be executed on a Von Neumann computer, then at least 3 instructions are required.


With tcycle being the instruction cycle time, the program will be executed in 3 × 5 tcycle = 15 tcycle without pipelining.
Let us now consider the implementation of the same algorithm in an ASIP. We can implement the instructions d = a +
b and c = a × b in parallel. The same is also true for d = b + 1 and c = a − 1, as illustrated in the figure below:

Fig: ASIP implementation of Algorithm 1

The four operations a + b, a × b, b + 1, a − 1, as well as the comparison a < b, will be executed in parallel in a first stage.
Depending on the value of the comparison a < b, the correct values of the previous stage's computations will be assigned
to c and d as defined in the program.
Let tmax be the longest time needed by a signal to move from one point to another in the physical implementation of
the processor (this happens on the path input-multiplier-multiplexer). tmax is also called the cycle time of the ASIP
processor.
For two inputs a and b, the results c and d can be computed in time tmax. The VN processor can compete with this
ASIP only if 15 tcycle < tmax, i.e. tcycle < tmax/15: the VN clock must be at least 15 times faster than the ASIP's
cycle time to be competitive.
ASIPs use a spatial approach to implement only one application. The functional units needed for the computation of all
parts of the application must be available on the surface of the final processor. This kind of computation is called
Spatial Computing.
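The spatial datapath of the figure can be mimicked in software. In hardware, all four operators and the comparator evaluate simultaneously in one stage, and two multiplexers then select the outputs; the sequential Python below only models the dataflow. The concrete operations (a + b, a × b, b + 1, a − 1) are reconstructed from the text, since the original Algorithm 1 listing is not reproduced in these notes.

```python
def asip_algorithm1(a, b):
    """Software mimic of the spatial (ASIP) datapath for Algorithm 1.

    Stage 1: in hardware, the four functional units and the comparator
    all compute concurrently. Stage 2: two multiplexers, driven by the
    comparison a < b, pick the values assigned to c and d.
    """
    # Stage 1: every functional unit computes (in hardware: in parallel).
    sum_ab, prod_ab = a + b, a * b
    b_inc, a_dec = b + 1, a - 1
    sel = a < b                      # comparator output drives the muxes
    # Stage 2: multiplexers select the outputs.
    d = sum_ab if sel else b_inc
    c = prod_ab if sel else a_dec
    return c, d

print(asip_algorithm1(2, 5))  # a < b: c = 2*5 = 10, d = 2+5 = 7
print(asip_algorithm1(5, 2))  # else:  c = 5-1 = 4,  d = 2+1 = 3
```

The whole computation finishes in one hardware cycle (tmax), whereas the VN version needs 15 instruction cycles, which is the comparison made above.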
Once again, an ASIP that is built to perform a given computation cannot be used for other tasks other than those for
which it has been originally designed.

4. Reconfigurable Computing
We can identify two main means to characterize processors: flexibility and performance.
VN COMPUTERS
The VN computers are very flexible because they are able to compute any kind of task. This is the reason why the
terminology GPP (General Purpose Processor) is used for the VN machine.
They do not deliver much performance, because they cannot compute in parallel. Moreover, the five steps (IR, D,
R, EX, W) needed to perform one instruction become a major drawback, in particular if the same instruction has
to be executed on huge sets of data.
Flexibility is possible because the application must always adapt to the hardware in order to be executed.
ASIPs:
ASIPs deliver much performance because they are optimized for a particular application. The instruction set
required for that application can then be built into a chip.
Performance is possible because the hardware is always adapted to the application.


Ideally, we would like to have the flexibility of the GPP and the performance of the ASIP in the same device. We would
like to have a device able to adapt to the application on the fly. We call such a hardware device a reconfigurable
hardware or reconfigurable device or reconfigurable processing unit (RPU).
Reconfigurable computing is defined as the study of computation using reconfigurable devices.
For a given application, at a given time, the spatial structure of the device will be modified such as to use the best
computing approach to speed up that application. If a new application has to be computed, the device structure will be
modified again to match the new application. Contrary to VN computers, which are programmed by a set of
instructions to be executed sequentially, the structure of reconfigurable devices is changed by modifying all or part of
the hardware at compile-time or at run-time, usually by downloading a so-called bitstream into the device.
Configuration is the process of changing the structure of a reconfigurable device at start-up time.
Reconfiguration is the process of changing the structure of a reconfigurable device at run-time.

5. Fields of Application
5.1 Rapid Prototyping
The development of an ASIC, which can be seen as the physical implementation of an ASIP, is a cumbersome process
consisting of several steps, from the specification down to the layout of the chip and the final production. Because of
the different optimization goals in development, several teams of engineers are usually involved. This increases the
non-recurring engineering (NRE) cost, which can only be amortized if the final product is produced in very large
quantities.
Contrary to software development, errors discovered after production mean enormous losses, because the produced
pieces become unusable.
Rapid prototyping allows a device to be tested in real hardware before the final production. In this way, errors can
be corrected without affecting the pieces already produced. A reconfigurable device is useful here, because it can be
used several times to implement different versions of the final product until an error-free state is reached.
One of the concepts related to rapid prototyping is hardware emulation, in analogy to software simulation. With
hardware emulation, a hardware module is tested under real operating conditions in the environment where it will be
deployed later.

5.2 In-System Customization

Time to market is one of the main challenges that electronic manufacturers face today. In order to secure market
segments, manufacturers must release their products as quickly as possible. In many cases, a well working product
can be released with less functionality. The manufacturer can then upgrade the product in the field to incorporate
new functionalities.
A reconfigurable device provides such capabilities of being upgraded in the field, by changing the configuration.
In-system customization can also be used to upgrade systems deployed in locations that are inaccessible or very
difficult to access. One example is the Mars rover vehicle, in which some FPGAs, which can be modified from
Earth, are used.

5.3 Multi-modal Computation


The number of electronic devices we interact with is permanently increasing. Besides a mobile phone, many people
carry other devices such as handhelds, portable mp3 players, and portable video players. Besides those mobile
devices, fixed devices such as navigation systems, music and video players, as well as TV sets are available in
cars and at home.
All those devices are equipped with electronic control units that run the desired application on the desired device.
Furthermore, many of the devices are used in a time-multiplexed fashion. It is difficult to imagine someone playing
mp3 songs while watching a video clip and making a phone call.
For a group of devices used exclusively in a time-multiplexed way, a single electronic control unit can be used.
Whenever a service is needed, the control unit is connected to the corresponding device at the correct location and
reconfigured with the adequate configuration.
For instance, a domestic mp3 player, a domestic DVD player, a car mp3 player, a car DVD player, as well as a mobile
mp3 player and a mobile video player can all share the same electronic unit, if they are always used by the same person.
The user just needs to remove the control unit from the domestic devices and connect it to the car device when
going to work. The control unit can be removed from the car and connected to a mobile device if the user decides to
go for a walk. Coming back home, the electronic control unit is removed from the mobile device and used for
watching video.


5.4 Adaptive Computing Systems


Advances in computation and communication are helping the development of ubiquitous and pervasive computing.
The design of a ubiquitous computing system is a cumbersome task that cannot be dealt with only at compile time.
Because of the uncertainty and unpredictability of such systems, it is impossible, at compile time, to address all
scenarios that can happen at run-time, because of unpredictable changes in the environment.
We need computing systems that are able to adapt their behavior and structure to changing operating and
environmental conditions, to time-varying optimization objectives, and to physical constraints such as changing
protocols and new standards. We call such computing systems Adaptive Computing Systems.
Reconfiguration provides a good foundation for the realization of adaptive systems, because it allows a system to
quickly react to changes by adopting the optimal behavior for a given run-time scenario.

6. Reconfigurable Devices

Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives,
interconnected via a configurable network.
The operation of each primitive can be programmed as well as the interconnect pattern. Computational tasks can be
implemented spatially on the device with intermediates flowing directly from the producing function to the
receiving function.
Reconfigurable computing generally provides spatially-oriented processing rather than the temporally-oriented
processing typical of programmable architectures such as microprocessors.

6.1 Reconfigurable Devices characteristics:


The key differences between reconfigurable machines and conventional processors are:
Instruction Distribution: Rather than broadcasting a new instruction to the functional units on every cycle,
instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and
effectively deliver more instructions into active silicon on each cycle.
Spatial routing of intermediates: As space permits, intermediate values are routed in parallel from producing
function to consuming function rather than forcing all communication to take place in time through a central resource
bottleneck.
More, often finer-grained, separately programmable building blocks: Reconfigurable devices provide a large
number of separately programmable building blocks, allowing a greater range of computations to occur per time step.
This effect is largely enabled by the compressed instruction distribution.
Distributed, deployable resources, eliminating bottlenecks: Resources such as memory, interconnect, and functional
units are distributed and deployable based on need rather than being centralized in large pools. Independent, local
access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a
central resource bottleneck.

7. Programmable, Configurable and Fixed Function Devices


Programmable: We use the term programmable to refer to architectures that heavily and rapidly reuse a single
piece of active circuitry for many different functions. The canonical example of a programmable device is a processor,
which may perform a different instruction on its ALU on every cycle. All processors, be they microcoded, SIMD, vector,
or VLIW, are included in this category.
Configurable: We use the term configurable to refer to architectures where the active circuitry can perform any of a
number of different operations, but the function cannot be changed from cycle to cycle. FPGAs are our canonical example
of a configurable device.
Fixed Function, Limited Operation Diversity, High Throughput: When the function and data granularity to be
computed are well understood and fixed, and when the function can be economically implemented in space, dedicated
hardware provides the most computational capacity per unit area to the application.
Variable Function, Low Diversity: If the required function is unknown or varying, but the instruction or data diversity is
low, the task can be mapped directly to a reconfigurable computing device, which efficiently extracts high computational
density.
Space Limited, High Entropy: If we are limited spatially and the function to be computed has a high operation and data
diversity, we are forced to reuse the limited active space heavily and accept limited instruction and data bandwidth. In this
regime, conventional processor organizations are most effective, since they dedicate considerable space to on-chip
instruction storage in order to minimize off-chip instruction traffic while executing descriptively complex tasks.

8. General-Purpose Computing
General-purpose computing devices are specifically intended for those cases where we cannot or need not dedicate
sufficient spatial resources to support an entire computational task or where we do not know enough about the
required task or tasks prior to fabrication to hardwire the functionality.
The key ideas behind general-purpose processing are:
i. Postpone binding of functionality until the device is deployed, i.e. after fabrication;
ii. Exploit temporal reuse of limited functional capacity.
Delayed binding and temporal reuse work closely together and occur at many scales to provide the characteristics
we now expect from general-purpose computing devices.
Market Level: Rather than dedicating a machine design to a single application or application family, the
design effort may be utilized for many different applications.
System Level: Rather than dedicating an expensive machine to a single application, the machine may
perform different applications at different times by running different sets of instructions.
Application Level: Rather than spending precious real estate to build a separate computational unit for
each different function required, central resources may be employed to perform these functions in sequence,
with an additional input, an instruction, telling them how to behave at each point in time.
Algorithm Level: Rather than fixing the algorithms which an application uses, an existing general-purpose
machine can be reprogrammed with new techniques and algorithms as they are developed.
User Level: Rather than fixing the function of the machine at the supplier, the instruction stream specifies
the function, allowing the end user to use the machine as best suits his needs. Machines may be used for
functions which the original designers did not conceive. Further, machine behavior may be upgraded in the
field without incurring any hardware or hardware-handling costs.
9. General-Purpose Computing Issues
There are two key features associated with general-purpose computers which distinguish them from their specialized
counterparts:
Interconnect
In general-purpose machines, the datapaths between functional units cannot be hardwired.
Different tasks will require different patterns of interconnect between the functional units. Within a task
individual routines and operations may require different interconnectivity of functional units.
General-purpose machines must provide the ability to direct data flow between units. In the extreme of a single
functional unit, memory locations are used to perform this routing function.
As more functional units operate together on a task, spatial switching is required to move data among functional
units and memory.
The flexibility and granularity of this interconnect is one of the big factors determining yielded capacity on a
given application.
Instructions
Since general-purpose devices must provide different operations over time, either within a computational task or
between computational tasks, they require additional inputs, instructions, which tell the silicon how to behave at
any point in time.
Each general-purpose processing element needs one instruction to tell it what operation to perform and where to
find its inputs.
The handling of this additional input is one of the key distinguishing features between different kinds of general-purpose computing structures.
When the functional diversity is large and the required task throughput is low, it is not efficient to build up the
entire application dataflow spatially in the device. Rather, we can realize applications, or collections of
applications, by sharing and reusing limited hardware resources in time (See Figure 2.1) and only replicating the
less expensive memory for instruction and intermediate data storage.
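The idea of an instruction as an extra input that tells shared silicon how to behave at each point in time can be sketched as follows. The opcode names and the dispatch table are illustrative only, not drawn from any real instruction set.

```python
def functional_unit(opcode, a, b):
    """A single shared ALU: the instruction (opcode) is an additional
    input that selects what the same piece of hardware does on this
    step. Opcode names here are hypothetical, not from any real ISA."""
    ops = {
        "ADD": lambda: a + b,
        "SUB": lambda: a - b,
        "AND": lambda: a & b,
        "OR":  lambda: a | b,
    }
    return ops[opcode]()

# The one unit is reused in time for different operations, trading
# spatial replication for instruction storage and sequencing:
program = [("ADD", 3, 4), ("SUB", 9, 5), ("AND", 6, 3)]
print([functional_unit(op, a, b) for op, a, b in program])  # [7, 4, 2]
```

This is the temporal-reuse extreme; a reconfigurable device instead spreads such operators out in space and holds each one's "instruction" fixed in its local configuration.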
