Performance Model for a Reconfigurable Coprocessor
Indrajeet Kalyankar
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, Virginia 23529
Email: ikaly001@odu.edu

I. INTRODUCTION

The important characteristic of a Reconfigurable Logic Module is its ability to perform computations in hardware to increase performance, while being as malleable as a software solution. Although FPGAs are more expensive than ASICs, they provide a simplified, low-cost design platform. Given a computational task, three approaches can be taken to solve the problem.

The first is to use hard-wired technology, either an Application Specific Integrated Circuit (ASIC) or a set of individual components forming an on-board solution, to perform the task in hardware. ASICs are designed specifically to perform a given computation, and thus are very fast and efficient [1] when executing the exact computation they were designed for. However, the circuit cannot be altered after fabrication. This necessitates a redesign and refabrication of the chip if any part of its circuit requires modification. This is an expensive process, especially when one considers the problems in replacing ASICs in a large number of systems. Board-level circuits are also inflexible to a certain degree, often requiring a board redesign and replacement in the event of changes to the application.

The second method is to use software-programmed microprocessors, a far more flexible solution. Processors execute a sequence of instructions to perform a computation. By changing the machine instructions, the functionality of the system is altered without changing the hardware. However, the downside of this flexibility is that performance suffers, and is far below that of an ASIC. The processor must fetch each instruction from memory, decode its meaning, and only then execute it. This results in a high execution overhead for each individual operation. Additionally, the set of instructions that may be used by a program is determined at the fabrication time of the processor. Any other operations that are to be implemented must be built out of existing instructions.

Video game systems can also be said to be reconfigurable, because they can be changed into different game machines by exchanging ROM cartridges or CD-ROMs. In addition, today's microcontroller-based embedded systems, such as those found in automobiles and almost all household electric appliances, can be categorized in the same group, because different functions are provided by changing program ROMs.

However, for all of these examples, the reconfigurability lies in the software and, only to a certain extent, in the hardware. Hence the distinction can be drawn between programmable processors and configurable ones [3]. The third approach, reconfigurable computing, is intended to bridge the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits [6]. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks, and using the configurable routing to connect the blocks together to form the necessary circuit.

Perhaps more attractive, however, is the ability of the FPGA to operate as an unlimited number of unique circuits through device configuration. The cost of the device can be amortized over the various circuits that operate on the device. One method used to mix the programmability of processor-based systems with the hardware configurability of FPGA-based systems is to implement application-specific processors on an FPGA.

II. BACKGROUND

Today, many kinds of reconfigurable logic devices are available, such as FPGAs and CPLDs (Complex Programmable Logic Devices), and they are the keys to constructing reconfigurable systems. Almost all reconfigurable systems use commercially available FPGAs/CPLDs, but some utilize custom reconfigurable chips. In this section, FPGAs/CPLDs are categorized according to their characteristics [2].

• Configurable logic: This is a general term for logic devices that can be customized one time.
• Reconfigurable logic: This refers to logic devices that can be customized many times. As expected, these devices often adopt EPROM, EEPROM, or FLASH technology.
• Dynamically reconfigurable logic: This supports on-the-fly programming capability after mounting on a system board. It is often called in-circuit reconfiguration.
• Dynamically reconfigurable interconnect: This is a general term for interconnect devices whose pin-to-pin connections can be programmed after mounting on a system board. They are similar to SRAM-based FPGAs, but do not have any programmable logic blocks.
• Virtual logic: This is a kind of dynamically reconfigurable logic device, but it features partial reconfiguration capability. This mechanism allows part of the device to be reprogrammed dynamically while the rest of the device is executing user-defined logic. In other words, different logic circuits can time-share the same part of this device.

One limitation of building customized processors on FPGAs is the lack of hardware resources available for both a complete processor core and a specialized instruction set. A functional processor core and a few hardware-intensive instruction modules can quickly consume all the resources of even the largest FPGAs available today. One technique used to provide more hardware resources for FPGAs is to reconfigure the FPGA during application execution. By constantly weeding out idle hardware from an FPGA, on-chip real estate can be recovered and more resources become available than are offered by a one-time configured device [4][5].

Working with reconfigurable hardware can be compared to working with caches. The cache has the shortest access time and the highest bandwidth as compared to main memory. Similarly, the execution time of functional blocks is much shorter in hardware than in software. However, the functional block must first be present in the reconfigurable coprocessor. In other words, we should have a function block hit, which is analogous to a cache hit.

Further, both reconfigurable hardware and caches exploit locality, a property of most programs. Spatial locality, with reference to caches, indicates that given a reference to a particular data location in memory, there exists a high probability that other references will be made to data in neighboring locations. With reference to reconfigurable hardware, spatial locality is dependent on the organization of the main memory that stores the instructions to reconfigure functional blocks in the coprocessor. But reconfigurable hardware can make use of a different type of locality, functional locality. Programs do exhibit functional locality. For example, when working with a task such as coarse-grain image processing, it is likely that the following routine would be used.

begin;
HIST image; execute the histogram operation on the image located at image
MEDIAN image threshold; calculate the median pixel value on image and store in threshold
THRESH threshold image; enhance binary contrast of image by thresholding
COPY image, image; copy entire image to a new location
ERODE image; perform morphological erosion to image
DIFF image, image; subtract image from image to obtain outline
end

Now, a reference to the functional block HIST implies that the next few references are likely to be MEDIAN, THRESH, COPY, ERODE, and DIFF, in order. This high probability of a sequence of references to functional blocks that pertain to a definite task is called functional locality.

Temporal locality can be exploited by both caches and reconfigurable hardware. Given some references to a group of locations, it is highly probable that the same locations will be referenced again in the near future. This is a common characteristic of most programs due to the presence of loops.

On the other hand, the dissimilarities include the absence of write policies in reconfigurable hardware and differences in replacement algorithms. Reconfigurable hardware does not need write policies, as functional blocks that are no longer necessary can simply be overwritten. Replacement techniques in caches overwrite lines that are deemed useless by replacement policies such as LRU, FIFO, and RANDOM. However, replacement techniques in hardware have to acknowledge the physical structure of the reconfigurable coprocessor. Some modules in digital systems require fixed locations in hardware because of strict global and local physical constraints.

III. GENERIC PERFORMANCE MODEL

The generic reconfigurable coprocessor model is specified by the following system parameters.

• $t_b$: normal block execution time. This is the average time required to execute a functional block on the processor in software, without reconfigurable hardware.
• $k$: speedup. This is how much faster, on average, a functional block executes in the reconfigurable hardware than in a software implementation.
• $t_c$: function block call time. This is the time required to call a functional block in the reconfigurable hardware.
• $t_p$: reconfiguration programming time. This is the time required to program a functional block into the system if it is not already present.
• $P_h$: probability of a hit. This is the probability that a functional block is present in the reconfigurable hardware, i.e., it does not need to be programmed.
• $t_n$: normal execution time between function blocks. This is the time the processor spends executing between the execution of function blocks.

The general performance model executes as shown in Figure 1. It is evident from the flowchart that details such as tracking the available on-chip real estate and replacement policies are omitted. Figure 2 and Figure 3 present the timelines of the running system.

\[ \mathit{Time}_{software} = t_n + t_b \]

\[ \mathit{Time}_{hardware} = t_n + P_h \, t_c + (1 - P_h)(t_c + t_p) \]

\[ k_{speedup} = \frac{\mathit{Time}_{software}}{\mathit{Time}_{hardware}} = \frac{t_n + t_b}{t_n + P_h \, t_c + (1 - P_h)(t_c + t_p)} \]

Figure 4 shows the total execution time for a normal processor. As is evident, $t_b$, the normal block execution time, and $t_n$, the normal execution time between function blocks, are equally important in reducing the total execution time. For the reconfigurable processor, there are four variables: $t_n$, $P_h$, $t_c$, and $t_p$. Figures 5, 6, and 7 show the total execution time with all the
parameters varying. Observing the plots, one can see that for a high function block call time and a low reconfiguration time, the hit rate matters little. Figure 8 reveals the break-even line for the two systems. Higher hit rates and lower reconfiguration times help to break even earlier. An increase in speedup is indicated by the widening gap between the two surfaces after breaking even.

[Fig. 1. General Performance Model Flowchart. Boxes: instruction fetch; "Is instruction a functional block?"; "Is block present in reconfigurable coprocessor?"; fetch functional block and program into the coprocessor; instruction execute.]

[Fig. 2. Timeline when executing in software. Segments: BASIC $t_n$, $t_b$, BASIC $t_n$, $t_b$.]

[Fig. 3. Timeline when executing in reconfigurable coprocessor. Segments: BASIC $t_n$, $t_c$, BASIC $t_n$, $t_p$, $t_c$, BASIC $t_n$.]

[Fig. 5. Total Time for a Reconfigurable System, $P_h = 0.9$, $t_n = 0.0, 0.5, 1.0$ units. Axes: function block call time, reconfiguration time, total execution time.]

[Fig. 6. Total Time for a Reconfigurable System, $P_h = 0.5$, $t_n = 0.0, 0.5, 1.0$ units. Axes: function block call time, reconfiguration time, total execution time.]
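The generic model of this section can be explored numerically. The following is a minimal Python sketch, not part of the original analysis. The function names are illustrative assumptions, and the break-even expression is derived here by setting Time_software equal to Time_hardware, which gives P_h = 1 - (t_b - t_c) / t_p.

```python
from typing import Optional


def software_time(t_n: float, t_b: float) -> float:
    """Time_software = t_n + t_b."""
    return t_n + t_b


def hardware_time(t_n: float, t_c: float, t_p: float, p_h: float) -> float:
    """Time_hardware = t_n + P_h*t_c + (1 - P_h)*(t_c + t_p)."""
    if not 0.0 <= p_h <= 1.0:
        raise ValueError("p_h must be a probability in [0, 1]")
    return t_n + p_h * t_c + (1.0 - p_h) * (t_c + t_p)


def speedup(t_n: float, t_b: float, t_c: float, t_p: float, p_h: float) -> float:
    """k_speedup = Time_software / Time_hardware."""
    return software_time(t_n, t_b) / hardware_time(t_n, t_c, t_p, p_h)


def break_even_hit_rate(t_b: float, t_c: float, t_p: float) -> Optional[float]:
    """Smallest hit rate at which the reconfigurable system matches software.

    Assumption of this sketch: obtained by solving
    t_b = t_c + (1 - P_h) * t_p for P_h. Returns None when even a hit
    (cost t_c) is slower than the software block (cost t_b).
    """
    if t_b < t_c:
        return None
    if t_p == 0.0:
        return 0.0  # no reconfiguration penalty, so any hit rate suffices
    return max(0.0, 1.0 - (t_b - t_c) / t_p)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the figures.
    print(speedup(t_n=0.5, t_b=1.0, t_c=0.2, t_p=1.0, p_h=0.9))
    print(break_even_hit_rate(t_b=1.0, t_c=0.2, t_p=1.0))
```

With these illustrative values the speedup evaluates to about 1.875 and the break-even hit rate to about 0.2. Note that $t_n$ cancels out of the break-even condition, so it affects the size of the speedup but not where the two systems cross.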
IV. APPLICATION PERFORMANCE MODEL

The application is characterized by a main iterative loop in which a core operation is implemented as a functional block. The basic structure of the algorithm is:

initialization
for i = 1 to N
    basic computation
    function block computation
cleanup

1) Expected execution time for normal processor: Since a normal processor would run the function block in software, we conclude

\[ \mathit{Time}_{normal} = t_{init} + t_{basic} N + t_b N + t_{cleanup} \]

2) Expected execution time for reconfigurable system: The system suffers a penalty of $t_p$ units for a function block miss (the analogue of a cache miss) only during the first iteration of the loop. The remaining iterations of the loop are executed in the coprocessor without further reconfiguration. Hence,

\[ \mathit{Time}_{reconfig} = t_{init} + t_{basic} N + t_c N + t_p + t_{cleanup} \]

3) Speedup:

\[ \mathit{Speedup} = \frac{\mathit{Time}_{normal}}{\mathit{Time}_{reconfig}} = \frac{t_{init} + t_{basic} N + t_b N + t_{cleanup}}{t_{init} + t_{basic} N + t_c N + t_p + t_{cleanup}} \]

[Fig. 4. Total Time for a Normal Processor. Axes: execution time between function blocks, normal block execution time, total execution time.]

[Fig. 7. Total Time for a Reconfigurable System, $P_h = 0.1$, $t_n = 0.0, 0.5, 1.0$ units. Axes: function block call time, reconfiguration time, total execution time.]

[Fig. 8. Break Even Line, $P_h = 0.9$, $t_n = 0.5$ units for the reconfigurable system. Surfaces: normal processor, reconfigurable system. Axes: function block call time and reconfiguration time; normal block execution time and normal execution time between function blocks; total execution time.]

V. CONCLUSIONS

The performance benefit of a reconfigurable system is visible only if the hit rates are favorable; otherwise, the reconfiguration times must be low. Factors such as intelligent prefetching of function blocks and replacement policies, which would play a key role in performance and could enrich the analysis, have not been considered. Parallelizing the operation of the reconfigurable coprocessor could pay rich dividends. Reconfigurable systems can only fare better than their ASIC counterparts in applications where function blocks run sufficiently slowly in software and the task being worked on needs a multitude of function blocks. In other words, "reconfigurable-worthy" applications have to be identified to warrant the use of reconfigurable systems.

REFERENCES

[1] J. R. Hauser and J. Wawrzynek, Garp: A MIPS Processor with a Reconfigurable Coprocessor, Workshop on FPGAs for Custom Computing Machines, pp. 24–33, 1997.
[2] Toshiaki Miyazaki, Reconfigurable Systems: A Survey, NTT Optical Network Systems Laboratories, A1-329S, 3-1 Morinosato Wakamiya, Atsugi, 243-01 Japan.
[3] E. Sanchez et al., Static and Dynamic Configurable Systems, IEEE Trans. on Computers, 48, 6, June 1999, 556–563.
[4] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao, The Chimaera Reconfigurable Functional Unit, IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[5] Michael J. Wirthlin and Brad L. Hutchings, DISC: The Dynamic Instruction Set Computer, IEEE Symposium on FPGAs for Custom Computing Machines, 1995.
[6] K. Compton and S. Hauck, Reconfigurable Computing: A Survey of Systems and Software, ACM Computing Surveys, 2002.