Performance Model for a Reconfigurable Coprocessor

Indrajeet Kalyankar
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, Virginia 23529
Email: ikaly001@odu.edu

I. INTRODUCTION

The defining characteristic of a reconfigurable logic module is its ability to perform computations in hardware to increase performance while remaining as malleable as a software solution. Although FPGAs are more expensive than ASICs, they provide a simplified, low-cost design platform.

Given a computational task, three approaches can be taken to solve the problem. The first is to use hard-wired technology, either an Application Specific Integrated Circuit (ASIC) or a set of individual components forming a board-level solution, to perform the task in hardware. ASICs are designed specifically to perform a given computation, and thus are very fast and efficient [1] when executing the exact computation they were designed for. However, the circuit cannot be altered after fabrication, so the chip must be redesigned and refabricated if any part of its circuit requires modification. This is an expensive process, especially when one considers the problem of replacing ASICs in a large number of deployed systems. Board-level circuits are also inflexible to a certain degree, often requiring a board redesign and replacement when the application changes.

The second method is to use software-programmed microprocessors, a far more flexible solution. Processors execute a sequence of instructions to perform a computation. By changing the machine instructions, the functionality of the system is altered without changing the hardware. The downside of this flexibility is that performance suffers and falls far below that of an ASIC. The processor must fetch each instruction from memory, decode its meaning, and only then execute it, which results in a high execution overhead for each individual operation. Additionally, the set of instructions that a program may use is fixed when the processor is fabricated; any other operation must be built out of existing instructions.
Video games can also be said to be reconfigurable, because a console becomes a different game machine when ROM cartridges or CD-ROMs are exchanged. In addition, today's microcontroller-based embedded systems, such as those found in automobiles and almost all household electric appliances, can be placed in the same group, because different functions are provided by changing program ROMs. However, for all of these examples, the reconfigurability lies in the software and only to a certain extent in the hardware. Hence a distinction can be drawn between programmable processors and configurable ones [3].

Reconfigurable computing, the third approach, is intended to bridge the gap between hardware and software, achieving potentially much higher performance than software while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits [6]. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks and using the configurable routing to connect the blocks together to form the necessary circuit. Perhaps more attractive, however, is the ability of the FPGA to operate as an unlimited number of unique circuits through device configuration, so that the cost of the device can be amortized over the various circuits that operate on it. One method used to combine the programmability of processor-based systems with the hardware configurability of FPGA-based systems is to implement application-specific processors on an FPGA.

II. BACKGROUND

Today, many kinds of reconfigurable logic devices are available, such as FPGAs and CPLDs (Complex Programmable Logic Devices), and they are the keys to constructing reconfigurable systems. Almost all reconfigurable systems use commercially available FPGAs/CPLDs, but some utilize custom reconfigurable chips. In this section, FPGAs/CPLDs are categorized according to their characteristics [2].

• Configurable logic: This is a general term for logic devices that can be customized one time.
• Reconfigurable logic: This refers to logic devices that can be customized many times.
As expected, these devices often adopt EPROM, EEPROM, or FLASH technology.
• Dynamically reconfigurable logic: This supports on-the-fly programming after the device is mounted on a system board. It is often called in-circuit reconfiguration.
• Dynamically reconfigurable interconnect: This is a general term for interconnect devices whose pin-to-pin connections can be programmed after mounting on a system board. They are similar to SRAM-based FPGAs, but do not have any programmable logic blocks.
• Virtual logic: This is a kind of dynamically reconfigurable logic device that features partial reconfiguration capability. This mechanism allows part of the device to be reprogrammed dynamically while the rest of the device is executing user-defined logic. In other words, different logic circuits can time-share the same part of the device.

One limitation of building customized processors on FPGAs is the lack of hardware resources for both a complete processor core and a specialized instruction set. A functional processor core and a few hardware-intensive instruction modules can quickly consume all the resources of even the largest FPGAs available today. One technique used to provide more hardware resources is to reconfigure the FPGA during application execution. By constantly removing idle hardware from an FPGA, on-chip real estate can be recovered, and more resources become available than are offered by a one-time configured device [4][5].

Working with reconfigurable hardware can be compared to working with caches. A cache has a shorter access time and a higher bandwidth than main memory. Similarly, the execution time of functional blocks is much shorter in hardware than in software. However, the functional block must first be present in the reconfigurable coprocessor. In other words, we need a function block hit, which is analogous to a cache hit. Further, both reconfigurable hardware and caches exploit locality, a property of most programs. Spatial locality, with reference to caches, means that given a reference to a particular data location in memory, there is a high probability that other references will be made to data in neighboring locations.
With reference to reconfigurable hardware, spatial locality depends on the organization of the main memory that stores the instructions used to reconfigure functional blocks in the coprocessor. Reconfigurable hardware can, however, make use of a different type of locality: functional locality. Programs do exhibit functional locality. For example, when working with a task such as coarse-grain image processing, it is likely that the following routine would be used.

    begin
    HIST image              ; execute the histogram operation on the image located at image
    MEDIAN image threshold  ; calculate the median pixel value on image and store in threshold
    THRESH threshold image  ; enhance binary contrast of image by thresholding
    COPY image, image       ; copy entire image to a new location
    ERODE image             ; perform morphological erosion to image
    DIFF image, image       ; subtract image from image to obtain outline
    end

Now, a reference to the functional block HIST implies that the next few references are likely to be MEDIAN, THRESH, COPY, ERODE, and DIFF, in that order. This high probability of a sequence of references to functional blocks that pertain to a definite task is called functional locality.

Temporal locality can be exploited by both caches and reconfigurable hardware. Given some references to a group of locations, it is highly probable that the same locations will be referenced again in the near future. This is a common characteristic of most programs due to the presence of loops.

On the other hand, the dissimilarities include the absence of write policies in reconfigurable hardware and differences in replacement algorithms. Reconfigurable hardware does not need write policies, because functional blocks that are no longer necessary can simply be overwritten. Replacement techniques in caches overwrite lines that are deemed useless by replacement policies such as LRU, FIFO, or RANDOM. Replacement techniques in hardware, however, have to acknowledge the physical structure of the reconfigurable coprocessor: some modules in digital systems require fixed locations in hardware because of strict global and local physical constraints.

III. GENERIC PERFORMANCE MODEL

The generic reconfigurable coprocessor model specifies the following system parameters.

• $t_b$: normal block execution time. This is the average time required to execute a functional block on the processor in software, without reconfigurable hardware.
• $k$: speedup. This is how much faster a functional block executes, on average, in the reconfigurable hardware than in a software implementation.
• $t_c$: function block call time. This is the time required to call a functional block in the reconfigurable hardware.
• $t_p$: reconfiguration programming time. This is the time required to program a functional block into the system if it is not already present.
• $P_h$: probability of a hit. This is the probability that a functional block is already present in the reconfigurable hardware, i.e., that it does not need to be programmed.
• $t_n$: normal execution time between function blocks. This is the time the processor spends executing between function blocks.

The general performance model executes as shown in Figure 1. It is evident from the flowchart that details such as tracking the available on-chip hardware real estate and replacement policies are omitted. Figure 2 and Figure 3 present the timelines of the running system.

$$Time_{software} = t_n + t_b$$

$$Time_{hardware} = t_n + P_h \cdot t_c + (1 - P_h)(t_c + t_p)$$

$$k_{speedup} = \frac{Time_{software}}{Time_{hardware}} = \frac{t_n + t_b}{t_n + P_h \cdot t_c + (1 - P_h)(t_c + t_p)}$$

The total execution time for a normal processor is shown in Figure 4. As is evident, $t_b$, the normal block execution time, and $t_n$, the normal execution time between function blocks, are equally important in reducing the total execution time. Turning to the reconfigurable processor, we have four variables: $t_n$, $P_h$, $t_c$, and $t_p$. Figures 5, 6, and 7 show the total execution time with all the parameters varying. Observing the plots, one can see that for a high function block call time and a low block reconfiguration time, the hit rate does not matter. Figure 8 reveals the break-even line for the two systems. Higher hit rates and lower reconfiguration times help to break even earlier. An indication of an increase in speedup is the widening of the gap between the two layers after breaking even.

Fig. 1. General Performance Model Flowchart. (Boxes: instruction fetch; is instruction a functional block?; is block present in reconfigurable coprocessor?; fetch functional block and program into the coprocessor; instruction execute.)

Fig. 2. Timeline when executing in software. (Segments: BASIC $t_n$, $t_b$.)

Fig. 3. Timeline when executing in reconfigurable co-processor. (Segments: BASIC $t_n$, $t_c$, $t_p$.)

Fig. 5. Total Time for a Reconfigurable System, $P_h = 0.9$, $t_n = 0.0, 0.5, 1.0$ units. (Axes: function block call time, reconfiguration time, total execution time.)

Fig. 6. Total Time for a Reconfigurable System, $P_h = 0.5$, $t_n = 0.0, 0.5, 1.0$ units. (Axes: function block call time, reconfiguration time, total execution time.)

IV. APPLICATION PERFORMANCE MODEL

The application is characterized by a main iterative loop in which a core operation is defined in a functional block. The basic structure of the algorithm is:

    initialization
    for i = 1 to N
        Basic Computation
        Function Block Computation
    cleanup

1) Expected execution time for normal processor: Since a normal processor runs the function block in software,

$$Time_{normal} = t_{init} + t_{basic} \cdot N + t_b \cdot N + t_{cleanup}$$

2) Expected execution time for reconfigurable system: The system suffers a penalty of $t_p$ units for a function block miss (the analogue of a cache miss) only during the first iteration of the loop. The remaining iterations are executed in the coprocessor without a miss. Hence,

$$Time_{reconfig} = t_{init} + t_{basic} \cdot N + t_c \cdot N + t_p + t_{cleanup}$$

3) Speedup:

$$Speedup = \frac{Time_{normal}}{Time_{reconfig}} = \frac{t_{init} + t_{basic} \cdot N + t_b \cdot N + t_{cleanup}}{t_{init} + t_{basic} \cdot N + t_c \cdot N + t_p + t_{cleanup}}$$

Fig. 4. Total Time for a Normal Processor. (Axes: execution time between function blocks, normal block execution time, total execution time.)

Fig. 7. Total Time for a Reconfigurable System, $P_h = 0.1$, $t_n = 0.0, 0.5, 1.0$ units. (Axes: function block call time, reconfiguration time, total execution time.)

Fig. 8. Break Even Line, $P_h = 0.9$, $t_n = 0.5$ units for reconfigurable system. (Surfaces: normal processor, reconfigurable system. Axes: function block call time and normal execution time between function blocks; normal block execution time and reconfiguration time; total execution time.)

V. CONCLUSIONS

The performance benefit of a reconfigurable system is visible only if the hit rates are favorable; otherwise, the reconfiguration times must be low. Factors such as intelligent pre-fetching of function blocks and replacement policies, which would play a key role in performance and could enrich the analysis, have not been considered. Parallelizing the operation of the reconfigurable coprocessor could also pay rich dividends. Reconfigurable systems can fare better than their ASIC counterparts only in applications where function blocks run sufficiently slowly in software and the task being worked on needs a multitude of function blocks. In other words, "reconfigurable-worthy" applications have to be identified to warrant the use of reconfigurable systems.

REFERENCES

[1] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Workshop on FPGAs for Custom Computing Machines, pp. 24-33, 1997.
[2] T. Miyazaki, "Reconfigurable Systems: A Survey," NTT Optical Network Systems Laboratories, A1-329S, 3-1 Morinosato Wakamiya, Atsugi, 243-01 Japan.
[3] E. Sanchez et al., "Static and Dynamic Configurable Systems," IEEE Trans. on Computers, vol. 48, no. 6, pp. 556-563, June 1999.
[4] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, "The Chimaera Reconfigurable Functional Unit," IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[5] M. J. Wirthlin and B. L. Hutchings, "DISC: The Dynamic Instruction Set Computer," IEEE Symposium on FPGAs for Custom Computing Machines, 1995.
[6] K. Compton and S. Hauck, "Reconfigurable Computing: A Survey of Systems and Software," ACM Computing Surveys, 2002.
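As a numerical companion to the generic model (Section III) and the application model (Section IV), the following minimal Python sketch evaluates the timing and speedup expressions given in the text. The function and argument names are illustrative assumptions and do not come from the original work. The helper break_even_hit_rate is not stated in the paper; it is derived here by equating the two per-block times, which gives t_c + (1 - P_h) * t_p = t_b. The sample values in the main block are arbitrary.

```python
"""Sketch of the generic and application performance models (names are assumed)."""


def time_software(t_n, t_b):
    """Per-block time on a normal processor: t_n + t_b."""
    return t_n + t_b


def time_hardware(t_n, t_c, t_p, p_h):
    """Expected per-block time with the coprocessor.

    A hit (probability p_h) costs t_c; a miss costs t_c + t_p.
    """
    return t_n + p_h * t_c + (1.0 - p_h) * (t_c + t_p)


def generic_speedup(t_n, t_b, t_c, t_p, p_h):
    """System-level speedup: Time_software / Time_hardware."""
    return time_software(t_n, t_b) / time_hardware(t_n, t_c, t_p, p_h)


def break_even_hit_rate(t_b, t_c, t_p):
    """Smallest hit rate at which the coprocessor is no slower than software.

    Derived (assumption, not in the text) from t_c + (1 - p_h) * t_p = t_b.
    Returns None when t_c > t_b, since no hit rate can then break even.
    """
    if t_c > t_b:
        return None
    if t_p <= t_b - t_c:
        # Even a miss on every call is no slower than software.
        return 0.0
    return 1.0 - (t_b - t_c) / t_p


def application_speedup(n, t_init, t_basic, t_b, t_c, t_p, t_cleanup):
    """Speedup of the N-iteration loop; t_p is paid once, on the first iteration."""
    time_normal = t_init + t_basic * n + t_b * n + t_cleanup
    time_reconfig = t_init + t_basic * n + t_c * n + t_p + t_cleanup
    return time_normal / time_reconfig


if __name__ == "__main__":
    # Arbitrary sample values, in the same abstract time units as the figures.
    for p_h in (0.9, 0.5, 0.1):
        print("P_h =", p_h, "speedup =", generic_speedup(0.5, 1.0, 0.2, 0.8, p_h))
    print("break-even P_h:", break_even_hit_rate(1.0, 0.2, 1.6))
    print("loop speedup:", application_speedup(100, 1.0, 0.5, 1.0, 0.2, 0.8, 1.0))
```

Because the reconfiguration penalty appears only once in the loop model, the application speedup approaches (t_basic + t_b) / (t_basic + t_c) as N grows, whereas the generic model pays the expected miss penalty on every block.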