


An On-Chip Network Fabric Supporting Coarse-Grained Processor Array

Phi-Hung Pham, Phuong Mau, Jungmoon Kim, and Chulwoo Kim

Abstract: Coarse-grained arrays (CGAs) with run-time reconfigurability play an important role in accelerating reconfigurable computing applications. It is challenging to design on-chip communication networks (OCNs) for such CGAs with dynamic run-time reconfigurability while satisfying the tight power and area budgets of an embedded system. This paper presents a silicon-proven design of a 64-PE circuit-switched OCN fabric with a dynamic path-setup scheme capable of supporting an embedded coarse-grained processor array. A proof-of-concept test chip fabricated in a 0.13-µm CMOS process occupies a silicon area of 23 mm² and consumes a peak power of 200 mW at 128 MHz and 1.2 V Vcc at room temperature. The OCN accounts for 9.4% of the area and 18% of the power of the total chip. Experimental results and analysis show that the proposed OCN fabric with its dynamic path setup is suitable for use in an embedded CGA supporting fast run-time reconfigurability.

Index Terms: Coarse-grained array (CGA), network-on-chip (NoC), on-chip communication network, reconfigurable computing.

Fig. 1. Reconfigurable computing bridges the gap between ASICs and traditional general-purpose processors [1].

I. INTRODUCTION AND MOTIVATION

Reconfigurable computing devices accelerate system performance by combining the area and power efficiencies of application-specific integrated circuits (ASICs) with the flexibility of traditional general-purpose processors (GPPs) (see Fig. 1) [1]. The dynamic reconfigurability of such devices plays a key role in allowing the system to adapt to different computation workloads at run time [1], [2]. Among the spectrum of run-time reconfigurable devices, coarse-grained arrays (CGAs) are favored over field-programmable gate arrays (FPGAs) because of their lower reconfiguration overhead at coarse granularity [1]. The current design approach of using heterogeneous intellectual properties (IPs) for reconfigurable devices can be considered the upper bound of CGAs, with maximum granularity [3].

Because of its flexibility in supporting communication, the emerging network-on-chip (NoC) approach is an ideal solution for interconnection within a reconfigurable system [4]. The NoC approach is exploited to support the on-chip interconnection of dynamically reconfigurable devices at scales ranging from intra-IP to inter-IP [3], [5]-[9]. In most reconfigurable systems, such as Morpheus [3], MAGALI [6], and ReCore [8], the NoC approach is used for inter-IP interconnection, where the IPs can be (multi-grained) processors, CGA accelerators, dedicated digital signal processors (DSPs), or even FPGAs. Another approach, RAMPSoC [9], is based on an FPGA implementation to provide more dimensions of reconfigurability [2]. In this approach, the inter-IP NoC configuration is adapted at run time by exploiting the partial and dynamic reconfigurability of the FPGA [2], [9].
Manuscript received February 16, 2011; revised July 13, 2011; accepted December 07, 2011. Date of publication January 31, 2012; date of current version December 19, 2012. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (2011-0020128). The authors are with the School of Electrical Engineering, Korea University, Seoul, 136-713, South Korea (e-mail:; kr). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier 10.1109/TVLSI.2011.2181546

This work focuses on the design of an on-chip network¹ (OCN) for a CGA, which is an essential accelerator in dynamically reconfigurable systems. A CGA accelerator is typically built from an array of arithmetic logic unit (ALU)-like processing elements (PEs) in order to exploit parallel and/or pipelined operations [1], [3], [8], [10], [11]. In most CGAs, the inter-PE interconnection is provided by a network of multiplexing logic gates statically controlled by a configuration memory [1], [8], [10], [11].

Designing an OCN for a CGA, which typically computes at 16-bit, 8-bit, or smaller granularities, requires particular consideration. First, because of the requirement for dynamic run-time reconfigurability (e.g., as required in cognitive systems [6]), the reconfiguration time must be short enough to meet run-time requirements. Second, most CGAs accelerate real-time processing applications, so the OCN must support hard guaranteed communications. Third, the PE of a CGA is compact because of its coarse-grained granularity. Therefore, the overhead introduced by the OCN must be reasonably small in order to preserve system efficiency.

Inter-IP OCNs that support guaranteed throughput in practice [3], [5], [6], [12] typically use wormhole-switched (packet-switched) networks with prioritized packets and a deadlock-free routing scheme [3], [6], [12]. However, the queuing buffers required at their routers to hold conflicting data greatly increase the OCN overhead. A circuit-switched OCN [5] easily supports hard guaranteed communications once its circuits are configured. This approach particularly suits on-chip implementation because of its compact overhead. However, the static (pre-run-time) configuration of the circuits makes it difficult to map data-flow graphs (DFGs) at run time.
This paper presents a novel silicon-proven OCN fabric for a coarse-grained processor array, which uses a circuit-switching scheme with a dynamic path setup capable of dynamically supporting guaranteed throughput among 64 PEs [7]. The proposed OCN is area- and energy-efficient and is suitable for use in a CGA. In addition, the distributed dynamic path setup brings system flexibility and scalability for run-time mapping of DFGs.

The rest of this paper is organized as follows. Section II presents the coarse-grained processor array interconnected by the proposed OCN. Section III describes the design and its implementation along with measurement results. Section IV discusses the dynamic path-setup scheme of the proposed OCN, which is able to support run-time mapping of DFGs. Finally, Section V concludes this paper and outlines further research.
¹This paper limits the discussion to the OCN for interconnection within a CGA (i.e., an intra-IP OCN). This scope should not be confused with the traditional inter-IP OCNs for reconfigurable systems described in [3], [5], [6], [12].

1063-8210/$31.00 © 2012 IEEE



Fig. 2. Coarse-grained processor array interconnected by the proposed OCN.

II. PROPOSED ON-CHIP NETWORK SUPPORTING COARSE-GRAINED PROCESSOR ARRAY

To design the proposed OCN, we first discuss the design issues of topology selection and the path-setup scheme.

A. Topology Consideration

In practice, the mesh topology is widely used because of its regularity and ease of layout in conventional 2-D chips [5], [12]. A torus topology has half the network diameter and twice the number of bisection connections of a mesh accommodating the same number of PEs [13]. For this reason, we consider the folded torus, the laid-out version of a torus on a 2-D chip, as the topology of interest. We propose to use a dual-lane torus as the topology of our OCN. The dual-lane arrangement is adopted as a tradeoff between the area overhead and the path diversity of the proposed OCN.

B. Distributed Probing Path-Setup Scheme

In a pipelined circuit-switched network, communications typically occur in three phases, namely the setup (or configuration), transmission, and release phases [7]. In the setup phase, a dynamic and distributed path-setup scheme is critical for reducing setup latency and for system scalability. We propose to use a distributed path-setup scheme, in which a probe containing the destination address is sent from the source in order to establish a communication path towards the destination. If there is congestion, the probe backtracks under a backtracking algorithm to search for alternative links rather than waiting for the busy link to become idle. We use the exhaustive profitable backtracking (EPB) algorithm in order to achieve the energy efficiency of minimal routing [14]. A previous work [15] proves that an EPB-based path-setup scheme is deadlock-free and livelock-free. Each switching node maintains its own probing activity. Hence, the path-setup scheme implemented in our design is distributed.

In the transmission phase, the reserved paths allow transmission of guaranteed data with a source clock pipelined from source to destination.
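The probing behavior described above can be illustrated with a small software model. The following Python sketch is not the hardware implementation: it is a centralized, single-lane simulation under stated assumptions, in which a probe only ever takes "profitable" links (those that reduce the torus distance to the destination) and backtracks when every profitable link at a node is already reserved. The names `probe`, `reserve`, and `profitable_links`, the 8 x 8 coordinate scheme, and the order in which links are tried are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]

N = 8  # assumed 8 x 8 torus (64 nodes); only one lane is modeled here


def ring_steps(a: int, b: int) -> List[int]:
    """Steps (+1 and/or -1) that shorten the ring distance from a to b."""
    d = (b - a) % N
    if d == 0:
        return []
    if d < N - d:
        return [1]
    if d > N - d:
        return [-1]
    return [1, -1]  # exactly half-way round: both directions are minimal


def torus_distance(src: Node, dst: Node) -> int:
    """Minimal hop count between two nodes of the torus."""
    dx = (dst[0] - src[0]) % N
    dy = (dst[1] - src[1]) % N
    return min(dx, N - dx) + min(dy, N - dy)


def profitable_links(node: Node, dst: Node) -> List[Node]:
    """Neighbors of `node` that are one hop closer to `dst`."""
    x, y = node
    nxt = [((x + s) % N, y) for s in ring_steps(x, dst[0])]
    nxt += [(x, (y + s) % N) for s in ring_steps(y, dst[1])]
    return nxt


def probe(node: Node, dst: Node, busy: Set[Link],
          path: Tuple[Node, ...] = ()) -> Optional[List[Node]]:
    """Depth-first search over profitable links; None means setup failed."""
    here = list(path) + [node]
    if node == dst:
        return here
    for nxt in profitable_links(node, dst):
        if (node, nxt) in busy:
            continue  # link already reserved by another circuit
        found = probe(nxt, dst, busy, tuple(here))
        if found is not None:
            return found
    return None  # every profitable link exhausted: backtrack one hop


def reserve(path: List[Node], busy: Set[Link]) -> None:
    """Mark every directed link along `path` as reserved."""
    busy.update(zip(path, path[1:]))
```

Because every hop strictly reduces the distance to the destination, the search always terminates, and any path it returns is minimal. Calling `reserve` on a returned path and probing again between the same endpoints shows the probe routing around the occupied links. In the chip, this search is carried out by the switching nodes themselves rather than by a central routine.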
The source-synchronous pipeline scheme is feasible and efficient for global on-chip communications [5]. In the release phase, the reserved path is released on a hop-by-hop basis from source to destination.

C. Proposed On-Chip Network

The distributed probing path setup is combined with the good path diversity of the dual-lane torus architecture to provide circuit-switched guaranteed communications in our OCN. Fig. 2 presents our CGA platform, which is interconnected by the proposed 64-PE OCN with a dual-lane folded-torus topology. Each tile contains a programmable RISC-based PE, a programming control unit, a communication wrapper, and a switching node comprising two probing switches for dual-lane communication. The programming control unit

Fig. 3. (a) Proposed probing switch architecture, based on (b) FSM implementation of Ctrl_In and (c) MUX-based crossbar structure.

allows run-time updating of the program memory in each PE. Fig. 2 also shows the compact switch-to-switch interconnection, including the Request and Answer handshake signals. The 1-bit Request signal denotes two states: the on-probing (or path-setup) state and the circuit-idling state. The 2-bit Answer signal takes one of three values that direct backpressure flow control. An Answer of 01 means that the receiver acknowledges the path-setup request from the sender. An Answer of 10 indicates that the setup is blocked by the network, forcing the probe to backtrack to find alternative paths. An Answer of 11 is reserved for the case in which the receiver is not ready for the path-setup request (e.g., overflow at the receiving buffer). The probe has 6 bits for the destination address, 1 bit for lane identity, and 1 bit for intra-lane priority.

III. DESIGN AND IMPLEMENTATION

Our test chip has two main components: the probing switch and the processing element (PE). They communicate with each other through a software-driven wrapper.

A. Probing Switch Design

The probing switches, which constitute the switching node, are networked to construct a guaranteed-throughput lane in our OCN. The probing switch performs the distributed probing path setup and supports source-synchronous communication. Fig. 3(a) shows the proposed probing switch architecture, which is derived from the switch architecture of [15]. In the proposed switch, the Ctrl_Ins process incoming probes from the neighboring switches and the PE. Fig. 3(b) presents the simplified FSM implementation of Ctrl_In. When in the Probing state, Ctrl_In compares the current switch address with the destination address contained in the probe and dynamically constructs a lookup table of possible outputs to which the probe can be routed. Based on output availability and feedback from the downstream switches, the Ctrl_In may change into the ACK, nACK, or Backtrack state.
ARBITER arbitrates the incoming requests from the Ctrl_Ins based on the priority bit contained in the probe; it acts as a small crossbar that interconnects the internal Request and Answer signals between the Ctrl_Ins and the corresponding Ctrl_Outs. The Ctrl_Outs perform a handshake with the downstream switch and control the CROSSBAR. When locked into a specific selecting state, the Ctrl_Out controls the CROSSBAR in order to establish a direct connection from the Data_In to the target Data_Out. As shown in Fig. 3(c), the multiplexer-based structure of CROSSBAR easily allows direct passing of a source-synchronous transmission. This structure is also readily synthesized in a standard-cell based design [7].
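The 8-bit probe and the 2-bit Answer codes described in Section II-C can be written down as a small encoding sketch. The field widths (6-bit destination, 1-bit lane, 1-bit priority) and the three Answer values come from the text; the bit ordering inside the probe word and the names `Answer`, `pack_probe`, and `unpack_probe` are assumptions made here for illustration, since the paper does not specify them.

```python
from enum import IntEnum
from typing import Tuple


class Answer(IntEnum):
    """2-bit Answer codes; symbolic names are illustrative."""
    ACK = 0b01        # receiver acknowledges the path-setup request
    BLOCKED = 0b10    # setup blocked by the network: probe must backtrack
    NOT_READY = 0b11  # receiver not ready (e.g., receive-buffer overflow)


def pack_probe(dest: int, lane: int, priority: int) -> int:
    """Pack a probe into 8 bits as [dest:6 | lane:1 | priority:1].

    The field order is an assumption; only the widths are from the text.
    """
    if not 0 <= dest < 64:
        raise ValueError("destination address must fit in 6 bits")
    if lane not in (0, 1) or priority not in (0, 1):
        raise ValueError("lane and priority are single-bit fields")
    return (dest << 2) | (lane << 1) | priority


def unpack_probe(word: int) -> Tuple[int, int, int]:
    """Inverse of pack_probe: return (dest, lane, priority)."""
    return (word >> 2) & 0x3F, (word >> 1) & 0x1, word & 0x1
```

Six destination bits are exactly enough to address the 64 PEs of the array, which is why the probe fits in a single byte.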



Fig. 4. Programmable RISC-based processing element.

Fig. 5. (a) Measured frequency and power versus Vcc scaling and (b) calculated power efficiencies (64 PE-tiles activated).

B. Processing Element

Fig. 4 shows the structure of the programmable PE. The PE is an 8-bit RISC with only 33 single-word, single-cycle instructions. This design makes PE programming simple and predictable. The RISC uses a Harvard architecture (a 512-word × 12-bit program memory, a 25-byte register file, and a 64-word × 8-bit data scratchpad memory) to support basic computations and to control the software-driven wrapper. The RISC's ALU includes an 8-bit multiplier that performs an 8-bit multiplication in one clock cycle. The PE can be programmed to enter sleep mode (leakage power consumption only) by a soft SLEEP command and can then be woken up by a programmable internal timer. The PE's built-in UART and 12-bit bidirectional I/O interface are used for off-chip communication and debugging.

A communication wrapper [7] is placed between a PE and its corresponding switching node. It allows end-to-end flow control to be handled at the software level (in C) and supports a source-synchronous communication scheme at the circuit level. The wrapper connects to the system bus of the PE through control/status registers and data registers. The wrapper includes Tx/Rx first-in first-out (FIFO) buffers for implementing source-synchronous transmission.

C. Test Chip and Measurement Results

The test chip was designed in a high-density 1P6M 0.13-µm CMOS standard-cell technology, fabricated, and packaged in a 208-pin LQFP. A four-level H-tree mesochronous clock network delivers a 50%-duty-cycle clock from a clock input buffer to the 64 PE tiles. Simple looping operations on 8/16-bit fixed-point numbers are compiled and run on all PEs to measure peak performance. At room temperature, the measured peak power consumptions are 38.2 mW at 42 MHz (Vcc = 0.8 V) and 200 mW at 128 MHz (Vcc = 1.2 V), with energy efficiencies of 68.4 and 40.7 GOPS/W, respectively. Fig. 5 plots the measured performance, power, and energy efficiency of the total platform under Vcc and frequency scaling.
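As a rough cross-check of the two reported operating points, the short Python sketch below derives the throughput implied by each power and efficiency pair, along with the corresponding energy per operation. The input numbers are taken from the measurements above. Comparing the implied throughput against 64 PEs each completing one operation per clock cycle is an assumption made here, not a statement from the paper, and the helper names are hypothetical.

```python
NUM_PES = 64

# (Vcc in V, clock in MHz, peak power in mW, reported efficiency in GOPS/W)
OPERATING_POINTS = [
    (0.8, 42.0, 38.2, 68.4),
    (1.2, 128.0, 200.0, 40.7),
]


def implied_gops(power_mw: float, gops_per_watt: float) -> float:
    """Throughput implied by a reported power and energy efficiency."""
    return gops_per_watt * power_mw / 1000.0


def ops_per_pe_cycle(gops: float, f_mhz: float) -> float:
    """Implied operations per PE per clock cycle (assumes all 64 PEs busy)."""
    return gops * 1000.0 / (NUM_PES * f_mhz)


def picojoules_per_op(gops_per_watt: float) -> float:
    """Energy per operation: 1 W / 1 GOPS = 1 nJ/op = 1000 pJ/op."""
    return 1000.0 / gops_per_watt


if __name__ == "__main__":
    for vcc, f_mhz, p_mw, eff in OPERATING_POINTS:
        gops = implied_gops(p_mw, eff)
        print(f"Vcc={vcc} V: {gops:.2f} GOPS, "
              f"{ops_per_pe_cycle(gops, f_mhz):.3f} op/PE/cycle, "
              f"{picojoules_per_op(eff):.1f} pJ/op")
```

Under that assumption, both operating points land slightly below one operation per PE per cycle, which is consistent with single-cycle instructions running in a tight loop.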
The OCN occupies 9.4% of the area and consumes 18% of the power of the total chip. The OCN provides a bisection bandwidth of 44.54 Gb/s at 174 MHz (Vcc = 1.6 V). Fig. 6 presents the die photo and a summary of the test chip. There are 178 user-dedicated I/O pins (padded out from the edge PEs), providing a total I/O bandwidth of 22.8 Gb/s at 128 MHz. The I/O capability can be increased by using a chip package with a higher pin count.

IV. PATH-SETUP ANALYSIS AND DISCUSSION

As mentioned in Section I, on-chip communication plays an important role in a dynamically reconfigurable system. Its effectiveness depends on how flexibly the OCN can support run-time DFGs. This section addresses the

Fig. 6. Test chip die photo and summary.

capability of the dynamic and distributed path-setup scheme to adapt the system to the arbitrary DFGs that arise in run-time mappings. We limit the discussion to how flexibly the OCN supports the mappings rather than to application optimization. Therefore, the applications are chosen to be simple enough to remove the mapping complexity while still exposing various types of DFGs.

A. Example 1: Run-Time Implementation of a Finite-Impulse Response (FIR) Filter

FIR filters are used in many real-time DSP systems to perform signal preconditioning, anti-aliasing, band selection, low-pass filtering, video convolution, decimation/interpolation, and similar operations. This example shows the implementation of an 8-bit, 8-tap linear-phase FIR filter with symmetric coefficients. Fig. 7 presents the FIR computation graph mapped onto 11 PEs. The first-row PEs, i.e., (0, 7), (1, 7), (2, 7), and (3, 7), perform delayed additions with delay factors of 7, 5, 3, and 1, respectively. Therefore, the first-row data flow is simplified, and x(n) can be moved circularly from (0, 7) to (3, 7) [see Fig. 7(b)]. The second-row PEs, i.e., (0, 6), (1, 6), (2, 6), and (3, 6), perform multiplications with fixed coefficients. The cycle-accurate model of the implementation shows that the computation time (Tcomp, by the PEs) and the communication time (Tcomm, by the OCN with FIFO-based wrappers) can overlap in every pipeline stage. It is observed that the dual-lane configuration of the OCN provides more paths than are needed. Therefore, all the one-time set-up guaranteed paths for



Fig. 7. Implementation of an 8-tap linear phase response FIR: (a) the computational graph and (b) the DFG mapped into the processor array.

this real-time DFG can remain in place during the application run. Combined with the measurement results, the model estimates that this implementation provides a real-time FIR throughput of around 21.3 Msps at 128 MHz (with a power consumption of 23 mW). Assuming that the programs are already stored in the memories of the PEs, the launching time of this mapping (calculated from the maximum path-setup latency) is only around 63 ns (i.e., 8 cycles).

B. Example 2: Run-Time Implementation of a 16-Point FFT

The FFT is one of the most popular algorithms used in real-time DSP applications. This example illustrates a run-time implementation of a 16-point radix-2 decimation-in-frequency FFT. It is assumed that the data widths of the input x(n) and the output X(n) are 8 and 16 bits, respectively. The implementation computes with 16-bit fixed-point numbers having 8 fractional bits. The computation graph and the communication flows mapped onto 32 PEs are shown in Fig. 8(a) and (b), respectively. In Fig. 8(b), the first-stage PEs act as commutators, which steer the data flows to the second stage. Because the required number of direct data flows exceeds the number of physical paths in the communications from the first to the second stage and from the second to the third stage, these flows need to be scheduled (time-shared) in each lane. Following this schedule, the wing flows occur in time slot t1 and the traverse flows occur in time slot t2 of the inter-stage communication duration, as shown in Fig. 8(b). The worst-case path-setup latency of the inter-stage communications is calculated from the cycle-accurate model in order to specify the timing margin for the schedule. The cycle-accurate model estimates that this implementation processes one FFT every 110 cycles, with a processing latency of 380 cycles. Combined with the measurement results, this implementation provides a real-time FFT throughput of approximately 18.6 Msps with a processing latency of about 3 µs per FFT at 128 MHz, and consumes about 63 mW.
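The time and throughput figures quoted for the two examples follow from the cycle counts of the cycle-accurate model and the 128 MHz clock. The Python sketch below reproduces that conversion as a sanity check; the helper names are hypothetical, and counting 16 samples per FFT block for the throughput figure is an interpretation assumed here.

```python
F_CLK_HZ = 128e6  # clock frequency used for both examples


def cycles_to_ns(cycles: int, f_clk_hz: float = F_CLK_HZ) -> float:
    """Convert a cycle count into nanoseconds at the given clock."""
    return cycles / f_clk_hz * 1e9


def throughput_msps(samples: int, cycles: int,
                    f_clk_hz: float = F_CLK_HZ) -> float:
    """Sample rate in Msps when `samples` are completed every `cycles`."""
    return samples * f_clk_hz / cycles / 1e6


if __name__ == "__main__":
    # Example 1: FIR launch time from the 8-cycle maximum path-setup latency
    print(f"FIR launch: {cycles_to_ns(8):.1f} ns")
    # Example 2: one 16-point FFT every 110 cycles, 380-cycle latency
    print(f"FFT throughput: {throughput_msps(16, 110):.1f} Msps")
    print(f"FFT latency: {cycles_to_ns(380) / 1000:.2f} us")
```

The results (62.5 ns, about 18.6 Msps, and about 2.97 µs) agree with the rounded values reported in the text.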
The mapping can be launched within 34 cycles, or 266 ns at 128 MHz.

The above implementation examples show that the proposed OCN can flexibly guarantee the run-time mapping of DFGs in both the space and time domains. In the space domain, the dynamic path setup supports DFGs whose data flows span mainly local distances (as in Example 1) as well as global distances (as in Example 2). In the time domain, the path-setup scheme, with its short setup time, is able not only to launch the mappings quickly (as in both examples) but also to support time-scheduled flows while an application is running (as in Example 2).

Table I compares our OCN with other related silicon-proven OCNs for CGAs from an on-chip networking viewpoint. These OCNs use mesh-based topologies that are feasible for 2-D chip layout and support pre-configured data paths suitable for hard real-time on-chip communication. Due to the heterogeneity of the design approach and data

Fig. 8. Implementation of a 16-point FFT: (a) computational graph and (b) the DFG mapped into the processor array.

availability, a direct comparison of metrics is difficult. Nevertheless, Table I suggests some interesting points of comparison, which follow from our circuit-switching approach with dynamic path setup. Compared to a static circuit-switched OCN [5] with a similar tile-based layout, our approach incurs a comparable OCN overhead (9.4% versus 7%) while additionally offering run-time reconfigurability. Compared to XPP-III [10] and PiCoGA-III [11] implemented in Morpheus [3], the designed coarse-grained processor array with the proposed OCN appears suitable (in terms of area overhead, power, and guaranteed bandwidth) for use as an embedded CGA in a reconfigurable system. In particular, the reconfiguration latency of the OCN (tens of cycles) is shorter than those of XPP-III [10] and PiCoGA-III [11] (hundreds or even thousands of cycles). This advantage results from the dynamic and distributed path-setup scheme, in which the DFGs can be set up in parallel. In contrast, XPP-III [10] and PiCoGA-III [11], as well as the interconnection within the acceleration tile of the ReCore system [8], called Montium, use a centralized scheme that configures DFGs by updating a global configuration memory.

It is worth noting that the OCNs listed in Table I target silicon implementation. They differ from the RAMPSoC approach [9], in which the OCN configuration, combining both circuit-switching and packet-switching schemes, is adapted at run time. That approach can result in an architecture optimized for the application [2]. However, the adaptation may require a long reconfiguration time because of the bit-level reconfigurability of the FPGA implementation. We believe that the fast reconfiguration resulting from our dynamic and decentralized approach is well suited to future fast reconfigurable applications, such as cognitive radio [6].

V. CONCLUSION

This paper presented a compact silicon-proven OCN fabric implemented in a 0.13-µm CMOS technology for use in a coarse-grained processor array. The implemented circuit-switching scheme provides pre-configured data paths that are suitable for guaranteeing hard real-time on-chip communication. The dynamic and distributed path-setup scheme supports run-time reconfigurable applications that require fast reconfigurability. This work also introduced a class of OCNs applicable to embedded coarse-grained reconfigurable arrays with run-time reconfigurability. Future work will focus on developing a tool chain for implementing reconfigurable computing applications from a high-level language.




ACKNOWLEDGMENT

The authors would like to thank the IC Design Education Center (IDEC) and the Korea Ministry of Knowledge Economy (MKE) for the fabrication of the chip.

REFERENCES

[1] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proc. Design, Autom., Test Euro. (DATE), 2001, pp. 642-649.
[2] M. Platzner, J. Teich, and N. Wehn, Dynamically Reconfigurable Systems: Architectures, Design Methods and Applications. New York: Springer, 2010.
[3] D. Rossi, F. Campi, S. Spolzino, S. Pucillo, and R. Guerrieri, "A heterogeneous digital signal processor for dynamically reconfigurable computing," IEEE J. Solid-State Circuits, vol. 45, no. 8, pp. 1615-1626, Aug. 2010.
[4] S. Rodrige, I. S. Silva, and A. Azevedo, "When reconfigurable architecture meets network-on-chip," in Proc. Symp. Integr. Circuits Syst. Design, 2004, pp. 216-221.
[5] D. N. Truong, W. H. Cheng, T. Mohsenin, Y. Zhiyi, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, X. Zhibin, E. W. Work, J. W. Webb, P. V. Mejia, and B. M. Baas, "A 167-Processor computational platform in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1130-1144, Apr. 2009.
[6] F. Clermidy, C. Bernard, R. Lemaire, J. Martin, I. Miro-Panades, Y. Thonnart, P. Vivet, and N. Wehn, "A 477 mW NoC-based digital baseband for MIMO 4G SDR," in ISSCC Dig. Tech. Papers, 2010, pp. 278-279.
[7] P.-H. Pham, P. Mau, and C. Kim, "A 64-PE folded-torus intra-chip communication fabric for guaranteed throughput in network-on-Chip based applications," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2009, pp. 645-648.
[8] G. K. Rauwerda, P. M. Heysters, and G. J. M. Smit, "Towards software defined radios using coarse-grained reconfigurable hardware," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 1, pp. 3-13, Jan. 2008.
[9] D. Göhringer, L. Bin, M. Hübner, and J. Becker, "Star-wheels network-on-chip featuring a self-adaptive mixed topology and a synergy of a circuit- and a packet-switching communication protocol," in Proc. Field Program. Logic Appl. (FPL), 2009, pp. 320-325.
[10] PACT XPP Technologies, Munich, Germany [Online]. Available: http://
[11] A. Lodi, C. Mucci, M. Bocchi, A. Cappelli, M. De Dominicis, and L. Ciccarelli, "A multi-context pipelined array for embedded systems," in Proc. Int. Conf. Field Program. Logic Appl., 2006, pp. 1-8.
[12] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-Tile Sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29-41, Jan. 2008.
[13] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proc. ACM/IEEE DAC, 2001, pp. 684-689.
[14] P. T. Gaughan and S. Yalamanchili, "A family of fault-tolerant routing protocols for direct multiprocessor networks," IEEE Trans. Parallel Distrib. Syst., vol. 6, no. 5, pp. 482-497, May 1995.
[15] P.-H. Pham, J. Park, P. Mau, and C. Kim, "Design and implementation of backtracking wave-pipeline switch to support guaranteed throughput in network-on-Chip," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 270-283, Feb. 2012.
