A Modular Coprocessor Architecture for Embedded Real-Time Image and Video Signal Processing

Holger Flatt, Sebastian Hesselbarth, Sebastian Flügel, and Peter Pirsch

Institut für Mikroelektronische Systeme, Gottfried Wilhelm Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany
{flatt,hesselbarth,fluegel,pirsch}@ims.uni-hannover.de

Abstract. This paper presents a modular coprocessor architecture for embedded real-time image and video signal processing. Applications are separated into high-level and low-level algorithms and mapped onto a RISC and a coprocessor, respectively. The coprocessor comprises an optimized system bus, different application-specific processing elements, and I/O interfaces. For low-volume production or prototyping, the architecture can be mapped onto FPGAs, which allows flexible extension or adaptation of the architecture. Depending on the complexity of the coprocessor data paths, frequencies of up to 150 MHz have been achieved on a Virtex II-Pro FPGA. Compared to a RISC processor, the performance gain for an SSD algorithm is more than a factor of 70.

1 Introduction

In recent years, the integration of smart image and video processing algorithms into sensor devices has increased. Applications like object detection, tracking, and classification demand high computing performance. It is desirable to have embedded signal processing integrated in the sensor, e.g. in video cameras, to perform image or video compression, filtering, and data reduction. Moreover, flexibility is mandatory where new applications have to be supported or existing code needs to be modified.

General purpose processors can be used as a first approach for embedded real-time image and video signal processing. They comprise instruction set extensions like SSE, which allow SIMD operations [1], but the efficiency of their execution units remains low [2]. Moreover, due to their high power consumption, they are not suitable for embedded systems. As an alternative to general purpose processors, digital signal processors (DSPs) provide high performance at low power consumption. In exploiting the inherent parallelism of algorithms, however, they are inferior to application-specific arithmetic cores [3]. Dedicated arithmetic cores provide the highest optimization potential for a specific application, but lack flexibility if support for different applications is required.

The analysis of the algorithmic hierarchy of image and video processing applications yields three layers of hardware abstraction [4]. High-level algorithms (HLA) consist of data-dependent decisions and control operations. They require high flexibility, while their reusability is low. Low-level algorithms (LLA) comprise simple computing operations that need high processing power. They are regular and have high potential for parallel execution. Medium-level algorithms (MLA) are situated between HLAs and LLAs. Depending on their complexity, they can be executed either on a RISC or on a coprocessor. It is advantageous if the hardware architecture combines processing cores that are optimized for the execution of algorithms associated with one of the layers.

In [5] a special programmable coprocessor was combined with a RISC. This coprocessor is optimized for processing of low-level algorithms. Due to its complexity, adaptations and extensions of the architecture are time-consuming. The combination of RISC and coprocessor considers the algorithmic complexity of embedded applications [6]. While dedicated processing elements inside the coprocessor compute time-consuming low-level operations, irregular high-level parts of the application are executed on a RISC core. A RISC processor supports mapping of complex HLAs with low development effort if a compiler is available. A coprocessor that is optimized for processing of low-level algorithms is superior to a RISC for these operations, but design and modification of dedicated units are time-consuming. Commonly used and feature-rich, commercial system-on-chip bus systems like AMBA AHB [7] require complex finite state machines for master units.

In this paper, a modular coprocessor architecture for a generic embedded system is proposed, which can be easily modified or extended for different applications. This embedded system, shown in Figure 1, comprises a RISC core, a reconfigurable coprocessor, data I/O, debug, and memory interfaces. The architecture is designed for accelerating different image and video signal processing algorithms. To reduce modularization efforts for peripheral and coprocessor units, a simplified multilayer communication bus is introduced. The focus is on FPGA implementations, although the architecture is also suitable for ASIC implementation. Due to their reconfigurability, FPGAs allow flexible extension or adaptation of the architecture. Actual FPGAs allow real-time signal processing of sophisticated algorithms. They provide high-speed communication interfaces, internal memories, and arithmetic cores [8] [9]. Moreover, some FPGAs contain embedded RISC processors [10]. These embedded cores allow the integration of a RISC with a coprocessor in one device [11].

Fig. 1. Embedded architecture and peripheral units

This paper is organized as follows. Chapter 2 gives a brief description of the proposed coprocessor architecture and the communication approach. Chapter 3 shows an application example and the design flow for dedicated processing elements. Subsequently, chapter 4 presents verification and results. Conclusions and an outlook to future work are given in chapter 5.

2 Embedded Coprocessor Architecture

2.1 Communication Approach

In order to utilize dedicated hardware acceleration units, a sufficient communication structure between RISC and coprocessor is needed. If the processes on the RISC and the coprocessor are frequently synchronized, communication latencies result in a high performance reduction. Aiming at a lower synchronization rate, a hierarchical control approach reduces the communication overhead [12]. Instead of calling and synchronizing single coprocessor micro instructions, e.g. multiply-accumulate, the RISC calls and synchronizes low-level algorithms that consist of a set of micro instructions.

Figure 2 shows the structure of the proposed communication scheme, which includes a Dynamic Resource Scheduler for converting medium-level function calls (MLA) into a set of low-level function calls (LLA). The RISC transfers MLA function calls to the Dynamic Resource Scheduler. The Scheduler creates a list of LLA function calls and forwards them to the associated Processing Elements. Different Processing Elements can work in parallel if no data and resource conflicts occur. The Processing Elements send interrupt requests to the Scheduler after finishing their computation. After an MLA function call has been processed, the Scheduler signals the application through an interrupt request.

Fig. 2. RISC/coprocessor communication approach with Dynamic Resource Scheduler
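To make the hierarchical control flow concrete, the following C sketch illustrates how a Dynamic Resource Scheduler could expand one MLA call into a list of LLA calls and dispatch them to idle Processing Elements. It is a minimal software sketch under assumed interfaces; all type and function names (mla_call_t, pe_start, pe_wait, the per-block address offsets) are illustrative and not part of the described hardware.

#include <stdint.h>

typedef struct { uint32_t src_addr, ref_addr, dst_addr, num_blocks; } mla_call_t;
typedef struct { uint32_t src, ref, dst; } lla_call_t;
typedef struct { volatile int busy; } pe_t;

#define NUM_PES 2
static pe_t pes[NUM_PES];

/* Issue one LLA call to a processing element (would write PE registers via the MIB). */
static void pe_start(pe_t *pe, const lla_call_t *call) { pe->busy = 1; (void)call; }

/* Block until the PE raises its completion interrupt (modeled here as clearing busy). */
static void pe_wait(pe_t *pe) { pe->busy = 0; }

/* Expand an MLA call into per-block LLA calls and dispatch them round-robin. */
void scheduler_run(const mla_call_t *mla)
{
    for (uint32_t b = 0; b < mla->num_blocks; b++) {
        int p = (int)(b % NUM_PES);
        if (pes[p].busy)
            pe_wait(&pes[p]);                      /* wait for the PE interrupt */
        lla_call_t lla = {
            .src = mla->src_addr,
            .ref = mla->ref_addr + b * 0x2000u,    /* per-block offset, assumed */
            .dst = mla->dst_addr + b * 4u
        };
        pe_start(&pes[p], &lla);
    }
    for (int p = 0; p < NUM_PES; p++)              /* drain outstanding LLA calls */
        if (pes[p].busy) pe_wait(&pes[p]);
    /* the MLA call is now complete; signal the application on the RISC by interrupt */
}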

2.2 Architecture Overview

The coprocessor carries out the LLA computations. The proposed modular approach currently supports several dedicated Processing Elements executing different LLAs [11]. These autonomous working units are compact and have high potential for optimization. Replacement and extension of PEs demand low development effort. Synchronization of PEs is managed by the Dynamic Resource Scheduler instead of using semaphores [13]. The Dynamic Resource Scheduler can be implemented in software if saving of hardware resources is intended, or in hardware if reduction of the communication cost between RISC and coprocessor is highly important. For the current design exploration phase, a software approach is used.

The resulting coprocessor architecture is shown in Figure 3. Function calls and data transfers are performed via a central system bus. A Control Interface connects the RISC to the system bus of the coprocessor to allow access to all resources. Memory modules with configurable data and address widths have been implemented for the proposed system bus; the widths are set prior to logic synthesis. On-chip memories can be used to reduce data transfer latencies. Additionally, external memories like DDR-SDRAM can be accessed through external memory interface modules. A DMA Unit can be integrated if an application requires large amounts of data to be transferred between different coprocessor memories. The DMA Unit is controlled by the Dynamic Resource Scheduler.

In this work, the RISC has been supplemented by a Host PC, which is attached to an FPGA-based emulation system. This allows HW/SW co-emulation during initial phases of application development to evaluate HLAs, LLAs, and bus communication in detail.

Fig. 3. Modular coprocessor architecture
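As an illustration of how the RISC or Host PC could reach coprocessor resources through the Control Interface, the following C fragment sketches memory-mapped accesses over the system bus, here used to place the test image into coprocessor internal memory and to poll a status register. The base address and register offsets are assumptions for illustration only, not the actual memory map.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory map of the coprocessor behind the Control Interface. */
#define COPROC_BASE        0x40000000u
#define COPROC_INT_MEM     (COPROC_BASE + 0x00010000u)  /* internal memory window   */
#define COPROC_SCHED_STAT  (COPROC_BASE + 0x00000004u)  /* scheduler status register */

static inline void mib_write32(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;         /* one MIB write transfer */
}

static inline uint32_t mib_read32(uint32_t addr)
{
    return *(volatile uint32_t *)(uintptr_t)addr;        /* one MIB read transfer */
}

/* Copy a 64x64 pixel test image (16 bit per pixel) into coprocessor internal
   memory, so later SSD calls do not have to fetch it from external memory. */
void upload_test_image(const uint16_t img[64 * 64])
{
    for (size_t i = 0; i < 64 * 64; i += 2) {
        uint32_t packed = (uint32_t)img[i] | ((uint32_t)img[i + 1] << 16);
        mib_write32(COPROC_INT_MEM + (uint32_t)(i * 2), packed);
    }
}

/* Poll the scheduler status register, e.g. to check for pending interrupts. */
uint32_t scheduler_status(void)
{
    return mib_read32(COPROC_SCHED_STAT);
}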

2.3 Module Interconnect Bus (MIB)

A key component in any System-on-Chip (SoC) design is the interconnection structure, which is used for inter-module communication. The bus architecture is the most popular integration choice for SoC designs today. The main advantages of buses are flexibility and extensibility [14]. Commercial bus systems allow high-speed communication between different units. Common SoC busses like the AMBA AHB [7] and the Processor Local Bus (PLB) [15] are powerful, but both have multi-state protocols, which result in complex development and integration of new bus modules. Moreover, the majority of applications only requires a small subset of the specified bus features [16]. If full compatibility to the bus protocol is needed, hardware overhead is unavoidable. Therefore, these commercial bus systems are not suitable for the modular coprocessor architecture approach. Instead, a small, flexible, and powerful bus system called Module Interconnect Bus (MIB) was developed. It allows rapid development of new bus components.

Figure 4 shows the structure of the Module Interconnect Bus. All transfers are initiated by master modules, while slave modules receive transfer requests. Control and data flow is managed by two bus arbiters. Two temporally independent sub-busses are used for data transmission. Read requests and write operations are transmitted through a Request/Write Bus. Data read operations are sent over a Read Bus. Both sub-busses allow multiple layers to provide independent parallel transfers. This decoupling allows the integration of pipeline stages in both sub-busses, which is very suitable for complex SoCs running at high clock frequencies.

The communication protocol of the MIB is based upon synchronous transmission with a double handshake mechanism. Timing conditions of the bus protocol are simple. Valid bus transfers occur at every rising clock edge if the sending module asserts a valid signal and the receiving module responds with an accept signal. A slave may induce an arbitrary delay to a read operation as long as the correct sequential order of responses is sustained. A Reorder Scheduler on the Read Bus is used to keep in-order data delivery from slaves with different latencies.

Fig. 4. Module Interconnect Bus architecture
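To illustrate the valid/accept double handshake, the following C snippet models one clock cycle of a single MIB point-to-point connection. It is a behavioral sketch for clarity, not the actual RTL, and all identifiers are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of one MIB connection between a sender and a receiver. */
typedef struct {
    bool     valid;   /* driven by the sending module   */
    bool     accept;  /* driven by the receiving module */
    uint32_t data;    /* request or read data word      */
} mib_channel_t;

/* Evaluate one rising clock edge: a word is transferred only when the sender
   asserts valid and the receiver responds with accept in the same cycle.
   Returns true if a transfer took place. */
bool mib_clock_edge(mib_channel_t *ch, uint32_t *dst)
{
    if (ch->valid && ch->accept) {
        *dst = ch->data;      /* word is consumed by the receiver */
        return true;
    }
    return false;             /* sender must hold valid and data until accepted */
}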

3 Processing Element Design

3.1 PE Example Application

An exemplary processing element for the implementation of an image classification algorithm based on a support vector machine (SVM) [17] is described in order to demonstrate the modular coprocessor architecture. The purpose of this algorithm is to classify a test image to a given set of classes. Input images of 64x64 pixels with 16 bit fix-point values per pixel are used, and a total of 2520 reference images are available. A main processing task of the algorithm is to compare the test image x with all reference images yj.

The analysis of the algorithm has shown that most of the computation time is needed for calculating the sum of square differences function

    ssd_j = \sum_i (x_i - y_{j,i})^2 .

Figure 5 shows pseudo assembler code of the SSD function for an unoptimized RISC core. After initializing the loop counters, the core operations are executed in lines 5-8. Assuming one cycle per operation, this pseudo code yields four cycles per loop. Considering loop unrolling, only every fourth branch operation is counted. For the given example of 2520 reference images with 64x64 pixels, the code would take roughly 47M cycles to finish.

 1: MOV Rj, #2520
 2: ssdj_loop:
 3: MOV Ri, #4096
 4: ssdi_loop:
 5: LD  Rx, (Ax+)   |
 6: LD  Ry, (Ay+)   |  x4
 7: SUB Rx, Rx, Ry  |
 8: MAC Ra, Rx, Rx  |
    ...
21: SUB Ri, Ri, #4
22: BNZ ssdi_loop
23: ST  (Ar+), Ra
24: DEC Rj
25: BNZ ssdj_loop

Fig. 5. Pseudo RISC code of SSD
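For reference, a plain C version of the same computation is given below. It mirrors the scalar RISC loop of Figure 5 (two loads, one subtraction, one multiply-accumulate per pixel) and is a sketch for clarity rather than code from the original design flow.

#include <stdint.h>
#include <stddef.h>

#define PIXELS      4096u   /* 64 x 64 pixels per image   */
#define REF_IMAGES  2520u   /* number of reference images */

/* Scalar sum of square differences: ssd[j] = sum_i (x[i] - y[j*PIXELS + i])^2.
   The inner loop corresponds to lines 5-8 of the pseudo RISC code. */
void ssd_all(const int16_t *x, const int16_t *y, uint64_t *ssd)
{
    for (size_t j = 0; j < REF_IMAGES; j++) {
        uint64_t acc = 0;
        const int16_t *yj = y + j * PIXELS;
        for (size_t i = 0; i < PIXELS; i++) {
            int32_t d = (int32_t)x[i] - (int32_t)yj[i];   /* LD, LD, SUB */
            acc += (uint64_t)((int64_t)d * d);            /* MAC         */
        }
        ssd[j] = acc;
    }
}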

3.2 SSD Data Path Architecture

Processing Elements carry out the LLA computations in the coprocessor. Figure 6 shows a generic architecture of an autonomous PE. It comprises an MIB Slave Interface, a Control Unit, an MIB Master Interface for accessing external data, and a Data Path for performing computations. An Internal Memory can be integrated into the PE when needed.

Performing a computation task requires that the Dynamic Resource Scheduler first transfers function calls to the processing element via the MIB Slave Interface. A function call comprises data memory addresses and defines function-specific parameters. Afterwards the PE starts processing. Source data is taken from external memories via the MIB or directly from internal memories if available. After finishing its computations, the PE sends an interrupt request to the Dynamic Resource Scheduler.

Fig. 6. Generic Architecture of a Processing Element

For the exemplary algorithm, dedicated hardware can reduce the number of clock cycles as follows. The test image x is compared with each reference picture yj. Data parallelism is exploited in order to increase the computation performance. The level of maximal concurrency is limited by the data bus width of both the external memory and the system bus. Figure 7 shows the architecture approach. For this 64 bit example, four 16 bit pixels from a reference image are loaded in parallel. These pixels are subtracted from the corresponding pixels of the test image and squared afterwards. The results are added by a tree of adders and accumulated in the last step. To further increase hardware performance, a pipeline stage is inserted after each operation. Reference image data is loaded via the MIB Master Interface and is processed by the PE as soon as it is available. Loading the test image from external memory is necessary only once if the image fits in internal memory. After computing the whole sum of square differences, it is stored into internal memory.

Fig. 7. Sum of Square Differences architecture, example 64 bit
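A cycle-level C sketch of the 64 bit data path described above is shown next: per data beat, four pixel pairs are subtracted, squared, reduced by an adder tree, and accumulated. Pipeline registers are omitted because they do not change the arithmetic; the function names are illustrative assumptions.

#include <stdint.h>

/* Behavioral model of the 64 bit SSD data path: one call corresponds to one
   data beat processing four 16 bit pixels of x and y_j. */
static uint64_t ssd_accumulator;   /* accumulation register at the end of the path */

void ssd_datapath_reset(void)
{
    ssd_accumulator = 0;
}

void ssd_datapath_beat(const int16_t x[4], const int16_t y[4])
{
    uint32_t sq[4];
    for (int k = 0; k < 4; k++) {                   /* subtract and square stages */
        int32_t d = (int32_t)x[k] - (int32_t)y[k];
        sq[k] = (uint32_t)((int64_t)d * d);
    }
    uint64_t s0 = (uint64_t)sq[0] + sq[1];          /* first adder tree level      */
    uint64_t s1 = (uint64_t)sq[2] + sq[3];
    ssd_accumulator += s0 + s1;                     /* second level and accumulate */
}

uint64_t ssd_datapath_result(void)                  /* ssd_j after 1024 beats (4096 pixels) */
{
    return ssd_accumulator;
}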

4 Verification and Results

For demonstrating the efficiency of the modular coprocessor architecture, the ASIC verification system CHIPit Gold Edition Pro from ProDesign [18] was used. CHIPit is used for emulation only; real embedded processing demands a more area- and energy-efficient platform. Figure 8 shows the system architecture. The CHIPit system comprises two Virtex II Pro XC2VP100-6 FPGAs. Each of them is connected to 256 MB of DDR RAM. A 528 Mbps connection is provided for communication between the Host PC and the FPGAs. User software running on the Host PC can be used for executing high-level algorithms and for controlling the hardware mapped onto the FPGAs.

Fig. 8. CHIPit Gold Edition Pro architecture

The data width of the MIB was adjusted to 128 bit. Using two independent DDR RAMs allows simultaneous data transfers for two Processing Elements. In order to use both DDR RAMs, the coprocessor architecture was partitioned and mapped onto both FPGAs. Table 1 shows the synthesis results for coprocessors of different complexity. The frequency decreases with an increasing number of processing elements due to the more complex place-and-route process for the 128 bit multilayer bus system.

Table 1. Synthesis results after place and route using one XC2VP100-6 FPGA

#SSD PEs   Frequency   Slices   Block RAMs   Multipliers
0          156 MHz      5581       23            0
1          151 MHz      8173       35            8
2          147 MHz      9930       47           16
3          133 MHz     12936       59           24
4          125 MHz     14785       71           32
5          119 MHz     16449       83           40

Computing the SSD application example involves loading all reference data from external memory. Therefore, the maximum processing performance is limited by the available external memory bandwidth [19]. The memory hierarchy of the demonstration system supports memory transfers of 256 bit per cycle, which is equal to SSD processing of sixteen 16 bit pixels per clock cycle. Table 2 shows the processing performance for the SSD algorithm running on a RISC and on a coprocessor containing two PEs, respectively (2520 images x 4096 pixels at 16 pixels per cycle is about 645k cycles). Compared to the RISC, the coprocessor needs only 1/72 of the RISC clock frequency to achieve the same performance. According to Amdahl's law, the speedup for the whole application is approximately 5 if 20% of the high-level computations remain on the RISC. If several irregular low-level algorithms like image warping must be processed by different PEs, a fully programmable solution might require less area.

Table 2. Performance for 2520 SSD computations with 4096 pixels (16 bit) per image

Platform              Cycles   Pixels / cycle
RISC                  47M      0.222
2x 128 bit SSD PEs    645k     16

5 Conclusion

In this paper, a modular coprocessor platform is presented, which is easy to extend or modify to support a large range of applications. The architecture approach is optimized for the integration of dedicated application-specific processing elements. It allows adding support for new applications without re-implementing all modules from scratch and can be used as a framework for dedicated hardware architectures. The architecture reaches feasible speed even on FPGAs, with frequencies of up to 150 MHz on a Xilinx Virtex II-Pro. Currently, the coprocessor is only accessible by a Host PC. Future work will focus on interfacing an external RISC with the coprocessor architecture. To show the capabilities of the embedded architecture, more complex algorithms and processing engines will be implemented.

References

1. Lee, R.B.: Multimedia extensions for general-purpose processors. In: IEEE Workshop on Signal Processing Systems SiPS97 Design and Implementation, Proceedings, pp. 9–23. IEEE Computer Society Press, Los Alamitos (1997)
2. Talla, D., John, L., Burger, D.: Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers 52, 1015–1031 (2003)
3. Vejanovski, T., Singh, J., Faulkner, E.: ASIC and DSP implementation of channel filter for 3G wireless TDD system. In: 14th Annual IEEE International ASIC/SOC Conference, Proceedings, pp. 142–145. IEEE Computer Society Press, Los Alamitos (2001)
4. Pirsch, P., Stolberg, H.-J.: VLSI implementations of image and video multimedia processing systems. IEEE Transactions on Circuits and Systems for Video Technology 8, 878–891 (1998)
5. Capperon, S., Kruijtzer, W., Nuñez, A., Jachalsky, J., Wahle, M., Pirsch, P.: A core for ambient and mobile intelligent imaging applications. In: IEEE International Conference on Multimedia & Expo (ICME), CDROM (2003)
6. Paulin, P., Liem, C., Cornero, M., Nacabal, F., Goossens, G.: Embedded software in real-time signal processing systems: application and architecture trends. Proceedings of the IEEE 85, 419–435 (1997)
7. ARM: AMBA specification (rev. 2.0) (1999)
8. Xilinx: Xilinx website, http://www.xilinx.com
9. Altera: Altera website, http://www.altera.com
10. Xilinx: Virtex-II Pro and Virtex-II Pro X platform FPGAs: Complete data sheet (2005)
11. Stechele, W., Herrmann, S.: Reconfigurable hardware acceleration for video-based driver assistance. In: Workshop on Hardware for Visual Computing, Tübingen (2005)
12. Hinz, S., Gehrke, W., Jachalsky, J., Wahle, M., Pirsch, P.: A coprocessor for intelligent image and video processing in the automotive and mobile communication domain. In: IEEE International Symposium on Consumer Electronics, Proceedings, pp. 332–335. IEEE Computer Society Press, Los Alamitos (2004)
13. Dejnožková, E., Dokládal, P.: Embedded real-time architecture for level-set-based active contours. EURASIP Journal on Applied Signal Processing 2005, 2788–2803 (2005)
14. Bergmann, N.: On-chip communication architectures for reconfigurable system-on-chip. In: IEEE International Conference on Field-Programmable Technology, Proceedings, pp. 47–51. IEEE Computer Society Press, Los Alamitos (2003)
15. IBM: 64-bit processor local bus architecture specifications, version 3.5 (2001)
16. Cyr, G., Bois, G., Aboulhamid, M.: Generation of processor interface for SoC using standard communication protocol. IEE Proceedings - Computers and Digital Techniques 151, 367–376 (2004)
17. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
18. ProDesign: CHIPit Gold Edition Pro, http://www.uchipit.com
19. Ding, C., Kennedy, K.: The memory bandwidth bottleneck and its amelioration by a compiler. In: 14th International Symposium on Parallel and Distributed Processing (IPDPS), Proceedings, p. 181. IEEE Computer Society, Washington, DC, USA (2000)