2012 International Conference on Computer Science and Electronics Engineering

Study of High Speed Parallel Image Processing Machine
Institute of Optics and Electronics, Chinese Academy of Science, Chengdu 610209, China
Graduate School of Chinese Academy of Science, Beijing 100039, China
e-mail: taoleiyan@yahoo.com.cn

ZHOU Jin
Institute of Optics and Electronics, Chinese Academy of Science, Chengdu 610209, China
e-mail: zhoujin@163.com
Abstract—As high-resolution, large-field sensors come into use, the real-time performance of the target tracking system becomes more and more important. A single-mode algorithm cannot track the target stably because the target changes its characteristics from time to time. To solve these two problems, a multi-processor, parallel, high-speed image processing machine is presented. The key methods for designing the parallel image processing system are analyzed and the optimization techniques for the DSPs are discussed. The FPGAs are used for detecting targets over the whole field of view and the DSPs are used for target tracking. The machine can automatically detect and track targets in real time. The experimental results indicate that the system can adapt to various environments and to various targets with different characteristics.

Keywords-multi-processors; real-time image processing; parallel processing; software optimization

Institute of Optics and Electronics, Chinese Academy of Science, Chengdu 610209, China
e-mail: jpnl@yahoo.com.cn

WU Qin-zhang
Institute of Optics and Electronics, Chinese Academy of Science, Chengdu 610209, China
e-mail: wqz@163.com

I. INTRODUCTION

Real-time image processing involves processing vast amounts of image data in a timely manner for the purpose of extracting useful target information. Digital images are essentially multidimensional signals and are thus quite data-intensive, requiring a significant amount of computation and memory resources for processing [1]. In the optoelectronic tracking system, real-time performance becomes more and more important as high-resolution, large-field sensors come into use. Because multiple targets must be tracked in many different scenes, the traditional single-mode tracking algorithm is no longer adequate, and a multi-mode tracking algorithm was put forward. In this multi-mode tracking system, more than one algorithm runs in parallel to detect and track different kinds of objects, such as small, mass, extended, slow, and fast moving targets. A traditional image processing machine based on a single processor therefore cannot accomplish the task in time. To solve this problem, a high-speed image processing machine based on multiple processors is presented. In this new image processing platform, four high-performance DSPs and four high-performance FPGAs are used, DDRII memory is exploited, and SRIO (Serial RapidIO) connects the processors for high-speed data transmission. The FPGAs run the low-level image processing algorithms and the DSPs run the high-level algorithms of target association and tracking.

II.
978-0-7695-4647-6/12 $26.00 © 2012 IEEE DOI 10.1109/ICCSEE.2012.392

A. The request for real-time performance of the system

Because the optoelectronic tracking system demands high precision, image sizes keep growing, and 12 or more bits per pixel may be needed for higher levels of accuracy. The amount of data increases further if color is also considered. For example, a typical application with a frame size of 2 Mbytes (1024x1024 pixels at 16 bits of precision) running at 100 fps must process about 200 Mbytes of image data per second. With the trend toward higher resolution and faster frame rates, the amount of data that must be processed in a short time will continue to increase dramatically. The key to coping with this issue is to exploit a higher-performance hardware platform together with parallel image processing algorithms.

B. Parallelism of the image processing algorithm

Much of what goes into implementing an efficient image processing system centers on how well the implementation, both hardware and software, exploits the different forms of parallelism in an algorithm, which can be data level parallelism (DLP) and instruction level parallelism (ILP) [2, 3]. DLP manifests itself in the application of the same operation to different sets of data, while ILP manifests itself in scheduling the simultaneous execution of multiple independent operations in a pipelined fashion. Traditionally, image processing operations have been classified into three main levels, namely low-level, intermediate-level, and high-level, where each successive level differs in its input/output data relationship [4, 5]. Low-level operators take an image as their input and produce an image as their output.
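As a minimal sketch of what data level parallelism means for a low-level operator, the following Python fragment applies one per-pixel threshold to separate row bands of an image and stitches the results together. The function names, the band-splitting scheme, and the choice of a threshold as the operator are illustrative assumptions and are not taken from the machine described in this paper. In a real system each band would be handed to a different processing element; here the bands are processed one after another only to show that the result does not depend on how the data is partitioned.

```python
def threshold_rows(rows, level):
    """Per-pixel operation: each output pixel depends only on the same input pixel."""
    return [[1 if pixel > level else 0 for pixel in row] for row in rows]


def split_into_bands(image, bands):
    """Split an image (a list of rows) into `bands` contiguous row bands."""
    height = len(image)
    bounds = [height * i // bands for i in range(bands + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(bands)]


def threshold_data_parallel(image, level, workers=4):
    """Apply the same operation to every band, then concatenate the bands."""
    result = []
    # Hypothetical partitioning: each band could be sent to a different processor.
    for band in split_into_bands(image, workers):
        result.extend(threshold_rows(band, level))
    return result
```

Because every output pixel depends only on its own input pixel, the banded version returns exactly the same image as a single pass over the whole frame, which is the property that makes such operators easy to spread across several processors.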

Low-level operations can be further classified into point, neighborhood, and global operations. At this level, the operations are data-intensive, with highly structured and predictable processing, requiring a high bandwidth for accessing image data. In general, low-level operations are excellent candidates for exploiting DLP.

Intermediate-level operations transform image data to a slightly more abstract form of information. They include segmenting an image into regions of interest and extracting edges, lines, contours, or other image attributes of interest such as statistical features. The goal of carrying out these operations is to reduce the amount of data to form a set of features suitable for further high-level processing. Some intermediate-level operations are also data-intensive with a regular processing structure, thus making them suitable candidates for exploiting DLP.

High-level operations interpret the abstract data from the intermediate level, performing high-level knowledge-based scene analysis on a reduced amount of data. Such operations include recognition of objects or a control decision based on some extracted features. These types of operations are usually characterized by control-intensive or branch-intensive operations. Thus, they are less data-intensive and more inherently sequential rather than parallel. Due to their irregular structure and low-bandwidth requirements, such operations are suitable candidates for exploiting ILP [6].

III. PARALLEL IMAGE PROCESSING MACHINE BASED ON MULTIPROCESSOR

Three kernel components decide the performance of a real-time image processing system: the processor nodes, the memory, and the network connecting the processors. A deficiency in any one of the three restricts the performance of the whole system. The performance of the CPU has improved so fast that it is no longer a problem. The key to improving the whole system performance is to improve the speed of the memory and the performance of the network connecting the processors.

A. High performance processor node

The image processing machine presented consists of four Virtex5LX50T (V5) FPGAs and four TMS320C6455 (C6455) DSPs, as shown in Figure 1. The C6455 is a high-performance DSP whose main frequency is 1.2 GHz and whose 16-bit fixed-point processing capability is 9600 MMAC/s. DSPs are well known for their high performance and for their relative flexibility in implementing more comprehensive algorithms than FPGAs.

Figure 1. Interconnection of multi-processors

B. High speed system interconnection

The parallel processing hardware platform can make full use of its advantages only when the parallel structure is reasonable, so that the speedup ratio approaches N (N is the total number of processors used in the system). The performance of the parallel system is realized through the network connecting the processors. The network can be divided into two types. In the first type, the processors share a bus or share memory, which is called a close coupling parallel system. In the second type, every processor has its own data memory and the processors are linked through a communication interface; it is called a loose coupling parallel system.

In the machine presented, the interconnection uses a shared PCI bus and SRIO. The PCI bus is 32 bits wide and works at 33 MHz, so its peak data transmission bandwidth is 132 Mbytes/s. Because a DSP occupies the PCI bus exclusively and there is so much bus arbitration, the communication efficiency of the PCI bus alone is too low to satisfy the requirements of the task. SRIO is a high-performance serial interconnection technology for point-to-point data transmission [7]. The SRIO module embedded in the C6455 has four full-duplex ports, and every port supports a data transmission rate of up to 3.125 Gb/s. An SRIO link realizes the point-to-point interconnection between two processors. Fig. 1 shows that every DSP links to another DSP through one 1x channel, so every DSP can access the resources of another DSP on the board. Each FPGA is linked with one DSP through SRIO, so there is a highly efficient data transmission bandwidth between them. Through the PCI bus and SRIO, maximized communication efficiency is realized in the image processing machine presented.

C. High speed DDRII storage

Due to the ever-increasing gap between computation performance and memory performance, DDRII SDRAM is used in the real-time image processing system presented. It interfaces to the 32-bit, 533-MHz (data rate) DDR2 memory controller of the C6455. The maximum data rate achievable by the DDR2 memory controller is 2.1 Gbytes/s.

D. Preprocessing based on FPGA

An FPGA is an array of reconfigurable complex logic blocks with a network of programmable interconnections. FPGAs have flexibility in implementing custom hardware solutions and have been used extensively for realizing low-level image algorithms. They can be programmed to exploit different types of parallelism inherent in an image processing algorithm. This in turn leads to highly efficient real-time image processing for low-level and intermediate-level operations. In the image processing machine presented, four FPGAs are used for low-level filtering and for the intermediate-level operations that produce the labeled targets after segmentation. Fig. 2 shows the whole-view preprocessing framework constructed by the FPGAs. M10, M11, …, M40 represent the data stored in memory, which are the image data after filtering and the line table after target labeling. a1, a2, …, a7 represent the algorithm modules.
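The interface figures quoted in this section for the PCI bus, the SRIO ports, and the DDR2 controller can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, Mbytes are taken as 10^6 bytes, and the 8b/10b line coding applied to the SRIO lane is an assumption about the serial physical layer that the paper does not state.

```python
def parallel_bus_mbytes_per_s(width_bits, transfers_mhz):
    """Peak rate of a parallel interface: bytes per transfer times transfers per second."""
    return width_bits / 8 * transfers_mhz


def frame_stream_mbytes_per_s(width, height, bits_per_pixel, fps):
    """Raw pixel data rate of a video stream in 10^6 bytes per second."""
    return width * height * bits_per_pixel / 8 * fps / 1e6


def serial_lane_payload_mbytes_per_s(line_rate_mbaud):
    """Payload rate of one serial lane, assuming 8b/10b coding (8 data bits per 10 line bits)."""
    return line_rate_mbaud * 8 / 10 / 8


pci = parallel_bus_mbytes_per_s(32, 33)                      # shared PCI bus
ddr2 = parallel_bus_mbytes_per_s(32, 533)                    # DDR2 controller at 533 MT/s
camera = frame_stream_mbytes_per_s(1024, 1024, 16, 100)      # 1024x1024, 16 bit, 100 fps
srio_lane = serial_lane_payload_mbytes_per_s(3125)           # one 1x lane at 3.125 Gbaud
```

Under these assumptions the PCI bus peaks at 132 Mbytes/s, which is below the roughly 210 Mbytes/s of a 1024x1024, 16-bit, 100 fps stream, while a single SRIO lane (about 312 Mbytes/s of payload) and the DDR2 interface (about 2.1 Gbytes/s) both exceed it. This is consistent with the choice of SRIO rather than PCI for moving image data.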

In Fig. 2, every FPGA constructs one channel, and every channel is used for detecting one kind of target over the whole field of view. The algorithm modules realized in the FPGA are a fast histogram statistics module, an adaptive contrast enhancement module, a 5x5 median filter, a 9x9 cross high-pass filter, a max-median filter in all directions, an edge detecting module, and a target labeling module. Furthermore, several peripheral interface modules for image processing, such as the Camera Link interface, the SDRAM controller, and the image store module, are also implemented. Their physical implementation runs at 192 MHz, with a processing ability of more than 120 fps for 1024x1024 16-bit gray images. In these four channels, four object detecting algorithms are realized: a small target detecting algorithm, a mass target detecting algorithm, a high contrast target detecting algorithm, and a three-frame difference detecting algorithm. These four algorithms run in the system at the same time and can detect four kinds of targets with different features. The features of each extracted candidate target are calculated and recognized, and the noise and false targets are thus discarded.

Figure 2. The frame of whole view of sight processing

E. High speed parallel processing based on DSPs

There are two levels of parallelism in high-speed image processing based on DSPs: high-level parallelism and low-level parallelism. High-level parallelism means that more than one DSP runs at the same time in the system. Among these DSPs, one serves as the master handling system control, running an RTOS for complex control-intensive operations. The other DSPs act as slaves to the main DSP and mainly perform computationally intensive data processing operations.

Low-level parallelism rests on three major architectural features that are essential to an image processing system, namely single instruction multiple data (SIMD), very long instruction word (VLIW), and an efficient memory subsystem. SIMD means that the processor executes the same instruction simultaneously on different portions of data in parallel, thus allowing more computations to be performed in a shorter time [8]. While SIMD can be exploited for data level parallelism, VLIW can be used for exploiting instruction level parallelism to speed up high-level operations. It furnishes the ability to execute multiple instructions within one processor clock cycle, all running in parallel, allowing software-oriented pipelining of instructions by the programmer. While SIMD and VLIW can help speed up the processing of diverse image operations, the time saved through such mechanisms would be completely wasted if there did not exist an efficient way to transfer data throughout the system. So direct memory access (DMA) is introduced. DMA is a well-known tool for hiding memory access latencies, especially for image data.

In the parallel image processing machine presented in this paper, four DSPs are used, with different algorithms running on them and DSP4 acting as the main DSP. Fig. 3 shows the task partition for these DSPs. The multi-mode image tracking algorithm has been realized for extracting different types of targets.

Figure 3. The task partition for multi DSPs

Many techniques can be used to optimize the time-costly portions of the code in order to bring the execution time within an acceptable real-time range. The following are some important optimization methods frequently used in image processing platforms based on DSPs.

1) Algorithm level optimization
To achieve real-time performance on a given hardware platform, the algorithms often have to be optimized. In general, greater gains in performance are obtained through optimization at the algorithmic level. Modifications performed at the algorithmic level help to streamline the algorithm down to its core functionality, which leads not only to a lower computational complexity but also to faster execution.

2) Compiler optimization options
The compiler provided by TI is equipped with an automatic software optimizer. After the code has been profiled, special compiler optimization options should be applied to those functions that will experience an increase in performance.

3) Substituting fixed-point for floating-point arithmetic
More often than not, floating-point computations pose a major performance bottleneck in real-time implementations, especially on fixed-point processors. Fixed-point calculations are usually faster to perform than floating-point calculations on a fixed-point processor.

4) Substituting in-line code for subroutines
A subroutine incurs overhead when called, since variables have to be pushed onto a stack in order to be popped back upon return. The function calls should therefore be replaced with in-line code.
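The substitution of fixed-point for floating-point arithmetic can be illustrated with a small sketch. It assumes the common Q15 format (a signed 16-bit integer holding a fraction scaled by 2^15) and a hypothetical gain operation on pixel values; the paper does not state which number format or which routines were converted, so every name below is illustrative. Python is used only to show the integer arithmetic, which on a DSP would map to a multiply, an add, and a shift.

```python
Q15_SHIFT = 15
Q15_ONE = 1 << Q15_SHIFT  # 32768 represents the value 1.0


def to_q15(value):
    """Quantize a real coefficient in [-1, 1) to a signed 16-bit Q15 integer."""
    scaled = int(round(value * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, scaled))


def scale_pixel_q15(pixel, gain_q15):
    """Multiply an integer pixel by a Q15 gain using only integer operations."""
    # Adding half of the divisor before the shift rounds to nearest.
    return (pixel * gain_q15 + (1 << (Q15_SHIFT - 1))) >> Q15_SHIFT


def scale_image_q15(pixels, gain):
    """Apply a fractional gain to a list of pixels without floating-point math per pixel."""
    gain_q15 = to_q15(gain)  # converted once, outside the loop
    return [scale_pixel_q15(p, gain_q15) for p in pixels]
```

For example, a gain of 0.25 becomes the integer 8192, and scaling the pixel value 1000 yields 250. The price of the substitution is a small quantization error in the coefficient, so the result can differ from the floating-point result by about one gray level.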

5) Packed data processing
The C6455 incorporates packed data processing functionality to enhance the performance of processing data that is narrower than the data path. Because most modern data paths are 32 bits wide, it is possible to process four 8-bit or two 16-bit pixels simultaneously, leading to the processing of more image data in one clock cycle.

6) Loop transformations
It is often good practice to first apply loop transformations in the high-level language before applying assembly optimizations. These include removing any unnecessary calculations within the loop itself, replacing function calls with in-line code, and loop unrolling. Loop unrolling can be used to increase the ILP of the loop. It allows multiple iterations to be performed in one pass, grouping more computations together for simultaneous access to more data, thus reducing the loop overhead due to branching and helping the compiler to create a more efficient schedule for the main loop body. After loop unrolling, the loop is software-pipelined by the compiler, which improves the performance of the loop distinctly.

Table I shows some results for the software modules after optimization based on the techniques introduced above. The size of the region tested is 256x256 and one pixel is 16 bits.

TABLE I. THE TIME COST OF SOFTWARE MODULES BEFORE AND AFTER OPTIMIZATION IN THE PARALLEL IMAGE PROCESSING MACHINE

Ord  Name of software module          Before opt  After opt
1    Binary morphologic filtering
2    Gray 5x5 morphologic filtering
3    5x5 average filtering
4    3x3 Sobel edge filtering
5    Binary statistical filtering
6    Contrast linear enhancement

IV. RESULT OF PARALLEL IMAGE PROCESSING MACHINE

Fig. 4 and Fig. 5 show the robust tracking results from the high-speed parallel image processing machine presented. The processed frames range in size from 320x256 to 1024x1024, and one pixel of the image is 16 bits.

Figure 4. Frames from a car tracking sequence on a complex ground background.

From Fig. 4, we can see that although the background is complex, the tracking is stable all the time. Due to the multi-mode tracking algorithm, it was not influenced by the feature changes of the target or by the complex background.

Figure 5. Complex background, target size and illumination changed

Fig. 5 shows the tracking result for a model aircraft. It is being tracked automatically by our parallel tracking system. In this image sequence, the size of the object and the illumination were changing all the time. The cloud and the building in the background were also challenges. Under this situation, the three-frame difference algorithm was scheduled automatically in our image processing machine. If there were not multiple algorithms in the system, the tracking would fail quickly.

V. CONCLUSION

The high-speed parallel image processing machine presented runs at a speed of 100 fps. Four kinds of algorithms for target detecting based on FPGAs and four kinds of algorithms for target tracking based on DSPs have been realized on the multi-mode tracking system. The experimental results indicate that the system can adapt to various environments and to various targets with different characteristics, and that the time cost is acceptable.

REFERENCES

A. Bovik, “Introduction to Digital Image and Video Processing,” in Handbook of Image & Video Processing, A. Bovik, Ed. Amsterdam: Elsevier Academic Press, 2005.
A. Downton and D. Crookes, “Parallel Architectures for Image Processing,” Electronics & Communication Engineering Journal, Vol. 10, No. 3, pp. 139–151, June 1998.
H. Broers, W. Caarls, P. Jonker, and R. Kleihorst, “Architecture Study for Smart Cameras,” Proceedings of the European Optical Society Conference on Industrial Imaging and Machine Vision, pp. 39–49, June 2005.
H. Hunter and J. Moreno, “A New Look at Exploiting Data Parallelism in Embedded Systems,” Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pp. 159–169, October/November 2003.
K. Dong, M. Hu, Z. Ji, and B. Fang, “Research on Architectures for High Performance Image Processing,” Proceedings of the Fourth International Workshop on Advanced Parallel Processing Technologies, September 2001.
RapidIO Trade Association, RapidIO Interconnect Specification Rev. 1.2 [OL], http://www.rapidio.org, 2002-10-14.