
Parallel Implementation of Compressed Sensing Algorithm on CUDA- GPU

Kuldeep Yadav1, Ankush Mittal2
Computer Science and Engineering, College of Engineering Roorkee, Roorkee-247667, INDIA
Kul82_deep@rediffmail.com1, dr.ankush.mittal@gmail.com2

M. A. Ansari3, Avi Srivastava4
Galgotia College of Engineering, Gr. Noida, INDIA
ma.ansari@ieee.org3, aviarodonix12@yahoo.com4

Abstract - In the field of biomedical imaging, compressed sensing (CS) plays an important role: it is a joint compression and sensing process, an emerging field of activity in which the signal is sampled and simultaneously compressed at a greatly reduced rate. Compressed sensing is a new paradigm for signal, image, and function acquisition. In this paper we work with the basis pursuit algorithm for compressed sensing. We measured the running time of this algorithm on a CPU, an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP. The next step was to port this code to the GPU, an NVIDIA GeForce 8400M GS with 256 MB of DDR2 video memory and a 64-bit bus width; the graphics driver used is NVIDIA's 197.15 series. Both the CPU and GPU versions of the algorithm are implemented in Matlab R2008b. The CPU version runs in plain Matlab, while the GPU version is implemented with the help of the intermediate software Jacket v1.3. To use Jacket, we made some changes to our source code so that the CPU and GPU work simultaneously, thus reducing the overall computation time of the algorithm. Graphics Processing Units (GPUs) are emerging as powerful parallel systems at a cheap cost of a few thousand rupees. We obtained a speedup of around 8X for most of the biomedical images; six of them are included in this paper, and the results can be analyzed via the profiler.

Keywords — Compressed sensing, basis pursuit algorithm, Jacket v1.3, GPU, medical image processing, high performance computing.

I. INTRODUCTION

Recent papers [1, 2, 3, 4] have introduced the concept known as compressed sensing (among other related terms). The basic principle is that sparse or compressible signals can be reconstructed from a surprisingly small number of linear measurements, provided that the measurements satisfy an incoherence property. Such measurements can then be regarded as a compression of the original signal, which can be recovered if it is sufficiently compressible. A few of the many potential applications are medical image reconstruction [5], image acquisition [6], and sensor networks [7].

The first algorithm presented in this context is known as basis pursuit [8]. Let Φ be an M × N measurement matrix, and Φx = b the vector of M measurements of an N-dimensional signal x. The reconstructed signal u* is the minimizer of the L1 norm subject to the data:

min ||u||_1 subject to Φu = b.

A remarkable result of Candes and Tao [9] is that if, for example, the rows of Φ are randomly chosen, Gaussian distributed vectors, there is a constant C such that if the support of x has size K and M ≥ CK log(N/K), then the solution will be exactly u* = x with overwhelming probability. The required C depends on the desired probability of success, which in any case tends to one as N → ∞. Donoho and Tanner [10] have computed sharp reconstruction thresholds for Gaussian measurements, so that for any choice of sparsity K and signal size N, the required number of measurements M to recover x can be determined precisely.

In this study, we implemented the basis pursuit algorithm on NVIDIA's GeForce 8400 GPU with the Compute Unified Device Architecture (CUDA) programming environment. We hope that other GPGPU researchers in the field will make the same choice, so as to standardize performance comparisons.
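The basis pursuit problem above reduces to a standard linear program, which makes a small end-to-end sketch easy to write. The following Python/NumPy example is illustrative only (the paper's implementation uses Matlab with Jacket); it recovers a K-sparse signal from M Gaussian measurements by solving min ||u||_1 subject to Φu = b with SciPy's LP solver, using the usual split u = p - q with p, q ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, b):
    """Solve min ||u||_1 subject to Phi @ u = b as a linear program.

    Split u = p - q with p, q >= 0, so that ||u||_1 = sum(p + q)."""
    M, N = Phi.shape
    c = np.ones(2 * N)                        # objective: sum(p) + sum(q)
    A_eq = np.hstack([Phi, -Phi])             # Phi @ (p - q) = b
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(0, None)] * (2 * N), method="highs")
    p, q = res.x[:N], res.x[N:]
    return p - q

rng = np.random.default_rng(0)
N, K = 128, 5                                 # signal length and sparsity
M = 40                                        # measurements, comfortably above C*K*log(N/K)
x = np.zeros(N)
x[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N))             # Gaussian measurement matrix
b = Phi @ x                                   # the M linear measurements
u_star = basis_pursuit(Phi, b)                # exact recovery w.h.p. per Candes-Tao
```

With M = 40 measurements of a 5-sparse signal of length 128, well above the CK log(N/K) threshold, recovery is exact up to solver tolerance.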


http://sites.google.com/site/ijcsis/ ISSN 1947-5500

II. BACKGROUND

In this section we discuss Jacket v1.3, which enables the use of graphics processors for general-purpose computing. Over the past few years, specialized coprocessors, from floating point hardware to field programmable gate arrays, have enjoyed a widening performance gap with traditional x86-based processors. Of these, graphics processing units (GPUs) have advanced at an astonishing rate, currently capable of delivering over 1 TFLOPS of single precision performance and over 300 GFLOPS of double precision while executing up to 240 simultaneous threads in one low-cost package. As such, GPUs have gained significant popularity as powerful tools for high performance computing (HPC), achieving 20-100 times the speed of their x86 counterparts in applications such as physics simulation, computer vision, options pricing, sorting, and search.

As with previous research, compressed sensing studies based on GPUs provide fast implementations; however, only a small number of these GPU-based studies concentrate on compressed sensing. Since the GPU we have chosen (NVIDIA 8400M GS) is the most basic model, it is highly portable and easily available in present-day laptops and desktops, so our implementation can be used directly. The most challenging part, which we have parallelized, is synchronizing the host and device with a suitable parallel implementation. The basis pursuit algorithm cannot be parallelized straightforwardly because of its distribution part, so our solution provides a balanced, data-distributed parallelization framework for basis pursuit on CUDA without compromising numerical precision. We break the process into threads of blocks and manage those threads inside special thread-managing hardware, the GPU, with the help of the environment and set of libraries provided by CUDA.

A. Jacket Overview

Jacket connects Matlab to the GPU. Matlab is a technical computing language that integrates computation, visualization, and programming in an easy-to-use environment, and it has found wide popularity in both industry and academia. It is used across the breadth of technical computing applications, including mathematical computation, algorithm development, data analysis, data visualization, and application development. With the GPU as a backend computation engine, Jacket brings together the best of three important computational worlds: computational speed, visualization, and the user-friendliness of M programming. Jacket enables developers to write and run code on the GPU in the native M language used in Matlab. It accomplishes this by automatically wrapping the M language into a GPU-compatible form: by simply casting input data to Jacket's GPU data structure, Matlab functions are transformed into GPU functions. Jacket also preserves the interpretive nature of the M language by providing real-time, transparent access to the GPU compiler.

B. Integration with Matlab

Once Jacket is installed, it is transparently integrated with Matlab's user interface, and the user can start working interactively through the Matlab desktop and command window, as well as write M-functions using the Matlab editor and debugger. All Jacket data is visible in the Matlab workspace, along with any other Matlab matrices.

C. GPU Data Types

Jacket provides GPU counterparts to MATLAB's CPU data types, such as real and complex double, single, uint32, int32, logical, etc. Any variable residing in host (CPU) memory can be cast to Jacket's GPU data types. Jacket's memory management system allocates and manages memory for these variables on the GPU automatically, behind the scenes. Any function called on GPU data will execute on the GPU automatically, without any extra programming.

D. GPU Functions and Tooling

- GPU functions: Jacket provides the largest available set of GPU functions, ranging from functions like sum, sine, cosine, and complex arithmetic to more sophisticated functions like matrix inverse, singular value decomposition, Bessel functions, and Fast Fourier Transforms. The supported set of functions continues to grow with every release of Jacket (see the Function Reference Guide).
- Runtime: the Jacket runtime is the most advanced GPU runtime available, providing automated memory management, compile-on-the-fly, and execution optimizations for Jacket-enabled code.
- Graphics Toolbox: Jacket's Graphics Toolbox enables a merger of GPU visualizations with computation. With Jacket, a simple graphics command can be added at the end of a simulation loop to visualize data as it is being computed, while maintaining performance.
- Developer SDK: the SDK makes integration of custom CUDA code into Jacket's runtime very easy. With a few simple SDK functions, custom CUDA code can benefit from the optimized Jacket platform.
- Jacket MATLAB Compiler (JMC): when Jacket applications have completed the development, test, and optimization stages and are ready for deployment, JMC allows users to generate license-free executables for distribution to larger user bases (see the SDK and JMC Wiki pages).
- Interactive help for any Jacket function is available through Jacket's ghelp function.

III. IMPLEMENTATION

In this section, we first introduce the general scheme of the basis pursuit algorithm. Then, we introduce our GPU implementation environment by first discussing why GPUs are a good fit for medical imaging applications and then presenting NVIDIA's CUDA platform and the GeForce 8400M architecture. Next, we describe the CPU implementation environment. This is followed by a description of the test data used in the experiments. Finally, we list the CUDA kernels used in our GPU implementation.

/* Pseudocode of the Basis Pursuit Algorithm */
1. Load a real-time RGB image.
2. Assign it to a double-precision matrix.
3. Repeat
4.   For each row of blocks, using CUDA threads:
       decompose column-wise by the basis pursuit algorithm;
       find the decomposed matrix.
     End for.
5. Until all three color blocks are decomposed.
6. Repeat
7.   For each row of blocks, in parallel using CUDA threads:
       reconstruct column-wise by the basis pursuit algorithm
       (considering the complex matrix as well);
       find the reconstructed matrix.
     End for.
8. Until all three color blocks are reconstructed.
9. Combine all three color blocks to give one matrix.
10. Convert the double-precision compressed image data to unsigned int values.
11. Scale the image data.

A. Basis Pursuit Compression Algorithm

This algorithm takes an image as input and reads it as a matrix. It then decomposes the image into blocks, and the blocks into columns. Each column is processed and compressed; each compressed column is assembled into compressed blocks, and each compressed block is then reconstructed and sampled to obtain the final compressed image. Six functions are used in this algorithm: bp_basis, bp_decompose, bp_construct, bp_block_decompose, bp_block_construct, and imagproc. These functions are collectively used to decompose the image into blocks, compress the image, and reconstruct it by sampling it again. The general scheme of the algorithm is shown in the pseudocode above.
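The per-channel, per-column structure of this scheme is what makes the algorithm amenable to CUDA threads: every column can be processed independently. Below is a minimal Python/NumPy sketch of that structure, illustrative only, with a simple coefficient-thresholding transform standing in for the actual per-column basis pursuit solve (whose internals are not given in the paper):

```python
import numpy as np

def compress_column(col, keep=32):
    """Stand-in for the paper's per-column bp_decompose/bp_construct pair:
    sparsify one column in a fixed basis (here the DFT) and reconstruct it
    from its largest-magnitude coefficients."""
    coef = np.fft.fft(col)
    small = np.argsort(np.abs(coef))[:-keep]  # indices of the smallest coefficients
    coef[small] = 0                           # keep only the `keep` largest
    return np.fft.ifft(coef).real

def process_image(img):
    out = img.astype(np.float64)              # step 2: double-precision matrix
    for c in range(out.shape[2]):             # steps 3-8: each color block...
        for j in range(out.shape[1]):         # ...each column independently
            out[:, j, c] = compress_column(out[:, j, c])  # the CUDA-parallel unit
    out = np.clip(out, 0, 255)                # step 11: scale back into range
    return out.astype(np.uint8)               # step 10: back to unsigned int

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in RGB image
rec = process_image(img)
```

In the GPU version, the inner column loop is what gets mapped onto CUDA threads, one column per thread.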


B. GPU Implementation Environment

We have implemented a GPU version of basis pursuit (BP) with NVIDIA's GPU programming environment, CUDA v0.9. The era of single-threaded processor performance increases has come to an end: programs will only increase in performance if they utilize parallelism. However, there are different kinds of parallelism. For instance, multicore CPUs provide task-level parallelism, while GPUs provide data-level parallelism. Depending on the application area, the preferred type of parallelism changes; hence, GPUs are not a good fit for all problems. Medical imaging applications, however, are very suitable for implementation on GPU architectures, because they intrinsically have data-level parallelism with high compute requirements, and GPUs provide the best cost-per-performance parallel architecture for implementing such algorithms. In addition, most medical imaging applications (e.g. semi-automatic segmentation) require, or benefit from, visual interaction, which GPUs naturally provide.

Since it is cumbersome to use graphics APIs for non-graphics tasks such as medical applications, NVIDIA introduced CUDA to perform data-parallel computations on the GPU without the need to map them onto a graphics API. CUDA hides GPU-specific details and allows the programmer to think in terms of memory and math operations as in CPU programs, instead of the primitives, fragments, and textures that are specific to graphics programs. Hence, the use of the GPU in non-graphics, highly parallel applications such as medical imaging has become much easier than before. CUDA is available for the NVIDIA GeForce 8 (G80) series and beyond. The GeForce 8400M GS model has 256 MB of DDR2 video memory with a 64-bit bus width; by comparison, the memory bandwidth of the GeForce 8800 GTX is 80+ GB/s.

To get the best performance from the G80 architecture, we have to keep its 128 processors occupied and hide memory latency. To achieve this goal, CUDA runs hundreds or thousands of fine, lightweight threads in parallel. In CUDA, programs are expressed as kernels. Kernels have a Single Program Multiple Data (SPMD) programming model, which is essentially a Single Instruction Multiple Data (SIMD) programming model that allows limited divergence in execution. A part of the application that is executed many times, but independently, on different elements of a dataset can be isolated into a kernel that is executed on the GPU in the form of many different threads. Kernels run on a grid, which is an array of blocks; each block is an array of threads. Blocks are mapped to multiprocessors within the G80 architecture, and each thread is mapped to a single processor. Threads within a block can share memory on a multiprocessor, but two threads from two different blocks cannot cooperate. The GPU hardware switches threads on multiprocessors to keep the processors busy and hide memory latency; thus, thousands of threads can be in flight at the same time, and CUDA kernels are executed on all elements of the dataset in parallel. We note that in our implementation, increasing the dataset size does not affect shared memory usage: to deal with larger datasets, we only increase the number of blocks and keep the shared-memory allocation per thread, as well as the number of threads per block, the same.

C. CPU Implementation Environment

The CPU version of basis pursuit is implemented in Matlab with integration of Jacket v1.3. The computer used for the CPU implementation is an Intel® Core™2 Duo T8100 @ 2.1 GHz with 3.0 GB of main memory, running Windows XP (SP2). The CPU implementation was executed on this machine in both single-threaded and multithreaded mode; OpenMP is used to implement the multithreading part.
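The grid/block/thread mapping described above can be emulated sequentially, which is a useful way to check kernel index logic before running on hardware. The following Python sketch (illustrative, not part of the paper's implementation) runs a SAXPY-style kernel over a grid of blocks, assigning elements to threads exactly as a CUDA launch would:

```python
import numpy as np

def saxpy_kernel(thread_ctx, a, x, y, out):
    """Body of a CUDA-style kernel: each thread handles one element,
    identified by the usual blockIdx * blockDim + threadIdx arithmetic."""
    block_idx, block_dim, thread_idx = thread_ctx
    i = block_idx * block_dim + thread_idx    # global thread id
    if i < len(x):                            # guard: the last block may be partial
        out[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Sequential emulation of a CUDA kernel launch: every (block, thread)
    pair runs the same kernel body on a different piece of the data (SPMD)."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel((b, block_dim, t), *args)

n = 1000
x = np.arange(n, dtype=np.float64)
y = np.ones(n)
out = np.zeros(n)
block_dim = 256
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover all n elements
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
```

On a real GPU the two loops in `launch` run concurrently across multiprocessors; the index arithmetic and the bounds guard are identical.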

D. Test Data and (BP) Compression Results


Our test data consists of six MRI images, shown in Figure 1, used to measure performance on the CPU and GPU with the help of the profiler. Figures 2 and 3 show the profiler results for the CPU- and GPU-based implementations. It is clear from the figures that a significant speedup is obtained for basis pursuit on the GPU in comparison to the CPU implementation.

(a) MRI 1

(b) MRI 2

(c) MRI 3

(d) MRI 4

(e) MRI 5

(f) MRI 6

Figure 1: Test data images used to measure performance


Figure 2: Profile on CPU

Figure 3: Profile on GPU

IV. EXPERIMENTS AND RESULTS

In this section, we first compare the runtimes of our GPU and CPU implementations for datasets of different sizes and present our speedup. Then, we show visual results by providing slices from one of the datasets. Next, we provide the breakdown of the GPU implementation runtime into the CUDA kernels and present GFLOP (giga floating point operations = 10^9 FLOP) and GiB (gibibytes = 2^30 bytes) instruction and data throughput information. This is followed by a discussion of the factors that limit our performance. Finally, we compare our implementation with Sharp et al.'s implementation and highlight our improvements.

We measure the performance on the images on the CPU and GPU with the help of the profiler; Figures 2 and 3 show the profiler results for the CPU- and GPU-based implementations. It is clear from the figures that a significant speedup is obtained: we achieved around 8X speedup on the GPU over a single-threaded CPU implementation. The speedup is calculated using the following formula:

Speedup = (time taken to compress the image on the CPU) / (time taken to compress the image on the GPU).
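Applying this formula directly to the BP_DECOMPOSE timings reported in Table 1 gives the per-image speedups (a small Python check; the values are taken from the table):

```python
# BP_DECOMPOSE timings (seconds) for MRI 1-6, taken from Table 1.
cpu_times = [105.20, 123.49, 125.159, 84.25, 106.10, 124.50]
gpu_times = [14.06, 20.31, 18.19, 11.90, 16.31, 21.02]

# Speedup = time on CPU / time on GPU, per the formula above.
speedups = [c / g for c, g in zip(cpu_times, gpu_times)]
```

On the decompose step alone, the per-image speedups range from roughly 5.9X to 7.5X.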


Table 1: Performance of the GPU implementation with respect to the CPU implementation

S. No. | Image | Size      | Time to run program on CPU (sec) | Time to run BP_DECOMPOSE on CPU (sec) | Time to run BP_DECOMPOSE on GPU (sec) | Depth
1      | MRI 1 | 256 x 256 | 118.52                           | 105.20                                | 14.06                                 | 512
2      | MRI 2 | 256 x 256 | 136.47                           | 123.49                                | 20.31                                 | 1024
3      | MRI 3 | 256 x 256 | 139.56                           | 125.159                               | 18.19                                 | 1024
4      | MRI 4 | 256 x 256 | 98.87                            | 84.25                                 | 11.90                                 | 512
5      | MRI 5 | 256 x 256 | 120.34                           | 106.10                                | 16.31                                 | 512
6      | MRI 6 | 256 x 256 | 138.47                           | 124.50                                | 21.02                                 | 1024

Figure 4: Comparative graphical results on CPU & GPU

V. CONCLUSION

The work presented in this paper addresses one of the major problems in the clinical application of compressed sensing to biomedical imaging, namely its high computational cost, by using the currently available Jacket software and an NVIDIA GeForce 8400M GS GPU with 256 MB of DDR2 video memory and a 64-bit bus width. We reduced the execution time from 50 seconds to 6 seconds for several biomedical images, obtaining a speedup of about 8X.


REFERENCES

[1] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressed sensing," in IEEE ICASSP, 2008.
[2] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
[3] http://www.dsp.ece.rice.edu/cs.
[4] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, 2006.
[5] M. Lustig, J. M. Santos, J.-H. Lee, D. L. Donoho, and J. M. Pauly, "Application of compressed sensing for rapid MR imaging," in SPARS, Rennes, France, 2005.
[6] D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. F. Kelly, and R. G. Baraniuk, "A new compressive imaging camera architecture using optical-domain compression," in Computational Imaging IV, Proceedings of SPIE-IS&T Electronic Imaging, vol. 6065, 2006.
[7] M. F. Duarte, S. Sarvotham, D. Baron, M. B. Wakin, and R. G. Baraniuk, "Distributed compressed sensing of jointly sparse signals," in 39th Asilomar Conference on Signals, Systems and Computers, pp. 1537–1541, 2005.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, pp. 33–61, 1998.
[9] E. Candès and T. Tao, "Near optimal signal recovery from random projections: universal encoding strategies," IEEE Trans. Inf. Theory, vol. 52, pp. 5406–5425, 2006.
[10] D. L. Donoho and J. Tanner, "Thresholds for the recovery of sparse solutions via L1 minimization," in 40th Annual Conference on Information Sciences and Systems, pp. 202–206, 2006.
