Low-Power and High-Speed VLSI Architecture For Lifting-Based Forward and Inverse Wavelet Transform

Xuguang Lan, Nanning Zheng, Senior Member, IEEE, Yuehu Liu

the columns. This approach needs a large memory to store intermediate results, whereas the line-based architectures [7]-[9] minimize the memory. W. Chang [9] describes a line-based architecture for the 2-D DWT using the lifting scheme, which consists of one row filter, one column filter and on-chip memory. It requires 9N storage cells and O(N²) clock cycles (ccs) to compute an N×N image using the 9/7 wavelet. K. Andra [8] generalizes the lifting-based architecture with two row processors, two column processors and two memory modules, but the memory control logic of that architecture is complex, and O(N²) ccs are needed for an N×N image. G. Dillen [7] describes a combined line-based architecture for the IDWT which needs two row processors; the resource requirement has to be increased to compute an N×N image in O(N²/2) ccs.

This paper describes a new line-based, low-power, parallel architecture for computing the 2-D DWT/IDWT of JPEG2000 using the lifting scheme, including the (5,3) filter (the high-pass and the low-pass filters have five taps and three taps, respectively) and the (9,7), (13,7), (2,2), (2,6), (2,10), (7,5), (10,18) and (6,10) filters [11]. Since JPEG2000 Part I is the core system of the standard, we focus on the configuration of the default filters, (5,3) and (9,7). The proposed architecture, which contains one row processor to compute along the rows and one column processor to compute along the columns, performs the multilevel decomposition of the DWT and IDWT, one level at a time, in row-column fashion. The multilevel DWT can be further optimized by memory control to increase the speed and reduce the memory for DWT coefficients. Both the row and column processors consist of sub-filters, and each sub-filter is equivalent to one lifting step of the lifting scheme. The row processor is time-multiplexed. Two lines are computed at a time, and the horizontal and vertical filtering are performed in parallel. The edge extension is implemented by an embedded circuit. Shift-add operations that exploit common factors are substituted for the multiplications. The whole architecture is pipelined, and the control circuit is simple; an N×N image is computed in O(N²/2) ccs.

This paper is organized as follows. Section II presents the theoretical concepts required for our architecture, describing the lifting scheme. Section III describes the DWT and IDWT architecture, including the row processor and the column processor. Experimental results are presented in Section IV. The conclusion is summarized in Section V.

II. LIFTING SCHEME

Abstract: A low-power, high-speed architecture that performs the two-dimensional forward and inverse discrete wavelet transform (DWT) for the set of filters in JPEG2000 is proposed, using a line-based lifting scheme. It consists of one row processor and one column processor, each of which contains four sub-filters. The row processor, which is time-multiplexed, operates in parallel with the column processor. Optimized shift-add operations are substituted for multiplications, and edge extension is implemented by an embedded circuit. The whole architecture, which is pipelined to increase speed and achieve higher hardware utilization, has been demonstrated in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz. The architecture can be used as a compact and independent IP core for JPEG2000 VLSI implementation and various real-time image/video applications.

Index Terms: 2-D DWT, Parallel Architecture, JPEG2000, Lifting Scheme, VLSI.

I. INTRODUCTION

Digital image compression engines are increasingly needed to store or transmit large image/video data in the multimedia world. As a new-generation image compression standard [1]-[3], JPEG2000 has many features, such as progressive transmission by quality and resolution and ROI (Region of Interest) coding. These features are possible because the discrete wavelet transform (DWT) is used for the spatial decomposition, which supports the scalability of the code stream. The DWT is implemented by the lifting scheme in JPEG2000. Compared to traditional convolution, the computational complexity of the lifting scheme [4][5] is reduced by about half. The two-dimensional DWT is usually implemented directly, as in [6], that is, by first computing the image along the lines, then along

This work was supported in part by the National High-Tech Program “863” of P.R. China under Grant No. 2002AA103011 and Creative Foundation of Nature Science No. 60021302. Xuguang Lan is with the Institute of Artificial Intelligence and Robotics, Xi’an JiaoTong University, Xi’an, P.R. China (e-mail: xglan@aiar.xjtu.edu.cn). Nanning Zheng is with the Institute of Artificial Intelligence and Robotics, Xi’an JiaoTong University, Xi’an, P.R. China (e-mail: nnzheng@mail.xjtu.edu.cn). Yuehu Liu is with the Institute of Artificial Intelligence and Robotics, Xi’an JiaoTong University, Xi’an, P.R. China (e-mail: liuyh@aiar.xjtu.edu.cn). Contributed Paper. Manuscript received December 3, 2004.

The lifting scheme is called the second-generation wavelet [5] because its construction of biorthogonal wavelets does not require the Fourier transform. The lifting scheme,

0098 3063/05/$20.00 © 2005 IEEE

IEEE Transactions on Consumer Electronics, Vol. 51, No. 2, MAY 2005

which leads to a faster, fully in-place implementation of the wavelet transform, is attractive for both high-throughput and low-power applications. Every finite wavelet transform can be factorized into a lifting scheme [4], and the basic principle is to factorize the polyphase matrix of the wavelet or subband filters into a sequence of alternating upper and lower triangular matrices and a diagonal matrix using the Euclidean algorithm. This leads to the wavelet implementation by means of banded-matrix multiplications.

Let the polyphase matrix of the wavelet be

$$\tilde{P}(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix}$$

where $\tilde{h}_e(z)$ and $\tilde{h}_o(z)$ denote the even part and odd part of the low-pass analysis filter $\tilde{h}(z)$, and $\tilde{g}_e(z)$ and $\tilde{g}_o(z)$ are the even part and odd part of the high-pass analysis filter $\tilde{g}(z)$ of the wavelet filter. The polyphase matrix can be factorized into the following formula by using the Euclidean algorithm:

$$\tilde{P}(z) = \begin{bmatrix} \tilde{h}_e(z) & \tilde{h}_o(z) \\ \tilde{g}_e(z) & \tilde{g}_o(z) \end{bmatrix} = \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

The dual polyphase matrix is given by [4]

$$P(z) = \prod_{i=1}^{m} \begin{bmatrix} 1 & 0 \\ -s_i(z^{-1}) & 1 \end{bmatrix} \begin{bmatrix} 1 & -t_i(z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1/K & 0 \\ 0 & K \end{bmatrix}$$

where K is a constant. The lifting scheme is shown in Fig. 1 and consists of four steps:
1) Split step, where the signal is split into even and odd points, because the maximum correlation between adjacent pixels can be utilized for the next predict step.
2) Predict step, where the even samples are multiplied by the time domain equivalent of t(z) and are added to the odd samples.
3) Update step, where the odd samples multiplied by the time domain equivalent of s(z) are added to the even samples.
4) Scaling step, where the even samples are multiplied by 1/K and the odd samples by K.

The IDWT is obtained only by traversing in the inverse direction, changing the factor K to 1/K and the factor 1/K to K, and reversing the signs of the coefficients in t(z) and s(z).

Fig. 1. Lifting Scheme (forward DWT: split, t(z), s(z), scaling by 1/K and K, outputs LP and HP; inverse DWT: the reverse path ending in merge).

A biorthogonal wavelet, for example the Daubechies 9/7 wavelet filter [1], is factorized into the lifting scheme as

$$\tilde{P}(z) = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

The lifting scheme of the wavelet filter computing a one-dimensional signal is shown in Fig. 2.

Fig. 2. Lifting scheme of Daubechies 9/7 wavelet. Coefficients: α = −1.586134342, β = −0.052980118, γ = 0.882911075, δ = 0.443506852, K = 1.230174105.

The implementation of the D’9/7 forward and inverse DWT is factorized into a six-step lifting scheme.

Forward DWT:
step1: Y(2n+1) = X_ext(2n+1) + α × (X_ext(2n) + X_ext(2n+2)),   i0 − 3 ≤ 2n+1 < i1 + 3
step2: Y(2n) = X_ext(2n) + β × (Y(2n−1) + Y(2n+1)),   i0 − 2 ≤ 2n < i1 + 2
step3: Y(2n+1) = Y(2n+1) + γ × (Y(2n) + Y(2n+2)),   i0 − 1 ≤ 2n+1 < i1 + 1
step4: Y(2n) = Y(2n) + δ × (Y(2n−1) + Y(2n+1)),   i0 ≤ 2n < i1
step5: Y(2n+1) = −K × Y(2n+1),   i0 ≤ 2n+1 < i1
step6: Y(2n) = Y(2n) / K,   i0 ≤ 2n < i1

Inverse DWT:
step1: X(2n) = K × X_ext(2n),   i0 − 3 ≤ 2n < i1 + 3
step2: X(2n+1) = −X_ext(2n+1) / K,   i0 − 2 ≤ 2n+1 < i1 + 2
step3: X(2n) = X(2n) − δ × (X(2n−1) + X(2n+1)),   i0 − 3 ≤ 2n < i1 + 3
step4: X(2n+1) = X(2n+1) − γ × (X(2n) + X(2n+2)),   i0 − 2 ≤ 2n+1 < i1 + 2
step5: X(2n) = X(2n) − β × (X(2n−1) + X(2n+1)),   i0 − 1 ≤ 2n < i1 + 1
step6: X(2n+1) = X(2n+1) − α × (X(2n) + X(2n+2)),   i0 − 1 ≤ 2n+1 < i1

where i0 and (i1 − 1) denote the index of the first sample and the last sample of one row or column. From Fig. 2, we can derive that the DWT and IDWT are symmetrical.

III. VLSI ARCHITECTURE OF 2-D DWT

The architecture of the 2-D DWT is described in Fig. 3. Two lines of image data or LL subband data are routed in the Control Unit (CU). The function of the CU is to route the input data and to control the next level of decomposition according to the current level of decomposition and the number of levels required. After the data are encoded by the row processor (RP) and the column processor (CP), respectively, the address generator generates the corresponding addresses of the output data of the CP. The CP outputs two pixels at one clock, for example LL and LH, or HL and HH subband data. Time-multiplexing the row processor not only increases the utilization, but also minimizes the storage cells.
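The six-step forward and inverse D’9/7 recursions of Section II can be checked numerically with a short sketch. The code below is an illustration only, not the hardware design of this paper: the function names are hypothetical, floating-point arithmetic replaces the 24-bit fixed-point data path, and the boundary is handled by reflecting indices, which is assumed here to be equivalent to the symmetric extension for a signal whose first index is even.

```python
# Lifting coefficients of the Daubechies 9/7 wavelet (values from Fig. 2).
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
K = 1.230174105


def _reflect(i, n):
    """Whole-sample symmetric reflection of index i into [0, n). Assumes n >= 2."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def _lift(y, start, coeff):
    """In place: y[i] += coeff * (y[i-1] + y[i+1]) for i = start, start+2, ..."""
    n = len(y)
    for i in range(start, n, 2):
        y[i] += coeff * (y[_reflect(i - 1, n)] + y[_reflect(i + 1, n)])


def forward_97(x):
    """Forward DWT, steps 1 to 6; even outputs are low-pass, odd are high-pass."""
    y = [float(v) for v in x]
    _lift(y, 1, ALPHA)   # step 1: predict odd samples
    _lift(y, 0, BETA)    # step 2: update even samples
    _lift(y, 1, GAMMA)   # step 3: predict odd samples
    _lift(y, 0, DELTA)   # step 4: update even samples
    # steps 5 and 6: scale odd samples by -K and even samples by 1/K
    return [-K * v if i % 2 else v / K for i, v in enumerate(y)]


def inverse_97(y):
    """Inverse DWT: undo the scaling, then the lifting steps in reverse order."""
    x = [-v / K if i % 2 else K * v for i, v in enumerate(y)]  # steps 1 and 2
    _lift(x, 0, -DELTA)  # step 3
    _lift(x, 1, -GAMMA)  # step 4
    _lift(x, 0, -BETA)   # step 5
    _lift(x, 1, -ALPHA)  # step 6
    return x
```

Because every lifting step only adds a function of the samples of the other parity, each step is undone exactly by subtracting the same quantity, which is why `inverse_97(forward_97(x))` returns `x` up to floating-point rounding.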

The address generator produces the addresses, and the output data are stored in memory by the memory control. The memory control unit is not the purpose of this article and will not be discussed. The architecture performs not only the high-pass and low-pass filters of the RP and CP in parallel, but also the horizontal and vertical decomposition in parallel.

Fig. 3. The Architecture of 2-D DWT (blocks: control unit, row processor, column processor, address generator, memory control and memory; inputs: image data or LL subband data as even and odd lines; outputs: LL/HL and LH/HH).

A. Optimize multiplication using shift-add operation

Because multipliers occupy a great amount of area and hardware resources, which is not appropriate for chip design, and the coefficients of the wavelet filter are constants, we substitute shift-add operations for the multiplications to optimize the implementation. Firstly, the coefficients of the wavelet filters, α, β, γ, δ, K and 1/K, are quantized into a binary representation and expressed through the bits with value ‘1’ in that representation, because each ‘1’ yields a term to be summed. For example, δ = (0.443506852)10 = (0.0111000110001)2, and the multiplication of δ and X is equivalent to the form

δ × X = X>>2 + X>>3 + X>>4 + X>>8 + X>>9 + X>>13.

Then the common factor is found. The above equation, for example, is simplified as follows:

δ × X = X>>2 + X>>13 + (X>>1 + X>>2)>>2 + (X>>1 + X>>2)>>7.

The multiplication is optimized by using shift and add operations in this way, and the other coefficients are processed in the same way. The scale and power of the circuit are greatly reduced in this way.

The architecture employs a 2’s complement fixed-point data representation which has 24 bits, 13 bits for the fractional part and 11 bits for the integral part. Compared to the floating-point data, the effect of the fixed-point data can be ignored, as shown in Fig. 4. The two compared images are goldhill and baboon, and the size is 512 × 512 × 8 bits.

Fig. 4. PSNR and Rate (PSNR in dB versus rate in bpp; curves: goldhill_float, goldhill_fix, baboon_float, baboon_fix).

B. Embedded Periodic Extension at the Boundaries

The finite length of the signal processed by the wavelet filter leads to the edge effect [10]. The JPEG2000 standard employs the symmetric extension at the boundaries to eliminate it, see Table I and Fig. 5a. In the implementation, the extension of the signal is embedded into the row processor and the column processor; in other words, the extension points are not actually calculated, because the computations of the extension points are the same as those of the original image samples, as shown in Fig. 5b. Multiplexers can be utilized to route the two input points for the adder, whose values are equal to those of the original samples in each lifting step. In Fig. 5b, for example, the forward extension samples numbered 4, 3, 2 and 1 at the left of the sample 0 are not computed in step1, and only the odd original samples are predicted. In step2, the original computed sample 1 is added to itself to update the even sample 0.

Fig. 5. Periodic extension at signal boundaries. (a) Periodic symmetric extension (sample sequence E D C B A B C D E F E D C, with markers i_left, i0, i1, i_right). (b) Embedded scheme of periodic extension (only the samples in the rectangle are computed; original points and updated points are shown for step 1 to step 6).

The embedded scheme is implemented by a finite state machine (FSM) and multiplexers. The FSM has four states: IDLE, FORWARD extension, NORMAL and LAST extension, see Fig. 6. The triggering signals of the state transitions are produced according to the line enable wire of the pixels. The forward extension signals ext_en1 and ext_en2, which are employed to enable the multiplexers in the sub-filters β and δ of the RP and CP, respectively, are generated in the state FORWARD.
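The shift-add decomposition of δ in Section III.A can be verified with a few lines of integer arithmetic. This is a sketch under stated assumptions: the helper names are hypothetical, and the input is pre-scaled by 2^13 (the 13 fractional bits of the data path) so that every right shift is exact and the two forms can be compared bit for bit.

```python
FRACTION_BITS = 13             # fractional bits of the 24-bit fixed-point format
DELTA_Q13 = 0b0111000110001    # delta quantized to 13 fractional bits


def to_fixed(x):
    """Scale an integer sample to fixed point with 13 fractional bits."""
    return x << FRACTION_BITS


def delta_times_direct(x_fixed):
    """Six-term form: one shifted copy of X for each '1' bit of delta."""
    return ((x_fixed >> 2) + (x_fixed >> 3) + (x_fixed >> 4)
            + (x_fixed >> 8) + (x_fixed >> 9) + (x_fixed >> 13))


def delta_times_shared(x_fixed):
    """Common-factor form: (X>>1 + X>>2) is computed once and reused twice."""
    common = (x_fixed >> 1) + (x_fixed >> 2)
    return (x_fixed >> 2) + (x_fixed >> 13) + (common >> 2) + (common >> 7)
```

In this sketch the six-term form needs five additions, while the common-factor form needs four (one for the shared term and three to combine the results); the shifts themselves are fixed wiring in hardware.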

The signals ext_en3 and ext_en4 of the last extension, which are applied to enable the multiplexers in the sub-filters α and γ, respectively, are produced in the state LAST extension. These signals are shown in Table II.

TABLE I
PERIODIC EXTENSION OF WAVELET AT THE BOUNDARIES

                 i_left                 i_right
           i0 even   i0 odd       i1 even   i1 odd
   5/3        2         1            1         2
   9/7        4         3            3         4

i_left denotes the extension samples towards the left; i_right denotes the extension samples towards the right.

TABLE II
CONTROL SIGNAL OF THE EXTENSION OF FORWARD AND BACK END

   State        Ext_en1   Ext_en2   Ext_en3   Ext_en4
   IDLE            0         0         0         0
   First_ext       1         1         0         0
   normal          0         0         0         0
   Last_ext        0         0         1         1

Fig. 6. Finite state machine (FSM) of extension (states: Idle, FORWARD, NORMAL, LAST; transition signals: reset, Initial_en, Normal_en, Last_en).

C. The Architecture of the Row Processor

The row processor consists of four sub-filters, α, β, γ and δ, as shown in Fig. 7. Two lines of pixels are first input into the sub-filter α in the row processor, and the even samples of each line are summed by utilizing the delay flip-flop (DFF) and multiplexers. After the summations are multiplied by the coefficient α of the sub-filter α (which is optimized to a shift-add operation), the products are added to the odd samples of each line. The control signal sel_en of the router, which triggers the multiplexers to route for the adders, is generated by the counter. The two lines of output samples of the sub-filter α are routed into the sub-filter β by the multiplexers, and the odd samples are summed by using the DFF. The summations are multiplied by the coefficient β and added to the even samples. The same optimization operations as in the sub-filter α are utilized. The computations of the sub-filters γ and δ are similar to those of the sub-filters α and β. The even and odd samples of each output line in the sub-filter δ are multiplied by 1/K and K, respectively. Filtering of the image along the line is finished in this way.

The row processor is optimized in the pipelined way, and the samples are encoded continuously as they are input. Pipelined computations are achieved via the registers. Two lines are calculated at a time, that is, two pixels are encoded at one clock. All the hardware resources of the row processor can be time-multiplexed. Time-multiplexing the row processor is implemented by computing the even lines at odd clocks and the odd lines at even clocks. Compared to the first input line (even line), the second input line (odd line) is delayed by one clock. The schedule of the sub-filter α is shown in Table III. Xi,j denotes the sample of line i and column j, where i is the line number: 0 denotes the even line of the two input lines and 1 the odd one. For instance, X0,2 expresses the third sample of the first line. SA-i denotes the multiplication of the coefficient of the wavelet filter with the summation at the previous clock. For example, at clock 3, add X0,2 to X0,0 and store the sum in a register. At clock 4, compute the multiplication SA-0 = α × (X0,2 + X0,0) and store the result in registers; at the same time, add X1,2 to X1,0 and store the sum in a register. The multiplication SA-1 = α × (X1,2 + X1,0) is computed at clock 5.

Compared to a row processor in which the input lines are partitioned into even and odd samples (which needs two parallel RPs), the proposed row processor is time-multiplexed at the cost of several registers, and the pixels of every line are not divided into even and odd samples when they are input into the row processor continuously. This reduces the storage cells and increases the speed of the row processor. Hardware utilization reaches approximately 100%, and the control logic is simple. Therefore, hardware resources are greatly cut down and utilization is high.

Fig. 7. Row Processor of the Forward DWT. (a) Prediction step 1, Sub-Filter α. (b) Update step 1, Sub-Filter β. (c) Prediction step 2, Sub-Filter γ. (d) Update step 2, Sub-Filter δ. Legend: multiplexer (0/1), register, adder (+), shift-add (<< +); α, β, γ, δ, K are coefficients of the wavelet filter.
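The time-multiplexing of the sub-filter α in the row processor can be illustrated with a small schedule generator. This is only a sketch: the function name and the textual operation labels are hypothetical, the clocks of the first sum, SA-0 and SA-1 follow the worked example in the text (clocks 3, 4 and 5), and the assumption that the final addition to the odd sample happens one clock after the shift-add is an inference from the add, multiply, add structure of each sub-filter.

```python
def alpha_schedule(n_samples, n_clocks):
    """Return {clock: [operations]} for one sub-filter alpha shared by two lines.

    Line 0 (even line) delivers sample j at clock j + 1; line 1 (odd line)
    is delayed by one clock, so the two lines never need the same unit at once.
    """
    schedule = {clock: [] for clock in range(1, n_clocks + 1)}
    for line in (0, 1):
        for j in range(2, n_samples, 2):        # even samples X(line, j)
            arrive = j + 1 + line               # clock at which X(line, j) arrives
            ops = [
                (arrive, f"sum: X{line},{j} + X{line},{j - 2}"),
                (arrive + 1, f"shift-add: SA-{line} = alpha * (X{line},{j} + X{line},{j - 2})"),
                (arrive + 2, f"add: SA-{line} + X{line},{j - 1}"),
            ]
            for clock, op in ops:
                if clock <= n_clocks:
                    schedule[clock].append(op)
    return schedule


if __name__ == "__main__":
    for clock, ops in alpha_schedule(8, 8).items():
        print(clock, ops)
```

In this model each of the three units (first adder, shift-add, second adder) is used at most once per clock, alternating between the even line and the odd line, which is the sense in which one row processor serves two lines.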

From Fig. 7, only eight adders and six shift-add operations are needed to implement the row filter of D’9/7. The shift registers do not increase the hardware consumption, because shift registers could be implemented by the combination logic of wire AND operations. Therefore, the proposed architecture of the row processor greatly reduces the hardware consumption and complexity.

TABLE III. SCHEDULE OF THE SUB-FILTER α IN RP (columns: clocks 1 to 8; rows: adder, shift-add, adder).

D. The Architecture of the Column Processor

The column processor performs the wavelet transform along the column, and the pixels processed are from the row processor. The two output lines of the row processor are the even line and the odd line for the column processor; in other words, the pixels are naturally separated into even samples and odd samples along the column. This leads to the difference between the column processor and the row processor. It reduces the storage cells and simplifies the design of the column processor. The column processor begins to calculate the samples after the first two lines finish computing in the row processor, and the column processor operates in parallel with the row processor (with a delay of only one line of clock cycles). We also optimize the column processor in the pipelined way to increase the speed.

Line buffers, which buffer the samples required by the column processor, are substituted for the registers of the row processor, as shown in Fig. 8. The function of the line buffer is identical to a FIFO (First in, First out) whose size is W×N, where W is the width of the data path and N is the image width; the size has no relation to the image height. Six line buffers for D’9/7 and three line buffers for 5/3 are required.

Firstly, two lines, the even line and the odd line, are processed in the sub-filter α. The schedule of the sub-filter α is shown in Table IV. At clock 2, for example, add X2,0 to X0,0 and store the sum in registers. At clock 3, read the sum from the registers and calculate the multiplication SA1 = α × (X2,0 + X0,0). At clock 4, add SA1 to X1,0 and store the sum in registers. Then the outputs of the sub-filter α are computed in the sub-filters β, γ and δ in turn. The schedule of the sub-filters β, γ and δ is similar to that of the sub-filter α. The computations of the sub-filters α and β are similar to those of the sub-filters γ and δ. The even and odd samples of each output line in the sub-filter δ are multiplied by 1/K and K, respectively. Filtering along the column is finished in this way. Two pixels, for example LL and LH subband pixels, or HL and HH, are output at one clock in the column processor. In Fig. 8, only eight adders and six shift-add operations are needed.

Fig. 8. Column Processor of the Forward DWT. (a) Prediction step 1, Sub-Filter α. (b) Update step 1, Sub-Filter β. (c) Prediction step 2, Sub-Filter γ. (d) Update step 2, Sub-Filter δ.

TABLE IV. THE SCHEDULE OF THE SUB-FILTER α IN CP (columns: clocks 1 to 7; rows: adder, shift-add, adder).

E. Modularity

Two lifting steps are required for the (5,3) wavelet [1], as follows.

$$\text{Step 1:}\quad Y(2n+1) = X_{ext}(2n+1) - \left\lfloor \frac{X_{ext}(2n) + X_{ext}(2n+2)}{2} \right\rfloor$$

$$\text{Step 2:}\quad Y(2n) = X_{ext}(2n) + \left\lfloor \frac{Y(2n-1) + Y(2n+1) + 2}{4} \right\rfloor$$

Therefore, we need only utilize the sub-filters α and β of the row and column processors of D’9/7 to perform the transform, replacing the coefficients of the D’9/7 wavelet with those of the (5,3) wavelet.
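The two reversible (5,3) lifting steps of Section III.E can be sketched in integer arithmetic as follows. The function names are hypothetical, the boundary is handled by index reflection (assumed equivalent to the symmetric extension when the first index is even), and the inverse shown here is simply the two steps undone in reverse order, which the text does not spell out.

```python
def _reflect(i, n):
    """Whole-sample symmetric reflection of index i into [0, n). Assumes n >= 2."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def forward_53(x):
    """Step 1 predicts the odd samples, step 2 updates the even samples."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):
        # Step 1: floor division by 2 is an arithmetic right shift by 1.
        y[i] = x[i] - ((x[_reflect(i - 1, n)] + x[_reflect(i + 1, n)]) >> 1)
    for i in range(0, n, 2):
        # Step 2: floor division by 4 is an arithmetic right shift by 2.
        y[i] = x[i] + ((y[_reflect(i - 1, n)] + y[_reflect(i + 1, n)] + 2) >> 2)
    return y


def inverse_53(y):
    """Undo step 2, then step 1; integer inputs are reconstructed exactly."""
    n = len(y)
    x = list(y)
    for i in range(0, n, 2):
        x[i] = y[i] - ((y[_reflect(i - 1, n)] + y[_reflect(i + 1, n)] + 2) >> 2)
    for i in range(1, n, 2):
        x[i] = y[i] + ((x[_reflect(i - 1, n)] + x[_reflect(i + 1, n)]) >> 1)
    return x
```

Since both divisions are by powers of two, each lifting step needs only adders and a fixed shift, which is consistent with reusing the sub-filters α and β with the multiplication reduced to wiring.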

See Fig. 9. Because the coefficients of the (5,3) wavelet are α = 1/2 = 2^(−1) and β = 1/4 = 2^(−2), the multiplication is reduced to shift operations. The proposed architecture has universality for the 2-D DWT based on the lifting scheme: the same sub-filters serve both the (5,3) and (9,7) wavelets.

Fig. 9. The Row Processor (a and b) and Column Processor (c and d) of the (5,3) Lifting Scheme. (a) Prediction step, Sub-Filter α. (b) Update step, Sub-Filter β. (c) Prediction step, Sub-Filter α. (d) Update step, Sub-Filter β.

F. The Architecture of IDWT

The architectures of the IDWT and DWT are symmetric because of the symmetry of the lifting scheme for the forward DWT and the inverse DWT. Therefore, the IDWT can utilize the hardware resources of the forward DWT, as shown in Fig. 10. The subband data are input into the control unit, and two columns of data are output. The filtering of the IDWT first performs the transform along the column, then along the line. After the data are computed by the column processor and the row processor, the synthesized image or LL data are stored in the memory via the memory control. The filtering order of the IDWT in the row and column processors is the sub-filters δ, γ, β and α in turn, see Fig. 11 and 12. The architectures of the sub-filters are the same as those of the forward DWT correspondingly, and the three-step algorithm (add-multiplication-add) of each sub-filter is shown in Fig. 11 and 12.

Fig. 10. The Architecture of IDWT (blocks: control unit, column processor, row processor, address generator, memory control and memory; input: LL, HL, LH, HH subband data as even and odd columns; output: LL/Image).

Fig. 11 and Fig. 12. Column Processor of the IDWT and Row Processor of the IDWT. (a) Sub-Filter δ. (b) Sub-Filter γ. (c) Sub-Filter β. (d) Sub-Filter α.

IV. IMPLEMENTATION

We have developed an RTL (Register Transfer Level) model of our architecture which is able to perform the forward and inverse DWT of D’9/7 and 5/3 in an FPGA. Two pixels per clock cycle can be encoded at 100 MHz, and the data path is 24 bits wide. Multilevel decomposition is implemented by the control unit, one level at a time. 25% of the total area of the main chip, which has about 25000 logic elements, is needed for the multilevel decomposition of the wavelet, which is implemented by the FSM. Only 5% of the total area is needed for the implementation of the (5,3) wavelet. The memory requirement depends on the width of the image to be processed and the data path. For a 512 × 512 × 8 bits image and three levels of decomposition using D’9/7, the speed is 558 frame/s and the size of the required memory is 72 Kbits.
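The memory and speed figures of Section IV can be related to the architecture parameters with a short calculation. This is a back-of-the-envelope sketch, not the authors' measurement: the function names are hypothetical, the memory is assumed to be exactly the six W×N line buffers of the column processor, and the frame-rate bound assumes that each further level processes only the LL subband and ignores pipeline latency and boundary handling.

```python
def line_buffer_bits(num_line_buffers, data_path_bits, image_width):
    """Total line-buffer storage: each buffer is a FIFO of W x N bits."""
    return num_line_buffers * data_path_bits * image_width


def ideal_frame_rate(clock_hz, pixels_per_clock, width, height, levels):
    """Upper bound on frames per second for a multilevel decomposition."""
    # Level l works on the LL subband, whose sides are halved l times.
    pixels = sum((width >> level) * (height >> level) for level in range(levels))
    return clock_hz * pixels_per_clock / pixels


if __name__ == "__main__":
    print(line_buffer_bits(6, 24, 512) / 1024, "Kbits")
    print(ideal_frame_rate(100e6, 2, 512, 512, 3), "frame/s (ideal bound)")
```

Under these assumptions the six 24-bit line buffers for a 512-pixel-wide image amount to 72 Kbits, matching the reported memory size, and the ideal bound of about 581 frame/s lies slightly above the reported 558 frame/s, as expected once overheads are included.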

V CONCLUSION

A low-power and high-speed architecture which performs the 2-D DWT/IDWT for JPEG2000 is proposed in this paper. The main advantages of the proposed architecture are as follows.
1) The time-multiplexed row processor and the line-based design minimize the on-chip storage cells and reduce the hardware consumption.
2) Two pixels are computed per clock, and the row processor performs in parallel with the column processor, which increases the utilization.
3) Optimized shift-add operations are substituted for multiplications, periodic extension at the boundaries is implemented by an embedded circuit, and the proposed architecture is optimized in a pipelined way. These reduce the quantity of computation and the power.
4) Universality for the 2-D DWT/IDWT based on the lifting scheme and multilevel decomposition are allowed through cascading the proposed architecture.

REFERENCES

[1] ISO/IEC JTC 1/SC 29/WG 1 N1646R, JPEG2000 part 1 final committee draft version 1.0, 2000.
[2] Taubman D., “High performance scalable image compression with EBCOT,” IEEE Transactions on Image Processing, vol. 9, no. 7, pp. 1158-1170, 2000.
[3] C. Christopoulos, A. Skodras, T. Ebrahimi, “The JPEG2000 still image coding system: An overview,” IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, 2000.
[4] Daubechies I., Sweldens W., “Factoring wavelet transforms into lifting schemes,” J. Fourier Anal. Appl., vol. 4, pp. 247-269, 1998.
[5] Sweldens W., “The lifting scheme: a new philosophy in biorthogonal wavelet constructions,” Proc. SPIE, 1995.
[6] M. Vishwanath, R. M. Owens, M. J. Irwin, “VLSI architecture for the discrete wavelet transform,” IEEE Trans. on Circuits and Systems II, vol. 42, no. 5, pp. 305-316, 1995.
[7] G. Dillen, B. Georis, J. D. Legat, O. Cantineau, “Combined line-based architecture for the 5-3 and 9-7 wavelet transform for JPEG2000,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 9, pp. 944-950, 2003.
[8] Andra K., Chakrabarti C., Acharya T., “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. on Signal Processing, vol. 50, no. 4, pp. 966-977, 2002.
[9] Wei-Hsin Chang, Yew-San Lee, Wen-shiaw Peng, Chen-Yi Lee, “A line-based, memory efficient and programmable architecture for 2D DWT using lifting scheme,” IEEE International Symposium on Circuits and Systems, 2001.
[10] K. Seth, S. Srinivasan, “VLSI implementation of 2-D DWT/IDWT cores using 9/7-tap filter banks based on the non-expansive symmetric extension scheme,” Proceedings of the 15th International Conference on VLSI Design, 2002.
[11] ISO_IEC_15444-2:2004(E), Information technology — JPEG 2000 image coding system: Extensions, 2004.

Xuguang Lan received the BS degree from the College of Automobile Engineering, Shan Dong University of Science and Technology in 1999, and the MS degree from the College of Transportation Engineering, Kunming University of Science and Technology in 2002. Now he is a PhD student of the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China. His research interests include image/video processing and VLSI design.

NanNing Zheng (SM’93) graduated in 1975 from the Department of Electrical Engineering, Xi’an Jiaotong University, Xi’an, China, received the ME degree in information and control engineering from Xi’an Jiaotong University in 1981, and the PhD degree in electrical engineering from Keio University, Japan, in 1985. He is currently a professor and the director of the Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. His research interests include computer vision, pattern recognition, computational intelligence, image processing, and hardware implementation of intelligent systems. Since 2000, he has been the Chinese representative on the Governing Board of the International Association for Pattern Recognition. He served as the general chair for the International Symposium on Information Theory and Its Applications in 2002, and the general co-chair for the International Symposium on Nonlinear Theory and Its Applications in 2002. He presently serves as executive editor of Chinese Science Bulletin. He became a member of the Chinese Academy of Engineering in 1999. He is a senior member of IEEE.

Yuehu Liu received a B.E. and an M.E. degree in computer science & engineering at Xi’an Jiaotong University, Xi’an, China in 1984 and 1989, respectively, and the PhD degree in electrical engineering from Keio University, Japan, in 2000. He is currently a professor and the vice director of the Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. His research interests include computer vision, pattern recognition, image processing, computational intelligence, and hardware implementation of intelligent systems.
