

Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames


Basant K. Mohanty, Member, IEEE and Pramod K. Meher, Senior Member, IEEE
Abstract: In this paper, we present a memory-efficient architecture for 3-D DWT using overlapped grouping of frames. The proposed structure does not involve any line-buffer or frame-buffer for 1-level 3-D DWT. It involves only a frame-buffer of size O(MN) to compute the multilevel 3-D DWT, unlike the existing folded structures, which involve a frame-buffer of size O(MNR). The saving of line-buffer and frame-buffer by the proposed structure for the implementation of the first-level DWT is a substantial advantage, since the frame-size is very often as large as 1920 × 1080 and the frame-rate varies from 15 to 60 fps. The proposed structure has a small cycle period and offers small output latency compared with the existing structures. Compared with the best of the available designs, the proposed design involves significantly fewer memory words. For frame-size 176 × 144 and frame-rate 60 fps, the proposed structure involves 7.96 times fewer memory words and 12.3% less average computation time (ACT) than the best of the existing folded designs. It involves 4.28 times fewer memory words than the recently proposed parallel design. The synthesis result for frame-size 176 × 144 and frame-rate 60 fps for the FPGA device 6VLX760FF1760-2 shows that the proposed structure involves 9.6 times fewer BRAMs and offers 2 times higher throughput than the folded design. It involves 1.9 times fewer BRAMs than the parallel design and offers nearly the same throughput rate. The proposed structure has a significantly smaller slice-delay product (SDP) than the existing structures. Owing to its lower memory complexity, the proposed structure dissipates significantly less dynamic power than the existing structures.

Index Terms: Discrete wavelet transform, 3-dimensional DWT, overlapping frames, parallel and pipeline architecture, VLSI.

I. INTRODUCTION

Three-dimensional (3-D) discrete wavelet transform (DWT) is applied in video compression, compression of 3-D and 4-D medical images, volumetric image compression, video watermarking and many other applications [1]-[6].

The generic structure for the computation of multilevel 3-D DWT based on the popularly used separable approach is shown in Fig. 1, where the intra-frame DWT is performed row-wise and then column-wise by the row-processor and the column-processor, respectively, and the inter-frame computation is performed by the temporal-processor. As shown in the figure, the 3-D DWT structure is comprised of two types of hardware components: (i) combinational components and (ii) memory/storage components. The combinational component consists mainly of arithmetic circuits and multiplexors, and the memory component consists of a frame-memory, temporal-memory, registers and transposition-memory. The frame-memory is usually external to the chip, while the temporal-memory may either be on-chip or external. The on-chip transposition-memory stores the intermediate values resulting after the row processing, while the temporal-memory stores the intermediate values resulting after the column processing of a set of successive frames. The frame-memory is used for storing the low-low-low (LLL) subband to compute the multilevel 3-D DWT level-by-level [7].
Manuscript submitted on January 26, and revised on May 23 and July 6, 2011. This paper was recommended by Associate Editor Tong Zhang. B. K. Mohanty is with the Dept. of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (email: bk.mohanti@juet.ac.in). P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (email: pkmeher@i2r.astar.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/).



Fig. 1. Generic structure of multilevel 3-D DWT computation (row processor, column processor and temporal processor, with transposition memory, temporal memory and frame memory; the low-low-low subband is fed back through the frame memory).
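For readers who want a behavioural reference for the data flow of Fig. 1, the following minimal NumPy sketch applies the separable analysis filters row-wise, column-wise and then temporally, each followed by down-sampling by 2. It is an illustration only, not the hardware architecture: boundary handling is simplified to 'same'-mode convolution, the Haar taps are used purely for the demo, and the subband labels follow the row/column/temporal filtering order.

```python
import numpy as np

def analysis_1d(data, taps, axis):
    # 'same'-mode convolution followed by down-sampling by 2 (simplified boundaries)
    return np.apply_along_axis(lambda v: np.convolve(v, taps, mode='same')[::2],
                               axis, data)

def dwt3d_one_level(frames, h, g):
    """frames: (R, M, N) group of frames -> dict of the eight 3-D subbands."""
    subbands = {'': frames}
    for axis in (2, 1, 0):                      # row-wise, column-wise, temporal
        subbands = {name + tag: analysis_1d(arr, taps, axis)
                    for name, arr in subbands.items()
                    for tag, taps in (('l', h), ('h', g))}
    return subbands

h = np.array([1.0, 1.0]) / np.sqrt(2)           # Haar taps, only for the demo
g = np.array([1.0, -1.0]) / np.sqrt(2)
out = dwt3d_one_level(np.random.rand(8, 16, 16), h, g)
print(sorted(out), out['lll'].shape)            # eight subbands, each (4, 8, 8)
```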

The size of the frame-memory is M × N × R, while the size of the temporal-memory is K × M × N and the transposition-memory is of size K × N, where M and N are the height and width of each video frame, R is the number of frames in a group of frames (GOF), and K is the order of the wavelet filter. In general, the complexity of the combinational component of the 3-D DWT structure depends on the filter order, which is usually small, while the complexity of the memory component depends on the frame-size. Since the frame-size for practical video applications may vary from 176 × 144 (for low-end mobile phones) to 1920 × 1080 (for HDTV), the computation of 3-D DWT is highly memory-intensive, and it is a challenging task to implement the complete 3-D DWT in a single chip. On the other hand, communication with external memory heavily degrades the speed and power performance of the whole system. A few computing schemes and architectures have been suggested to reduce the memory requirement of 3-D DWT [8]-[10], [13], [15], [16]. A more detailed discussion of these designs is given in [16]. The existing computation schemes are still found to involve very large memory. Also, it is observed that the existing block-by-block methods degrade the PSNR quality due to blocking artifacts, since the transformation of an infinite video sequence into independent GOFs introduces noise at the boundaries, which results in loss of PSNR. To avoid this data loss, the DWT needs to be performed continuously on the infinite video sequence (called the running 3-D DWT) [11]. Keeping this in view, Das et al. [12]-[14] have suggested a scan-based architecture for the running 3-D DWT of an infinite sequence of GOFs. But the memory requirement of this structure is significantly higher than that of [15]. The transform coder should decompose the 3-D signal in multiple levels to achieve a higher compression ratio; but most of the existing designs [12]-[14] compute only the 1-level running 3-D DWT of the infinite video sequence.
The structure of [15], however, computes the multilevel 3-D DWT by a level-by-level approach (similar to the folded scheme proposed by Wu et al. [7]), using an external frame-buffer of size (M × N × R)/8. The other existing structures, which transform the 3-D signal into independent GOFs, can compute the multilevel 3-D DWT in time-multiplexed form in a folded structure. But the multilevel 3-D DWT of an infinite sequence of GOFs cannot be implemented by such a folded method using a limited frame-buffer. Since the sizes of the transposition-memory and temporal-memory remain unchanged with the input-block size, they could be utilized efficiently by calculating more DWT coefficients concurrently. The frame-buffer size could be reduced by using separate computing blocks for different decomposition levels. To take care of this issue, we have recently proposed a parallel architecture for multilevel 3-D DWT [16], which computes the multilevel running 3-D DWT on an infinite sequence of GOFs and overcomes the limitations of the other existing structures. However, it has some inherent problems associated with the selection of the input-block size (Q) for a given frame. The input-block size needs to be an integer multiple of the frame width (N), and to achieve 100% hardware utilization efficiency (HUE), the minimum block size for J-level DWT is 2^(3J-2). For example, the HUE of the pipeline structure for J = 3 is 100% for Q = 128, 98.64% for Q = 64, and 91.25% for Q = 16. From this observation we can infer that the higher the input block-size, the better the resource utilization of the structure. But the structure demands more device resources and I/O for higher block-sizes. If the application is resource-constrained, it can, however, be implemented for lower block-sizes with less than 100% HUE. To overcome the above difficulties, we propose here an alternative approach to compute the multilevel running 3-D DWT on an infinite sequence of GOFs. We have derived a pipeline architecture using the proposed scheme, which involves less on-chip and off-chip memory than the existing structures. Interestingly, the input block-size of the proposed structure is independent of the number of DWT levels. The key ideas we have used in our current approach are:

While grouping the frames, some boundary frames of consecutive groups of frames are overlapped, in order to avoid temporal memory, which would otherwise have been needed by the temporal processor.

Row-column processing of each input frame is scheduled suitably, such that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition.

Processing of overlapping frames involves some redundant computation. The number of frames to be overlapped at different DWT levels is, therefore, decided carefully so that the computational overhead is not very high.

DWT computations of the higher levels are time-multiplexed to minimize the overall hardware of the structure.

Fig. 2. (a) Conventional method for computing 2-level running 3-D DWT (with transposition-memory of size KN and temporal-memory of size KMN for level 1, and of sizes KN/2 and KMN/4 for level 2). (b) Computing method of 1-level running 3-D DWT using extended overlapping frames. Legend: RP: row-processor, CP: column-processor, TP: temporal-processor; K is the size of the wavelet filter; Fk is the k-th input frame; M and N are, respectively, the height and width of the input frames. Overlapped frames are shown in grey. The grey-colored RPs and CPs perform redundant computations.

It is shown that the proposed structure involves (32NR/15M) times less on-chip memory than the folded structure of [15] and 2.45N times less on-chip memory than the parallel structure of [16]. The memory saving of the proposed structure results in a substantial reduction of the overall area, since the frame-size and frame-rate of the videos used in various video applications could be very large. The remainder of the paper is organized as follows: memory-efficient 3-D DWT computation using overlapped groups of frames is presented in Section II. The proposed architecture for multilevel 3-D DWT is described in Section III. Optimized implementation of the subcells of the proposed structure based on the Daubechies four-tap wavelet is discussed in Section IV. The hardware complexity and performance of the proposed structure are estimated and compared with the existing architectures in Section V. Conclusions are presented in Section VI.

Fig. 3. Computing method of 2-level running 3-D DWT using extended overlapping frames. Overlapped frames are shown in grey. Processors shown in grey perform redundant computations. Legend: RP: row-processor, CP: column-processor, TP: temporal-processor; K is the size of the wavelet filter; M and N are, respectively, the height and width of the input frames.

II. MEMORY-EFFICIENT 3-D DWT USING GROUPS OF OVERLAPPED FRAMES

As shown in Fig. 2(a), the temporal-processor (TP) performs down-sampled filter computation on the intermediate frames generated by the column-processor (CP). To transform each intermediate frame, the TP performs filter computation on a group of K successive intermediate frames (the current frame and K − 1 past frames), where K is the order of the wavelet filter. Note that the successive groups of K frames corresponding to the temporal computation of two consecutive output frames would overlap by K − 1 frames; since the TP performs down-sampled filter computation, the successive groups of intermediate frames overlap by K − 2 frames. According to the conventional method, the temporal-memory stores K successive intermediate frames for the TP. Instead, the required intermediate frames could be made available to the TP if a group of K extended frames is transformed concurrently by the row- and column-processors and the successive groups of extended frames are overlapped
by (K − 2) frames. As shown in Fig. 2(b), due to frame overlapping, the row and column DWT computation corresponding to the overlapping frames, however, involves some redundant computation. The transposition- and temporal-memory of 1-level 3-D DWT can be avoided completely at the cost of an extra (K − 2) pairs of row- and column-processors to process the overlapping frames. In spite of using additional row- and column-processors, overlapping the groups of frames has the potential to provide substantial saving in hardware by eliminating the transposition-memory of O(N) and the temporal-memory of O(MN), since the size of the wavelet filter is usually very small compared with the frame size of all practical videos.
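The idea can be summarized by the following behavioural sketch (not the authors' RTL): every group of K consecutive input frames is row/column transformed on the fly and consumed at once by the temporal filter, so no post-CP intermediate frame is buffered; successive groups start 2 frames apart, i.e., they overlap by K − 2 frames, and the overlapped frames are simply re-processed. Only the temporally low-pass branch is shown, and the filter taps and the identity stand-in for the RP/CP are placeholders.

```python
import numpy as np

K = 4                                    # wavelet filter order (e.g., Daub-4)
h = np.ones(K) / K                       # placeholder low-pass taps (assumption)

def row_col_dwt(frame):
    return frame                         # stand-in for the RP + CP processing

def temporal_lowpass(frames):
    """frames: (T, M, N) running input; yields one filtered frame per group."""
    T = frames.shape[0]
    for start in range(0, T - K + 1, 2):             # stride 2 -> K-2 frame overlap
        group = np.stack([row_col_dwt(frames[start + k]) for k in range(K)])
        yield np.tensordot(h, group, axes=1)          # consumed without buffering

outs = list(temporal_lowpass(np.random.rand(10, 4, 4)))
print(len(outs))                                      # (10 - K)//2 + 1 = 4 output frames
```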
TABLE I
INPUT AND OUTPUT FRAMES TO ELIMINATE TEMPORAL-MEMORY

(a) One output frame at a time:
IF | OF
IF1 to IFK | OF1
IF3 to IFK+2 | OF2
IF5 to IFK+4 | OF3
: | :
IFn to IFn+K-1 | OFK

(b) Group of output frames:
IF | OF
IF1 to IFK+2 | OF1 and OF2
IF1 to IFK+4 | OF1 to OF3
IF1 to IFK+6 | OF1 to OF4
: | :
IF1 to IF3K-2 | OF1 to OFK

LEGEND: IF: input-frame, OF: output-frame, n = 2K − 1.

TABLE II
NUMBER OF OVERLAPPING INPUT FRAMES FOR PRODUCING OUTPUT FRAMES WITH DIFFERENT OVERLAPPING SIZES

Number of input frames | Number of output frames | Overlapping input frames | Overlapping output frames
K | 1 | K − 2 | 0
K + 2 | 2 | K − 2 | 0
K + 4 | 3 | K | 1
K + 6 | 4 | K + 2 | 2
: | : | : | :
3K − 2 | K | 3K − 6 | K − 2

The groups of input frames that are to be processed in parallel for temporal-memory-free computation are given in Table I. From Table I, we find that the successive groups of input frames are required to be overlapped by K − 2 frames. Note that the successive groups of output frames in this case do not overlap. To produce overlapped groups of output frames, the number of overlapping frames of the input GOFs needs to be increased. Table II shows the required number of input frames to be overlapped to produce the overlapped output GOFs. Using Tables I and II, one can find the number of frames in a GOF to be processed in parallel to avoid temporal-memory in the multilevel 3-D DWT computation. The number of overlapping frames for decomposition level-1 and level-2 is shown in Table III. The overlapping input frames of level-1 DWT are shown in grey in Fig. 2(b), and those of level-2 DWT in Fig. 3.
TABLE III
OVERLAPPING INPUT FRAMES FOR TWO-LEVEL DECOMPOSITION

Decomposition level | Number of input frames | Number of overlapping frames | Number of output frames
level-1 | 3K − 2 | 3K − 6 | K
level-2 | K | K − 2 | 1
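The pattern of Tables I-III can be restated by a small helper (an illustrative restatement under the counting convention above, not part of the paper): producing p temporally down-sampled outputs from one group needs K + 2(p − 1) input frames, and making the output groups themselves overlap by p − 2 frames (as required by the next level) raises the input overlap accordingly.

```python
def group_sizes(K, p):
    """Frames needed per group to produce p temporally down-sampled outputs."""
    n_in = K + 2 * (p - 1)                     # input frames in the group
    out_overlap = max(p - 2, 0)                # output frames shared with next group
    in_overlap = (K - 2) + 2 * out_overlap     # input frames shared with next group
    return n_in, in_overlap, out_overlap

K = 4                                          # e.g., Daub-4
for p in (1, 2, 3, K):
    print(p, group_sizes(K, p))
# p = K reproduces the level-1 row of Table III: (3K-2, 3K-6, K-2)
```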

TABLE IV
COMPARISON OF MEMORY SAVING AND HARDWARE COST INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING THE PROPOSED OVERLAPPING GROUPS OF FRAMES FOR WAVELET FILTERS OF DIFFERENT SIZES

Filter | GOF (K) | Level-1: OVF (= K − 2) | Level-1: memory saved (KN(M + 1)) | Level-1: multipliers | Level-1: adders | Level-2: GOF (= 3K − 2) | Level-2: OVF (= 3K − 6) | Level-2: memory saved (KN(5M − 2K + 6)/4) | Level-2: multipliers | Level-2: adders
Haar | 2 | 0 | 2N(M+1) | 0 | 0 | 4 | 0 | N(5M+2)/2 | 0 | 0
Daub-4 | 4 | 2 | 4N(M+1) | 32 | 32 | 10 | 6 | N(5M−2) | 144 | 144
Daub-6 | 6 | 4 | 6N(M+1) | 96 | 96 | 16 | 12 | 3N(5M−6)/2 | 432 | 432
Daub-8 | 8 | 6 | 8N(M+1) | 192 | 192 | 22 | 18 | 2N(5M−10) | 864 | 864
Daub-10 | 10 | 8 | 10N(M+1) | 320 | 320 | 28 | 24 | 5N(5M−14)/2 | 1440 | 1440
5/3 | 5 | 3 | 5N(M+1) | 30 | 48 | 13 | 9 | 5N(5M−4)/4 | 135 | 216
9/7 | 9 | 7 | 9N(M+1) | 126 | 224 | 25 | 21 | 9N(5M−12)/4 | 567 | 1008

LEGEND: GOF: group of frames; OVF: overlapping frames. Memory is in words; the multipliers and adders are the combinational overhead for processing the overlapping frames. It is assumed that the symmetry of the 5/3 and 9/7 filter coefficients is exploited and that these filters are implemented using the convolution method. For the 5/3 and 9/7 filters, K equals 5 and 9, while the level-1 OVF equals 3 and 7, respectively. Since the row processor does not involve any data registers and the column processor involves only (K − 1) data registers, these registers are excluded from the hardware cost estimate, as their complexity is very small compared with the multiplier and adder complexity.

As shown in Fig. 2(b), K pairs of row- and column-processors (with the K − 2 overlapping frames shown in grey) perform the necessary computation on K frames to avoid the temporal-memory (of size KMN) of level-1 DWT. Similarly, from Fig. 3 we find that (4K − 2) RPs, (4K − 2) CPs and (K + 1) TPs perform the necessary computation on the overlapped GOF of (3K − 2) frames to avoid the temporal-memory of level-1 and level-2. The row-column processing of decomposition level-1 is scheduled in such a way that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition. Using the overlapped grouping of frames and this row-column processing scheme, a memory space of KN(5M − 2K + 6)/4 words for 2-level DWT can be avoided at the cost of 4(K − 2) RPs, 4(K − 2) CPs and (K − 2) TPs. To study the efficiency of the proposed frame-overlapping computing scheme, we have estimated the memory (in words) that could be saved and the extra hardware cost (in terms of multipliers and adders) involved to compute the row and column transformation of the overlapping frames. We have assumed the hardware complexity of each RP, CP and TP to be 2K multipliers and 2K adders, where K is the filter order. The memory that can be saved by using the proposed scheme over the conventional method (frame-by-frame, without frame overlapping) and the extra hardware cost involved for different wavelet filters are listed in Table IV.
TABLE V
COMPARISON OF SAVING-TO-OVERHEAD RATIO (SOR) INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING OVERLAPPING GROUPS OF FRAMES, FOR DIFFERENT FRAME-SIZES

Filter | DWT level | SOR (176 × 144) | SOR (640 × 480) | SOR (1920 × 1080)
Haar | 1 | ∞ | ∞ | ∞
Haar | 2 | ∞ | ∞ | ∞
Daub-4 | 1 | 127.5 | 1540.2 | 10385.6
Daub-4 | 2 | 33.6 | 408 | 2755.8
Daub-6 | 1 | 63.7 | 770 | 5192.4
Daub-6 | 2 | 16.73 | 203.8 | 1377.3
Daub-8 | 1 | 42.5 | 513.3 | 3461.6
Daub-8 | 2 | 11.3 | 138.2 | 935.2
Daub-10 | 1 | 31.8 | 385 | 2596.2
Daub-10 | 2 | 8.8 | 107.9 | 730.7
5/3 | 1 | 151.3 | 1826.7 | 12317.5
5/3 | 2 | 39.9 | 486 | 3284
9/7 | 1 | 62.7 | 758 | 5111.6
9/7 | 2 | 16.45 | 201.4 | 1363.4

SOR is defined as the ratio of the saving in memory to the combinational overhead cost, where the combinational overhead cost is the sum of the multipliers and adders required to process the overlapping frames. Both the memory saving and the hardware cost are measured in terms of transistor counts. For the Haar wavelet the overhead is zero, so the SOR is unbounded.
Since the frame-size of a practical video could be as large as 1920 × 1080 (the screen size of HDTV), a significant amount of chip area could be saved by eliminating the transposition- and temporal-memory of the 3-D structure using the proposed scheme. It can be found from Table IV that the amount of memory saving offered by the proposed scheme for two-level DWT is nearly 25% more than that for 1-level DWT, but the 2-level DWT involves nearly 4.5 times the extra hardware cost of the 1-level DWT. This is mainly due to the reduction of the frame-size by a factor of 4 after every decomposition level, while the number of overlapping frames required to avoid the temporal-memory increases steadily by (2K − 4) frames for every higher level. The size of the group of frames also increases by 2^(j−1)K after every higher level of DWT, where j is the DWT level. To measure the memory saving and the combinational overhead of the proposed scheme, we have defined the term saving-to-overhead ratio (SOR). Since the complexities of memory and arithmetic components are widely different, we have estimated the memory saving and the combinational overhead cost in terms of transistor counts; the SOR is defined as the ratio of the memory saved to the combinational overhead cost. For the 3-D DWT structure, we have assumed that the input pixels are 8-bit and that all the intermediate and final output signals are 12-bit. A few of the multipliers of the 1-level and 2-level structures are of 8-bit size, while all other components are of 12-bit size. Out of the 4K multipliers of level-1, 2K multipliers are of 8-bit size.
Similarly, out of the 18K multipliers of the 2-level structure, 6K are of 8-bit size. The transistor counts of an 8-bit multiplier, a 12-bit multiplier, a 12-bit adder and a 12-bit SRAM word are taken to be 1178, 1674, 372 and 72 transistors, respectively. Using these values, we have estimated the SOR for different wavelet filters and frame-sizes; the values are listed in Table V. It can be found from Table V that the SOR is maximum for the Haar wavelet (K = 2), since in this case the entire transposition- and temporal-memory could be eliminated from the 3-D structure without any redundant computation. For higher frame-sizes and low-order filters like Daub-4 and 5/3, the SOR is significantly higher. We find that the SOR of the 2-level DWT is, on average, nearly 73% less than that of the 1-level DWT across the different wavelet filters and frame-sizes. Keeping these facts in mind, we outline the proposed approach to derive a memory-efficient hardware structure for the multilevel running 3-D DWT (a short numerical check of the SOR estimate is given after the following points):

Low-order wavelet filters like Haar, Daub-4, Daub-6 and 5/3 should be preferred for 3-D DWT if they meet the desired SNR specification of the target application.

The proposed frame-overlapping processing scheme should be applied to eliminate the temporal-memory of level-1 only, to get the maximum advantage of the scheme.

Computation of the higher DWT levels may be partitioned and appropriately scheduled to utilize the resources effectively.
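As a sanity check of the SOR figures, the following back-of-envelope sketch reproduces the Daub-4, level-1 entry of Table V for QCIF frames using only the transistor counts quoted above; the split of the overhead multipliers into equal 8-bit and 12-bit halves is my reading of the "2K out of 4K" statement.

```python
K, M, N = 4, 144, 176                          # Daub-4, QCIF (N = width, M = height)
T_MUL8, T_MUL12, T_ADD12, T_WORD12 = 1178, 1674, 372, 72

mem_saved = K * N * (M + 1) * T_WORD12         # transposition (KN) + temporal (KMN)

ovf   = K - 2                                  # overlapping frames at level 1
mults = adders = ovf * 2 * (2 * K)             # one extra RP + CP pair per OVF -> 32 each
overhead = (mults // 2) * T_MUL8 + (mults // 2) * T_MUL12 + adders * T_ADD12

print(round(mem_saved / overhead, 1))          # -> 127.7, close to 127.5 in Table V
```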

Although the hardware cost of the redundant computation is marginal (less than 2% of the memory saved) if it is applied to level-1 only and wavelet filters such as Daub-4, Daub-6, 5/3 or 9/7 are used, this cost could be reduced further by implementing the multipliers using some low-complexity design method. In this work, we have considered the Daubechies 4-tap (Daub-4) wavelet filter as an example to derive the proposed structure. However, similar structures could be derived for other wavelet filters as well. We also suggest an efficient design for the implementation of the Daubechies wavelet filters for K = 4.

III. PROPOSED ARCHITECTURE FOR 3-LEVEL 3-D DWT

The proposed structure for the implementation of 3-level 3-D DWT is shown in Fig. 4. It consists of three processing units (PUs). PU-1 performs the computation of the first-level DWT, while PU-2 computes only the row and column DWT of the second level. PU-3 computes the temporal DWT of the second level and the entire computation of the third level in time-multiplexed form. PU-1 receives four input blocks of four successive parallel frames in every cycle from the input buffer. The input blocks
are fed to the structure in the order shown in Fig. 5. As shown in Fig. 5, each input block contains 6 consecutive samples of a particular row. The input block I(m1, m2, n3) corresponds to the m2-th row of the m1-th frame and contains the samples {x(m1, m2, 4n3 + 5), x(m1, m2, 4n3 + 4), x(m1, m2, 4n3 + 3), x(m1, m2, 4n3 + 2), x(m1, m2, 4n3 + 1), x(m1, m2, 4n3)}, for 0 ≤ m2 ≤ M − 1, 0 ≤ n3 ≤ (N/4) − 1 and m1 = 0, 1, 2, 3, .... Adjacent input blocks of a particular row overlap by 2 samples. Suppose that in the first cycle the first input block of the first row of a frame is fed; then, during the second cycle, the first input block of the second row is fed to the structure, such that the first input blocks of all the M rows of a particular frame are fed in M cycles, and in the next set of M cycles the second input blocks of all the M rows are fed. The entire set of MN/4 input blocks of a particular frame is thus fed to the structure in MN/4 cycles. Input blocks of four successive parallel frames are fed to the structure in MN/4 cycles in parallel. The successive groups of frames overlap by 2 frames: during the first set of MN/4 cycles, input blocks of the first GOF (F1, F2, F3, F4) are fed to the structure, and in the next set of MN/4 cycles, input blocks of the GOF (F3, F4, F5, F6) are fed. In this manner, the input blocks of an infinite sequence of GOFs are fed to PU-1 continuously to compute the first-level 3-D DWT.
Fig. 4. Proposed structure for the computation of 3-level 3-D DWT (PU-1, PU-2 and PU-3 fed with input blocks from the frame buffer at 6 samples/cycle). z^1_lh, z^1_hl and z^1_hh, respectively, represent (z^1_lhl, z^1_lhh), (z^1_hll, z^1_hlh) and (z^1_hhl, z^1_hhh). v^2_l and v^2_h, respectively, represent (v^2_ll, v^2_hl) and (v^2_lh, v^2_hh). Output represents (z^2_llh, z^2_lhl, z^2_lhh), (z^2_hll, z^2_hlh, z^2_hhl, z^2_hhh) or (z^3_lll, z^3_llh, z^3_lhl, z^3_lhh), (z^3_hll, z^3_hlh, z^3_hhl, z^3_hhh).

Fig. 5. Data input format of the proposed structure. Grey boxes represent the overlap area of adjacent blocks, while the overlapping frames are shown in violet.
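The schedule of Fig. 5 can be summarized by the following sketch (my reading of the figure, not the authors' address generator): in cycle c of a group, row m2 = c mod M of block index n3 = c div M is fetched as 6 samples x[f, m2, 4n3 : 4n3+6], so adjacent blocks of a row overlap by 2 samples; four frames are fed in parallel and successive four-frame groups overlap by 2 frames. Right-boundary handling of the last block of a row is ignored here.

```python
def pu1_schedule(M, N, first_frame):
    """Yield, per cycle, the four (frame, row, column-range) fetches of one GOF."""
    frames = [first_frame + i for i in range(4)]        # e.g. (F1..F4), then (F3..F6)
    for c in range(M * N // 4):                         # MN/4 cycles per group
        m2, n3 = c % M, c // M
        cols = (4 * n3, 4 * n3 + 6)                     # 6 samples, 2-sample overlap
        yield [(f, m2, cols) for f in frames]

for start in (1, 3):                                    # GOF-1 and GOF-2 overlap by 2 frames
    print(next(pu1_schedule(4, 8, start)))
```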

Fig. 6. Structure of PU-1 (four processing elements PE 1 to PE 4 followed by four subcell-1 units). Intermediate outputs v^1_l and v^1_h, respectively, represent (v^1_ll, v^1_hl) and (v^1_lh, v^1_hh). Similarly, outputs z^1_ll, z^1_lh, z^1_hl and z^1_hh, respectively, represent (z^1_lll, z^1_hll), (z^1_llh, z^1_hlh), (z^1_lhl, z^1_hhl) and (z^1_lhh, z^1_hhh).

Fig. 7. Structure of the processing element (PE), comprising first and second subcell-1 units followed by first and second subcell-2 units. Outputs v^1_l(m1, m2, 2n3) and v^1_h(m1, m2, 2n3), respectively, represent (v^1_ll(m1, m2, 2n3), v^1_hl(m1, m2, 2n3)) and (v^1_lh(m1, m2, 2n3), v^1_hh(m1, m2, 2n3)), where 0 ≤ m2 ≤ M − 1, 0 ≤ n2 ≤ M/2 − 1 and 0 ≤ n3 ≤ N/4 − 1.

The structure of PU-1 is shown in Fig. 6. It is comprised of four identical processing elements (PEs) and four subcells. In every cycle, each PE receives one input block from a particular frame, such that the four PEs receive four input blocks from four successive frames. Each PE computes the 2-D (row and column) DWT pertaining to its input frame and calculates components of the four subbands (v^1_ll, v^1_lh, v^1_hl, v^1_hh). The structure of the PE is shown in Fig. 7. Each PE is comprised of one pair of subcell-1 units and one pair of subcell-2 units. Subcell-1 performs the 1-D DWT (low-pass and high-pass filtering) along the row direction and calculates a pair of intermediate coefficients (ul, uh) in every cycle. The structure of subcell-1 is identical to the structure of subcell-1 of [16] (see Fig. 3 of [16]). In every cycle, the (i + 1)-th PE of PU-1 receives the input block I(2n1 − i, m2, n3) from the (2n1 − i)-th input frame. From each input block I(2n1 − i, m2, n3), the second subcell-1 receives the first four samples
{x(2n1 − i, m2, 4n3 + 3), x(2n1 − i, m2, 4n3 + 2), x(2n1 − i, m2, 4n3 + 1), x(2n1 − i, m2, 4n3)} and computes a pair of intermediate coefficients ul(2n1 − i, m2, 2n3 − 1) and uh(2n1 − i, m2, 2n3 − 1). During the same period, the first subcell-1 receives the last four samples {x(2n1 − i, m2, 4n3 + 5), x(2n1 − i, m2, 4n3 + 4), x(2n1 − i, m2, 4n3 + 3), x(2n1 − i, m2, 4n3 + 2)} of the input block and computes the intermediate coefficients ul(2n1 − i, m2, 2n3) and uh(2n1 − i, m2, 2n3). Note that the successive output samples of subcell-1 belong to the same column, and these intermediate coefficients can be processed directly by subcell-2 for the column DWT. In each cycle, subcell-2 receives a pair of intermediate coefficients from the corresponding subcell-1 and computes the column DWT in time-multiplexed form to take advantage of the down-sampled filter computation. The structure of subcell-2 is similar to the structure of subcell-2 of [16] (see Fig. 4 and Fig. 5 of [16]), except that each shift-register (SR) in this case is replaced with a register (R).

After a latency of 3 cycles, each subcell-2 produces a pair of subband components (v^1_ll/v^1_hl) and (v^1_lh/v^1_hh) in each cycle, such that if it produces one component each of the subbands v^1_ll and v^1_lh during the even-numbered cycles, then during the odd-numbered cycles it produces components of the other two subbands (v^1_hl and v^1_hh). Subcell-1 and subcell-2 work in separate pipeline stages and perform their DWT computations concurrently. Each PE calculates the DWT components of a pair of columns of each of the four subbands of a given frame in M cycles, where the components of the subbands (v^1_ll, v^1_lh) and (v^1_hl, v^1_hh) are obtained in time-multiplexed form. The (i + 1)-th PE, therefore, completes the first-level decomposition of the (2n1 − i)-th frame of size (M × N) in MN/4 cycles with an initial latency of 3 cycles.

The adjacent PEs of PU-1 generate the DWT components corresponding to two successive frames. The DWT components of four successive frames are obtained from the four PEs, such that two columns of the DWT components of a pair of subbands (v^1_ll, v^1_lh) or (v^1_hl, v^1_hh) of four successive parallel frames are obtained from the four PEs. Down-sampled filter computations are performed on each of the subband coefficients generated by the PEs for the temporal (inter-frame) DWT. The temporal DWT computations can be performed using subcell-1. PU-1, therefore, uses four subcell-1 units (see Fig. 6) to calculate the temporal DWT of the columns of the subbands (v^1_ll, v^1_lh) or (v^1_hl, v^1_hh) of four successive frames concurrently. Out of these four subcells, the first and third subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands (v^1_ll or v^1_hl), while the second and the fourth subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands (v^1_lh or v^1_hh)

in time-multiplexed form. Each subcell-1 calculates a pair of components corresponding to two subbands of the 3-D transform in every cycle, such that in every cycle a pair of components of two adjacent columns of the four subbands (z^1_lll, z^1_llh, z^1_lhl, z^1_lhh) or (z^1_hll, z^1_hlh, z^1_hhl, z^1_hhh) is obtained from the four such subcells. A pair of components of two adjacent columns of all the eight subbands of 1-level 3-D DWT is obtained in a couple of cycles. Two columns of each of the eight subbands are obtained from PU-1 in M cycles, and the entire coefficient matrix of the 1-level 3-D DWT of input frames of size (M × N) can be obtained in MN/4 cycles with an initial latency of 4 cycles.
Fig. 8. Structure of PU-2: an input-delay-unit (IDU) with two shift-registers (SR 1, SR 2), registers and multiplexers, feeding one subcell-1. Output-1 represents (v^2_ll(n1, m2, n3) or v^2_hl(n1, m2, n3)) and output-2 represents (v^2_lh(n1, m2, n3) or v^2_hh(n1, m2, n3)), where 0 ≤ m2 ≤ (M/4) − 1, 0 ≤ n3 ≤ (N/4) − 1 and 0 ≤ n2 ≤ (M/2) − 1.

Components of the subband z^1_lll are sent to PU-2 to calculate the DWT components of the second level. PU-2 receives a pair of components from PU-1, corresponding to a pair of adjacent columns of z^1_lll, in every cycle after a gap of one cycle. The structure of PU-2 is shown in Fig. 8. It consists of one input-delay-unit (IDU) and one subcell-1. Subcell-1 in this case performs the row and column computations pertaining to the 2-level DWT in time-multiplexed form. The components of z^1_lll of a particular frame are fed to subcell-1 through the IDU of PU-2 block-by-block, similar to the 1-level processing. The input block I(n1, n2, n3) in this case contains the four consecutive samples {z^1_lll(n1, n2, 2n3 + 3), z^1_lll(n1, n2, 2n3 + 2), z^1_lll(n1, n2, 2n3 + 1), z^1_lll(n1, n2, 2n3)} for 0 ≤ n2 ≤ M/2 − 1 and 0 ≤ n3 ≤ N/4 − 1. Input blocks are fed to subcell-1 (see Fig. 8) column-wise after a gap of one cycle, such that one column of input blocks of a particular frame of z^1_lll is fed to subcell-1 in M cycles, and the input blocks of one complete frame in MN/4 cycles. One column of input blocks is derived from four successive columns of z^1_lll. Since PU-2 receives the components of only two adjacent columns of z^1_lll from PU-1, two previous columns of z^1_lll are required to be stored to derive the required input blocks.


The IDU, therefore, contains two shift-registers (SRs) of size M/2 words each. The SRs also help to perform the down-sampled filter computation along the row direction. A pair of DWT components u^2_l(n1, n2, n3) and u^2_h(n1, n2, n3) is obtained from the subcell in every alternate cycle. Note that the successive output samples u^2_l(n1, n2, n3) and u^2_h(n1, n2, n3) correspond to successive columns of the intermediate coefficient matrices [u^2_l] and [u^2_h]. The column DWT can be performed on the components of [u^2_l] and [u^2_h] immediately in the next cycle. Since subcell-1 of PU-2 receives the input blocks of z^1_lll only during alternate cycles, it remains idle for one cycle after every input cycle of I(n1, n2, n3). The idle cycles of subcell-1 can be utilized by assigning to them the down-sampled filter computation of [u^2_l] and [u^2_h] in time-multiplexed form. The samples of u^2_l and u^2_h are passed through separate delay-paths to provide the column delay necessary for the filter computation. All the registers and shift-registers of the IDU are clocked by CLK2, whose frequency is half that of CLK1 used by PU-1. The four multiplexors (MUX) of the IDU select the delayed samples of u^2_l and u^2_h alternately and feed them to the subcell during its idle cycles. A pair of DWT components of the two subbands (v^2_ll, v^2_lh) or (v^2_hl, v^2_hh) of a particular frame of z^1_lll is obtained from PU-2 after every couple of cycles, and one component each of the four subbands in every four cycles. One column of each of the four subbands is obtained in M cycles, and the subband components of a complete frame in MN/4 cycles. The components of the subbands (v^2_ll, v^2_lh) or (v^2_hl, v^2_hh) are sent to PU-3 for the inter-frame DWT computation of the 2-level decomposition.

Fig. 9. Structure of frame-buffer-1 (memory-blocks MB 1 to MB 7 with four 2-to-1 MUXs, addressed by Addr_1, clocked by CLK_1 and controlled by sel_3).

To calculate the temporal DWT of the second level, the subbands of 3 successive frames are stored in frame-buffer-1. The structure of frame-buffer-1 is shown in Fig. 9. It consists of 7 memory-blocks (MBs) and four 2-to-1 line MUXs. Each MB is of size MN/8 words. Components of a pair of subbands (v^2_ll, v^2_hl) or (v^2_lh, v^2_hh) corresponding to a particular frame are stored in alternate MBs, such that the components of (v^2_ll, v^2_hl) are stored in the even-numbered MBs and those of (v^2_lh, v^2_hh) are stored in the odd-numbered MBs. One extra MB is used to store one extra frame of (v^2_lh, v^2_hh) to provide one complete frame-delay with respect to the subband components of (v^2_ll, v^2_hl).


Fig. 10. Structure of PU-3: frame-buffer-1, frame-buffer-2, a register and shift-register delay-path (SR 1 to SR 7), multiplexer groups MUX1 to MUX4 (controlled by sel_1 to sel_4), one subcell-1 and a DMUX array. Inputs v^2_l and v^2_h, respectively, represent (v^2_ll, v^2_hl) and (v^2_lh, v^2_hh). Intermediate results v^3_l and v^3_h, respectively, represent (v^3_ll, v^3_hl) and (v^3_lh, v^3_hh). Outputs z^2_l and z^2_h, respectively, represent (z^2_hll, z^2_lhl, z^2_hhl) and (z^2_llh, z^2_lhh, z^2_hlh, z^2_hhh). Similarly, outputs z^3_l and z^3_h, respectively, represent (z^3_lll, z^3_hll, z^3_lhl, z^3_hhl) and (z^3_llh, z^3_lhh, z^3_hlh, z^3_hhh).

TABLE VI
TIMING SCHEDULE FOR MULTIPLEXING DWT COMPUTATION IN THE SUBCELL OF PU-3

DWT | SB | clock cycles
2-level temporal | v^2_ll, v^2_hl | (2 m1 ∆ + 2n + 1 + δ1), (2 m1 ∆ + 2n + 3 + δ1)
2-level temporal | v^2_lh, v^2_hh | ((2 m1 + 1) ∆ + 2n + 1 + δ1), ((2 m1 + 1) ∆ + 2n + 3 + δ1)
3-level row | z^2_lll | (2 m1 ∆ + 4n + 2 + δ2), ((2 m1 + 1) ∆ + 4n + 2 + δ2)
3-level column | u^3_l, u^3_h | (2 m1 ∆ + 2M m2 + 8n + 4 + δ3), (2 m1 ∆ + (2 m2 + 1) M + 8n + 4 + δ3)
3-level temporal | v^3_ll, v^3_hl | (4 m1 ∆ + 2M m2 + 8n + 6 + δ4), (4 m1 ∆ + M (2 m2 + 1) + 8n + 6 + δ4)
3-level temporal | v^3_lh, v^3_hh | ((4 m1 + 2) ∆ + 2M m2 + 8n + 6 + δ4), ((4 m1 + 2) ∆ + M (2 m2 + 1) + 8n + 6 + δ4)

LEGEND: SB: subband; ∆ = MN/8; n = 0, 1, 2, ..., M − 1; m2 = 0, 1, 2, ..., (N/8) − 1; m1 = 0, 1, 2, 3, ...; δ1 = 3MN/8 cycles delay to fill frame-buffer-1; δ2 = 12 cycles delay to fill the delay-path of z^2_lll; δ3 = 3M cycles delay to fill the shift-registers corresponding to u^3_l or u^3_h; δ4 = 6MN/8 cycles delay to fill frame-buffer-2.

Four MUXs of frame-buffer-1 select the frames of (v^2_ll, v^2_hl) from the even-numbered MBs and the current frame during each even-numbered set of MN/8 cycles, while they select the frames of (v^2_lh, v^2_hh) from the odd-numbered MBs during each odd-numbered set of MN/8 cycles. The subcell of PU-3 (shown in Fig. 10) receives a block of 4 samples from frame-buffer-1 through the MUXs during every alternate cycle, and calculates the inter-frame DWT of (v^2_ll, v^2_hl) and (v^2_lh, v^2_hh) in alternate periods of MN/8 cycles. The structure of this subcell is identical to the structure of subcell-1 of PU-1. Components of the four subbands (z^2_lll, z^2_llh) and (z^2_hll, z^2_hlh) are obtained in time-multiplexed form during the even-numbered sets of MN/8 cycles.
TABLE VII
INPUT-OUTPUT DATA FLOW OF THE SUBCELL OF PU-3

[Cycle-by-cycle listing, for the first few sets of ∆ = MN/8 cycles, of the input samples (v^2_ll, v^2_hl), (v^2_lh, v^2_hh) and z^2_lll fed to the subcell, and of the corresponding output pairs (z^2_lll, z^2_llh), (z^2_hll, z^2_hlh), (z^2_lhl, z^2_lhh), (z^2_hhl, z^2_hhh) and (u^3_l, u^3_h), following the schedule of Table VI.]

∆ = MN/8. Only the input sample corresponding to the filter output is shown in the input column. The clock cycles involved in filling the delay registers, shift-registers and memory-blocks are not counted.

Similarly, during the odd-numbered sets of MN/8 cycles, components of the other four subbands (z^2_lhl, z^2_lhh) and (z^2_hhl, z^2_hhh) are obtained in time-multiplexed form. One component of z^2_lll is obtained from the subcell-1 (of PU-3) after every 4 cycles, and the successive components belong to a column. Successive columns of z^2_lll are obtained from subcell-1 during alternate periods of MN/8 cycles. The subband z^2_lll is further transformed to generate the DWT coefficients of level-3.

Since subcell-1 of PU-3 receives the components of (v^2_ll, v^2_hl) or (v^2_lh, v^2_hh) during alternate cycles, it remains idle for one cycle after every input cycle of (v^2_ll, v^2_hl) or (v^2_lh, v^2_hh). The DWT of z^2_lll can be computed by the subcell during these idle cycles without any data overlap, since the amount of computation required to process z^2_lll is (3/8)-th of the amount of the temporal DWT of the second level. By computing the temporal DWT of the second level alone, the hardware utilization of the subcell would be only 50%; the processing of z^2_lll can, therefore, be time-multiplexed with the second-level computation without any data overlap, and the DWT computation of z^2_lll is scheduled in the idle cycles of subcell-1 of PU-3. The processing of the intermediate coefficients u^3_l and u^3_h is time-multiplexed column-wise to take advantage of the down-sampling. Similarly, the temporal DWT of the pairs of subbands (v^3_ll, v^3_hl) and (v^3_lh, v^3_hh) is time-multiplexed to take advantage of the down-sampling along the temporal direction. The schedule for multiplexing the computation of the third-level DWT and the second-level temporal DWT in subcell-1 is given in Table VI. The input-output data-flow of subcell-1 of Fig. 10 is derived for a few cycles using the schedule of Table VI and shown in

Table VII. The registers of Fig. 10 provide the required delay for the column-wise processing, while the shift-registers provide the necessary delay for the row-wise processing of the intermediate coefficients u^3_l and u^3_h. The extra shift-register provides one additional row-delay for time-multiplexing the processing of u^3_l and u^3_h. Similarly, frame-buffer-2 provides the necessary frame-delay for the multiplexed computation of the temporal DWT of (v^3_ll, v^3_hl) and (v^3_lh, v^3_hh). The structure of frame-buffer-2 is similar to the structure of frame-buffer-1 (see Fig. 9), except that in this case each MB is of size MN/32 words. Each shift-register of PU-3 is of size M/8 words (equal to half the frame height of z^2_lll) and is clocked by CLK4, which is 8 times slower than CLK1. The registers of the delay-path are clocked by a separate clock CLK3, which is 4 times slower than CLK1. PU-3 uses separate multiplexors for multiplexing the computations. The four MUX1s multiplex the computation of u^3_l and u^3_h, while the four MUX2s multiplex the row and column processing of z^2_lll. Similarly, the four MUX3s multiplex the temporal DWT computation with the row and column processing of z^2_lll, and the four MUX4s multiplex the computation of z^2_lll with the temporal DWT computation of the second level. Each PU works in a separate pipeline stage, and the PUs compute the multilevel 3-D DWT concurrently. The proposed structure can compute the 3-level running DWT of a video stream of frame size (M × N) and frame rate R in MNR/8 cycles, with an initial latency of (11 + 2M + δ1 + δ2 + δ3 + δ4) cycles, where a delay of (6 + 2M) cycles is introduced to fill the registers and shift-registers of PU-2, and a delay of (δ1 + δ2 + δ3 + δ4) cycles is introduced to fill the MBs of frame-buffer-1 and the registers, shift-registers and MBs of frame-buffer-2 of PU-3.

IV. IMPLEMENTATION OF SUBCELLS

To have a reduced-hardware structure, subcell-1 and subcell-2 of the PUs can be implemented by multiple constant multiplication (MCM) methods using the CSD representation of the filter coefficients [19], or by memory-based techniques for multiplication using look-up tables and adders [20]. Apart from that, the interrelations and symmetries between the coefficients of the wavelet filter bases can be utilized to derive efficient structures for the subcells. We discuss here an optimal area-time efficient implementation of the subcells for the Daubechies wavelet filters for K = 4. The transfer functions of the low-pass and high-pass filters corresponding to the Daubechies 4-tap wavelet transform can be expressed as [21]:

H(z) = a + b z^-1 + c z^-2 + d z^-3        (1a)
G(z) = d − c z^-1 + b z^-2 − a z^-3        (1b)

where a = (1 + √3)/(4√2), b = (3 + √3)/(4√2), c = (3 − √3)/(4√2), and d = (1 − √3)/(4√2).
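With the 4√2 normalization restored, the coefficients in (1) are the standard Daubechies D4 taps; the following short check (standard wavelet facts, not a result from the paper) confirms unit energy, a low-pass DC gain of √2, and orthogonality of the two filters.

```python
import numpy as np

r3, norm = np.sqrt(3.0), 4 * np.sqrt(2.0)
h = np.array([1 + r3, 3 + r3, 3 - r3, 1 - r3]) / norm    # H(z) taps a, b, c, d
g = np.array([h[3], -h[2], h[1], -h[0]])                 # G(z) taps d, -c, b, -a

print(np.round(h, 4))                                    # [ 0.483  0.8365  0.2241 -0.1294]
print(np.isclose(h.sum(), np.sqrt(2)),                   # low-pass DC gain sqrt(2)
      np.isclose((h ** 2).sum(), 1.0),                   # unit energy
      np.isclose(h @ g, 0.0))                            # low/high-pass orthogonality
```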

TABLE VIII
COMPARISON OF HARDWARE- AND TIME-COMPLEXITIES OF THE PROPOSED STRUCTURE AND THE EXISTING STRUCTURES FOR 3-LEVEL 3-D DWT USING THE DAUBECHIES 4-TAP WAVELET FILTER. M: IMAGE HEIGHT, N: IMAGE WIDTH, R: FRAME-RATE

Structure | MULT | ADD | REG | shift-register | frame-buffer | cycle period | ACT | Latency
Weeks et al. [8] (3DW-I) | 24 | 18 | 24 | 2MN + 2MNR | MNR/8 | TM + TA | (4/7)MNRx | O(MNR)
Weeks et al. [8] (3DW-II) | 8 | 6 | 110 | 0 | MNR | TM + TA | 4MNRx | O(MNR)
Das et al. [13] | 24 | 18 | 8 | 2(2M + 1)N | MNR/8 | TM + TA | (4/7)MNRx | O(MNR)
Dai et al. [15] | 96 | 72 | 32 | 4(N + 2)R | MNR/8 | TM + TA | (1/7)MNRx | O(MNR)
Mohanty et al. [16] | (219/16)Q | (657/64)Q | 5.25N | 147MN/32 | 5MN/32 | TM + 2TA | MNR/Q | O(MN)
Proposed structure | 44 | 276 | 239 | 15M/8 | 35MN/32 | max(TM, 2TA) | MNR/8 | O(MN)

Legend: MULT: multipliers; ADD: adders; REG: data/pipeline registers; shift-register and frame-buffer sizes are in words; ACT is in cycles; x = 511/512; Q: input block-size; TM and TA denote the delays of one multiplication and one addition, respectively.

Ignoring the fixed factor (4√2) in the denominators of the filter coefficients, the low-pass and high-pass outputs corresponding to the input sequence (x0, x1, x2, x3) may otherwise be expressed in the alternative form given by (2):

ul = (p1 + 2p2 + p3) + √3 (p1 − p3)        (2a)
uh = (q1 + 2q2 + q3) + √3 (q3 − q1)        (2b)

where p1 = (x0 + x1); p2 = (x1 + x2); p3 = (x2 + x3); q1 = (x0 − x1); q2 = (x2 − x1); and q3 = (x2 − x3).
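A quick numerical check (my own sketch, not from the paper's verification flow, with the input ordering assumed as in (2)) that the factored form matches direct filtering with the taps of (1) while needing only a single constant multiplication by √3 per output, i.e., two multipliers for the (ul, uh) pair:

```python
import numpy as np

r3 = np.sqrt(3.0)
a, b, c, d = 1 + r3, 3 + r3, 3 - r3, 1 - r3      # common factor 4*sqrt(2) dropped

x0, x1, x2, x3 = np.random.rand(4)

# direct evaluation with the taps of (1)
ul_direct = a*x0 + b*x1 + c*x2 + d*x3
uh_direct = d*x0 - c*x1 + b*x2 - a*x3

# factored form (2): only additions plus one multiplication by sqrt(3) per output
p1, p2, p3 = x0 + x1, x1 + x2, x2 + x3
q1, q2, q3 = x0 - x1, x2 - x1, x2 - x3
ul = (p1 + 2*p2 + p3) + r3 * (p1 - p3)
uh = (q1 + 2*q2 + q3) + r3 * (q3 - q1)

print(np.isclose(ul, ul_direct), np.isclose(uh, uh_direct))   # True True
```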

Unlike subcell-1, subcell-2 performs down-sampled filter computation on two signals (ul(n) and uh(n)) in time-multiplexed form. The low-pass and high-pass filter outputs corresponding to (ul(n), uh(n)) in this case can otherwise be expressed in the alternative form:

vl = (2 + r) s(n) + (4 + r) t(n−1) + (2 − r) s(n−2) − r t(n−3)        (3a)
vh = r t(n) + (r − 2) s(n−1) + (r + 4) t(n−2) − (2 + r) s(n−3)        (3b)

where r = √3 − 1, and s(n) and t(n), respectively, represent the outputs of the LC (see Fig. 5 of [16]) for the inputs X1 = ul(n) and X2 = uh(n). vl represents vll and vhl when s(n) equals ul and uh, respectively. Similarly, vh represents vhh and vlh when t(n) equals uh and ul, respectively. Equation (3) may further be expressed in the z-domain as:

Vl(z) = S(z) (2 + r + (2 − r) z^-2) + T(z) ((4 + r) z^-1 − r z^-3)        (4a)
Vh(z) = T(z) (r + (r + 4) z^-2) + S(z) ((r − 2) z^-1 − (2 + r) z^-3)        (4b)

where S(z) and T(z) represent the z-transforms of s(n) and t(n), respectively.

Each of the pairs of filter outputs given by (2) and (4) involves only two multiplications. Using (2) and (4), the subcells can, therefore, be implemented as fully-pipelined structures. The proposed structures of the subcells are shown in Fig. 11 and Fig. 12. The structure of subcell-1, shown in Fig. 11, computes a pair of filter outputs in every cycle according to (2). It consists of seven adder units (AUs) and two multipliers. Each AU performs a pair of additions. AU-1, AU-2 and AU-3 compute (p1, q1), (p2, q2) and (p3, q3), while AU-4 and AU-5 compute (2p2 + p3, 2q2 + q3) and (p1 − p3, q3 − q1), respectively. AU-6 computes (p1 + 2p2 + p3, q1 + 2q2 + q3). The entire computation of subcell-1 is performed in three pipeline stages. A pair of filter outputs is given out by AU-7 after a latency of 2 cycles, where the duration of a cycle period is T = max(TM, 2TA), with TM and TA being, respectively, the times required for one multiplication and one addition in the subcells. The structure of subcell-2 has two multipliers, three shifters and 10 adders. It yields a pair of filter outputs in every cycle period, after an initial latency of 4 cycles.
Fig. 11. Structure of subcell-1 for the Daubechies 4-tap wavelet filter (adder units AU-1 to AU-7 and two constant multipliers by √3).

Fig. 12. Structure and function of subcell-2 for the Daubechies 4-tap wavelet filter (two constant multipliers by r, three shifters and the adders implementing (3)).
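The two-multiplier organisation of subcell-2 can be illustrated as follows: r·s(n) and r·t(n) are computed once per sample and reused, through delays, for the z^-1, z^-2 and z^-3 terms of (3), so the remaining work reduces to shifts and additions. The sketch below is illustrative only; it uses arbitrary sequences in place of the LC outputs s(n), t(n) of [16] and my reconstruction of (3).

```python
import numpy as np

r = np.sqrt(3.0) - 1.0

def subcell2_like(s, t):
    rs = [r * v for v in s]          # the only two multiplications per sample
    rt = [r * v for v in t]
    vl, vh = [], []
    for n in range(3, len(s)):
        # multiplications by 2 and 4 correspond to shifts in hardware
        vl.append((2*s[n] + rs[n]) + (4*t[n-1] + rt[n-1])
                  + (2*s[n-2] - rs[n-2]) - rt[n-3])
        vh.append(rt[n] + (rs[n-1] - 2*s[n-1])
                  + (rt[n-2] + 4*t[n-2]) - (2*s[n-3] + rs[n-3]))
    return vl, vh

s = list(np.random.rand(8)); t = list(np.random.rand(8))
vl, vh = subcell2_like(s, t)

# cross-check the first outputs against a direct evaluation of (3a)/(3b) at n = 3
n = 3
assert np.isclose(vl[0], (2+r)*s[n] + (4+r)*t[n-1] + (2-r)*s[n-2] - r*t[n-3])
assert np.isclose(vh[0], r*t[n] + (r-2)*s[n-1] + (r+4)*t[n-2] - (2+r)*s[n-3])
print("ok")
```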

V. HARDWARE COMPLEXITY AND PERFORMANCE CONSIDERATION

In this section we discuss the details of the hardware and time complexities of the proposed structure and compare those with the existing designs.

A. Hardware Complexity

The proposed structure is comprised of three PUs. PU-1 has four PEs and four subcells (subcell-1), where each PE is comprised of two pairs of subcell-1 and subcell-2. Each subcell-1 has 2 multipliers, 14 adders and 10 pipeline registers, while each subcell-2 requires 2 multipliers, 10 adders and 10 pipeline registers. Each PE requires 8 multipliers, 48 adders and 40 pipeline registers. PU-1, therefore, involves 40 multipliers, 248 adders and 208 pipeline registers. PU-2 is comprised of one subcell and one IDU. The IDU consists of two shift-registers of size M/2 words each, 8 registers, 4 MUXs and 2 DMUXs. PU-2, therefore, involves 2 multipliers, 14 adders, (16 + M) registers, 4 MUXs and 2 DMUXs. PU-3 is comprised of one subcell-1, one frame-buffer-1, one frame-buffer-2, 7 shift-registers of size M/8 words each, 3 registers, 16 2-to-1 line MUXs and one DMUX array. Frame-buffer-1 is comprised of 7 MBs of size MN/8 words each and four 2-to-1 line MUXs. Similarly, frame-buffer-2 is comprised of 7 MBs of size MN/32 words each and four 2-to-1 line MUXs. The DMUX array is comprised of 14 1-to-2 line DMUXs. The proposed structure, therefore, involves 44 multipliers, 276 adders, 239 data/pipeline registers, 15M/8 shift-register words, (35/32)MN frame-buffer words, 28 MUXs and 16 DMUXs.

B. Time Complexity

The proposed structure calculates the DWT coefficients corresponding to four samples of each frame in every cycle, and two frames are transformed in parallel. The average computation time (ACT) to calculate the 3-level DWT of a video stream of frame size (M × N) and frame rate R is MNR/8 cycles. The structure has an initial latency of (23 + 5M + 7MN/4) cycles; out of this, (12 + 5M + 7MN/4) cycles of delay are introduced to fill the registers, shift-registers and MBs of the frame-buffers. The duration of one cycle period is T = max(TM, 2TA), where TM and TA are the times required to perform one multiplication and one addition in a subcell.
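For concreteness, a back-of-envelope check against Table IX (not a synthesis result): for QCIF video at 60 frames/s, the ACT of MNR/8 cycles and the corresponding minimum real-time clock work out as follows.

```python
M, N, R = 144, 176, 60                      # QCIF at 60 fps
act = M * N * R // 8                        # cycles needed per second of video
print(act)                                  # 190080, matching Table IX
print(f"minimum real-time clock ~ {act / 1e6:.2f} MHz")
```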

C. Performance Comparison

The hardware and time complexities of the proposed structure and the existing structures of [8], [13], [15], [16] are listed in Table VIII in terms of cycle period, ACT (see the footnote below) and latency in clock cycles, register, shift-register and frame-buffer sizes in words, along with the number of multipliers and adders, for comparison.


TABLE IX
COMPARISON OF MEMORY AND TIME COMPLEXITY OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15] AND [16] FOR DWT LEVEL J = 3 (memory in words, ACT in cycles)

Structure            Frame-size   FPS   Memory     ACT
Dai et al. [15]      176 x 144    15    56312      54203
                                  30    112592     108405
                                  60    225152     216810
Mohanty et al. [16]  176 x 144    15    121140     23760
                                  30    121140     47520
                                  60    121140     95040
Proposed             176 x 144    15    28289      47520
                                  30    28289      95040
                                  60    28289      190080
Dai et al. [15]      640 x 480    15    604952     657000
                                  30    1609872    1314000
                                  60    2419712    2628000
Mohanty et al. [16]  640 x 480    15    1461720    288000
                                  30    1461720    576000
                                  60    1461720    1152000
Proposed             640 x 480    15    337439     576000
                                  30    337439     1152000
                                  60    337439     2304000

The structures of [8], [13], [15] are of folded type, which compute the multilevel 3-D DWT level by level, while the structure of [16] computes the multilevel 3-D DWT concurrently. As shown in Table VIII, the structure of [15] is the most efficient among the existing folded structures. Compared with [15], the proposed structure involves 2.18 times fewer multipliers, nearly 23/6 times more adders, (32NR/15M) times less on-chip memory (sum of data/pipeline register and shift-register words) and 4R/35 times less frame-buffer. Besides, it has 8/7 times (about 12.3%) less ACT than [15]. Compared with the structure of [16], the proposed one involves 0.311Q times fewer multipliers, (26.88/Q) times more adders, nearly 2.45N times less on-chip memory and 7 times more frame-buffer, and it involves Q/8 times more ACT. The proposed structure has a small cycle period compared to the existing structures. It is interesting to note that the on-chip memory of the proposed structure
1. The ACT is the number of cycles required for the computation of all J levels of the 3-D DWT after the initial latency. The ACT of the structures of [13] and [15] is calculated as the sum of the ACTs of the individual levels, because they compute the different levels of the 3-D DWT sequentially. In the case of the proposed structure and the structure of [16], the ACT is calculated by dividing the total number of 3-D DWT coefficients by the throughput per cycle.


varies with M, while in the case of [15] it varies with NR and in the case of [16] it varies with MN. This results in a significant saving of memory, since the frame-size and frame-rate of video applications are usually very large.

We have estimated the memory complexity and ACT of the proposed structure and the existing structures [15], [16] for some practically used frame-sizes and frame-rates. The memory complexity of a structure represents the sum of the data/pipeline registers and the memory words required by the shift-registers and the frame-buffer. We have assumed an input-block size Q = 16 for [16] and estimated the memory complexity and ACT of [15], [16] and the proposed structure for 3-level DWT. The estimated values are listed in Table IX for comparison. It can be found from Table IX that the memory complexities of the proposed structure and of [16] are independent of the frame-rate, while in the case of [15] the memory complexity increases proportionately with the frame-rate. Compared with [15], the proposed structure for frame-size 176 x 144 and frame-rates 15, 30 and 60 requires, respectively, 1.99 times, 3.98 times and 7.96 times fewer memory words, and involves 12.3% less ACT. It involves 4.28 times fewer memory words than [16] for the same frame-size and frame-rates, but twice the ACT. For frame-size 640 x 480 and frame-rates 15, 30 and 60, the proposed structure involves 1.79 times, 4.77 times and 7.17 times fewer memory words than [15] and calculates the 3-level DWT in 12.3% less time; compared with [16], it involves 4.33 times fewer memory words.
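The headline ratios quoted above follow directly from the Table IX entries; the short sketch below (added for illustration) recomputes them for frame-size 176 x 144 at 60 fps.

# Memory (words) and ACT (cycles) at 176 x 144, 60 fps, taken from Table IX
mem_dai, mem_mohanty, mem_prop = 225152, 121140, 28289
act_dai, act_prop = 216810, 190080

print(round(mem_dai / mem_prop, 2))              # 7.96x fewer memory words than [15]
print(round(mem_mohanty / mem_prop, 2))          # 4.28x fewer memory words than [16]
print(round(100 * (1 - act_prop / act_dai), 1))  # 12.3% less ACT than [15]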

D. Numerical Error Consideration

To validate the proposed design, we have coded it in MATLAB 7.1 and in VHDL for decomposition level-1, for floating-point and fixed-point implementations, respectively. For the fixed-point implementation, we have taken 8-bit pixel values and 12-bit precision for all the intermediate signals. We have used an 11-bit Baugh-Wooley multiplier for the RP and 12-bit multipliers for the CP and TP. We have processed four successive frames of two video sequences, Foreman and Xylophone, to generate all 8 subbands of the 3-D DWT, and estimated the absolute errors of the fixed-point implementation as the difference between the MATLAB simulation and the test-bench results. The average and maximum errors obtained for all the subbands of the Foreman and Xylophone video sequences are shown in Table X. We find that the average error in the LLL subband is 0.49% of its average value in the case of Foreman and 0.87% in the case of Xylophone.
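For reference, the error metrics of Table X can be obtained with a routine of the following form (a minimal sketch of the methodology; the array names lll_matlab and lll_testbench are placeholders, not identifiers from the original test set-up).

import numpy as np

def subband_errors(float_ref, fixed_out):
    """Average and maximum absolute error of one subband, computed as the
    difference between the floating-point (MATLAB) reference and the
    fixed-point (VHDL test-bench) output."""
    err = np.abs(np.asarray(float_ref, dtype=np.float64)
                 - np.asarray(fixed_out, dtype=np.float64))
    return err.mean(), err.max()

# Hypothetical usage for the LLL subband of one group of frames:
# avg_lll, max_lll = subband_errors(lll_matlab, lll_testbench)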

TABLE X
NUMERICAL ERROR OF 1-LEVEL 3-D DWT

Subband      Foreman                    Xylophone
             Avg. Error   Max. Error    Avg. Error   Max. Error
LLL          1.9797       3.0987        1.8376       2.7030
HLL          2.2756       3.6129        1.7068       2.8422
LLH          0.6019       1.2487        0.7493       1.5044
HLH          0.6468       1.1638        0.6285       1.4232
LHL          1.3203       2.426         0.8831       2.8559
HHL          1.138        1.8319        0.9561       1.6683
LHH          0.4495       1.1286        0.4783       1.0656
HHH          0.4866       0.9639        0.5528       2.0059

E. Synthesis Results

We have coded the proposed design in VHDL for frame-sizes 176 x 144 and 640 x 480 and frame-rates 15, 30 and 60, and synthesized it using the Xilinx ISE 12.1i tools along with the best of the existing designs, [15] and [16]. We have considered the Daub-4 wavelet filter for all the designs and coded the structures for 3-level DWT. We have used single-port block RAM (BRAM) for implementing the temporal-buffer of all the designs. The frame-buffer of [15] and the input-buffer of [16] are also implemented using single-port BRAM. We have implemented all the registers, shift-registers and the transposition-buffer using delay-type flip-flops. All the designs are synthesized for the FPGA device 6VLX760FF1760-2, and the results obtained from the synthesis reports are listed in Table XI. We have estimated the slice-delay product (SDP) to measure the area-time complexity of the designs on the FPGA platform. The SDP is defined as the product of the number of slices required by a design and its computation time (CT), where CT = (number of cycles required for computation) / (maximum usable frequency, MUF).
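As an illustration of the SDP figure of merit, the following sketch recomputes one Table XI entry from the cycle count of Table IX; small rounding differences aside, it reproduces the tabulated value.

def sdp_seconds(slices, cycles, muf_hz):
    """Slice-delay product: slices x CT, where CT = cycles / MUF."""
    return slices * (cycles / muf_hz)

# Proposed design, 176 x 144 at 15 fps: 13495 slices (Table XI),
# 47520 cycles (Table IX), MUF = 88.096 MHz
print(sdp_seconds(13495, 47520, 88.096e6))   # ~7.28 s, against 7.27 s in Table XI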

The proposed design has the lowest clock period and involves less memory (in terms of BRAMs), as expected from the theoretical estimates in Table VIII and Table IX. Although the proposed design involves less than half the multipliers of [15], it involves 25% more slices than [15] due to its adder complexity, pipeline registers and data selectors (MUXs, DMUXs). As shown in Table XI, the hardware complexities of the proposed structure and of the structure of [16] are independent of the frame-rate, while in the case of [15] the memory complexity increases almost proportionately with the frame-rate. The proposed structure for frame-size 176 x 144 and frame-rates 15, 30 and 60, respectively, involves 2.56 times, 5.12 times and 9.6 times fewer BRAMs than [15] and offers twice the throughput rate of [15]. Compared with the structure of [16], the proposed one

TABLE XI
COMPARISON OF SYNTHESIS RESULTS OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15], [16] FOR FPGA DEVICE 6VLX760FF1760-2

Structure      Frame-size   FPS   Slices   BR    MUF (MHz)   ACT (ms)   SDP (s)
Dai [15]       176 x 144    15    10751    64    50.077      1.08       11.61
                            30    10467    128   52.578      2.06       21.56
                            60    10607    240   50.121      4.32       45.82
Mohanty [16]   176 x 144    15    28581    48    40.21       0.59       16.862
                            30    28581    48    40.21       1.18       33.725
                            60    28581    48    40.21       2.36       67.45
Proposed       176 x 144    15    13495    25    88.096      0.539      7.27
                            30    13495    25    88.096      1.07       14.54
                            60    13495    25    88.096      2.15       29.09
Dai [15]       640 x 480    15    14035    544   50.036      13.13      184.27
Proposed       640 x 480    15    13662    384   88.096      6.53       89.21
                            30    13662    384   88.096      13.07      178.42
                            60    13662    384   88.096      26.15      356.85

Legend: BR: block RAM; FPS: frames per second; MUF: maximum usable clock frequency; SDP: slice-delay product, defined as the product of the number of slices required by each design and the computation time (CT), where CT = number of cycles required for computation / MUF.

involves 2.11 times fewer slices and 1.92 times fewer BRAMs than the structure of [16], and offers nearly the same throughput rate for the same frame-size and frame-rates. The proposed one has 1.59 times less SDP than the structure of [15] and 2.3 times less SDP than that of [16]. For frame-size 640 x 480 and frame-rate 15, the proposed structure involves 1.41 times fewer BRAMs and has 2.06 times less SDP than the structure of [15].

F. Comparison of Power Consumption

We have estimated the power consumption of the proposed design and the designs of [15] and [16] using the Xilinx XPower tool by implementing them on the FPGA device 6VLX760FF1760-2. The XPower analyzer report for a 40 MHz clock frequency is listed in Table XII for comparison. As shown in Table XII, for frame-size 176 x 144 and frame-rates 15, 30 and 60, the proposed structure dissipates, respectively, 9.4%, 31.68% and 74.7% less dynamic power than the structure of [15]. Compared with [16], the proposed one dissipates 71.9% less dynamic power. It dissipates 44.6% less dynamic power than the structure of [15] for frame-size 640 x 480 and frame-rate 15. This is mainly due to the smaller number of BRAMs used by the proposed structure.

VI. CONCLUSIONS

We have shown that, using overlapped grouping of frames and appropriate scheduling of the computation of different levels, the memory requirement of multilevel 3-D DWT structures can be drastically reduced. Based on this observation, we have

TABLE XII
COMPARISON OF POWER CONSUMPTION

Structure      Frame-size   FPS          Static power (W)   Dynamic power (W)
Dai [15]       176 x 144    15           3.213              0.202
                            30           3.214              0.247
                            60           3.217              0.334
Mohanty [16]   176 x 144    15, 30, 60   3.229              0.652
Proposed       176 x 144    15, 30, 60   3.212              0.183
Dai [15]       640 x 480    15           3.234              0.776
Proposed       640 x 480    15, 30, 60   3.221              0.430

suggested a computing scheme to reduce the memory complexity of 3-D DWT implementation. The remarkable feature of the proposed structure is that it does not involve any line-buffer or frame-buffer for level-1 computation. Compared with the best of the existing folded structures [15], the proposed design involves significantly less on-chip memory, a smaller frame-buffer and less ACT. Compared with [16], which may be taken as the best among all the existing designs, the proposed one involves 0.311Q times fewer multipliers, (26.88/Q) times more adders and Q/8 times more ACT, where Q is the input block-size; however, it involves 2.45N times less on-chip memory than the latter. For frame-size 176 x 144 and frame-rate 60, the proposed structure involves 7.96 times fewer memory words and 12.3% less ACT than [15]. Compared with [16] for input-block size Q = 16, the proposed structure involves 4.28 times fewer memory words and twice the ACT for the same frame-size and frame-rates. The synthesis results for the FPGA device 6VLX760FF1760-2 show that the proposed structure, for frame-size 176 x 144 and frame-rate 60, involves 9.6 times fewer BRAMs than [15] and offers twice the throughput rate of [15]. Compared with the structure of [16], the proposed one involves 1.9 times fewer BRAMs and offers nearly the same throughput rate. The proposed structure has significantly less SDP than the existing structures. Due to its lower memory complexity, the proposed structure consumes less dynamic power than the best of the existing structures. It can compute the multilevel running 3-D DWT of an unbounded sequence of GOFs and involves much less memory and fewer resources than the existing designs. It could, therefore, be used for high-performance video processing applications.

REFERENCES
[1] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, "3-D wavelet coding of video with arbitrary regions of support," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1063-1068, Sept. 2001.
[2] A. M. Baskurt, H. Benoit-Cattin, and C. Odet, "3D medical image coding method using a separable 3D wavelet transform," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 173-183, Apr. 1995.
[3] V. Sanchez, P. Nasiopoulos, and R. Abugharbieh, "Lossless compression of 4D medical images using H.264/AVC," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), vol. II, May 2006, pp. 1116-1119.
[4] J. Wei, P. Saipetch, R. K. Panwar, D. Chen, and B. K. Ho, "Volumetric image compression by 3D discrete wavelet transform (DWT)," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 184-194, Apr. 1995.


[5] L. Anqiang and L. Jing, "A novel scheme for robust video watermark in the 3D-DWT domain," in International Symposium on Data, Privacy, and E-Commerce (ISDPE 2007), Nov. 2007, pp. 514-516.
[6] J.-R. Ohm, M. van der Schaar, and J. W. Woods, "Interframe wavelet coding: Motion picture representation for universal scalability," J. Signal Process. Image Commun., vol. 19, no. 9, pp. 877-908, Oct. 2004.
[7] P.-C. Wu and L.-G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536-545, Apr. 2001.
[8] M. Weeks and M. A. Bayoumi, "Three-dimensional discrete wavelet transform architecture," IEEE Trans. Signal Processing, vol. 50, no. 8, pp. 2050-2063, Aug. 2002.
[9] M. Weeks and M. A. Bayoumi, "Wavelet transform: architecture, design and performance issues," Journal of VLSI Signal Processing, vol. 35, no. 2, pp. 155-178, Sept. 2003.
[10] W. Badawy, M. Talley, G. Zhang, M. Weeks, and M. A. Bayoumi, "Low power very large scale integration prototype for three-dimensional discrete wavelet transform processor with medical application," Journal of Electronic Imaging, vol. 12, no. 2, pp. 270-277, Apr. 2003.
[11] B. Das, A. Hazra, and S. Banerjee, "An efficient architecture for 3-D discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 286-296, Feb. 2010.
[12] B. Das and S. Banerjee, "Low power architecture of running 3-D wavelet transform for medical imaging application," in Proc. Eng. Med. Biol. Soc./Biomed. Eng. Soc. Conf., vol. 2, 2002, pp. 1062-1063.
[13] B. Das and S. Banerjee, "A memory efficient 3D DWT architecture," in Proceedings of the 16th International Conference on VLSI Design, IEEE Computer Society, Aug. 2003.
[14] B. Das and S. Banerjee, "Data-folded architecture for running 3-D DWT using 4-tap Daubechies filters," IEE Proc. Circuits Devices Syst., vol. 152, no. 1, pp. 17-24, Feb. 2005.
[15] Q. Dai, X. Chen, and C. Lin, "A novel VLSI architecture for multidimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1105-1110, Aug. 2004.
[16] B. K. Mohanty and P. K. Meher, "Parallel and pipeline architecture for high-throughput computation of multilevel 3-D DWT," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 9, pp. 1200-1209, Sept. 2010.
[17] Z. Taghavi and S. Kasaei, "A memory efficient algorithm for multidimensional wavelet transform based on lifting," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 6, 2003, pp. 401-404.
[18] P. K. Meher, B. K. Mohanty, and J. C. Patra, "Hardware-efficient systolic-like modular design for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 2, pp. 151-154, Feb. 2008.
[19] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst. II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 677-688, Oct. 1996.
[20] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619-629, Aug. 1993.
[21] I. Daubechies and W. Sweldens, "Orthonormal bases of compactly supported wavelets," Comm. Pure Appl. Math., vol. 41, pp. 909-996, 1988.

Basant K. Mohanty (M'06) received the B.Sc. and M.Sc. degrees (both with first-class honors) in Physics from Sambalpur University, Orissa, India, in 1987 and 1989, respectively, and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000. In 1992 he was selected by the Orissa Public Service Commission (OPSC) and joined the Department of Physics, SKCG College, Paralakhemundi, Orissa, as a faculty member. In 2001 he joined the EEE Department, BITS Pilani, Rajasthan, as a Lecturer, and subsequently joined the Department of ECE, Mody Institute of Education and Research (Deemed University), Rajasthan, as an Assistant Professor. In 2003 he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interests include the design and implementation of reconfigurable VLSI architectures for resource-constrained digital signal processing applications. He has published nearly 30 technical papers. He currently serves as a reviewer for the IEEE Transactions on Circuits and Systems-II: Express Briefs, the IEEE Transactions on Circuits and Systems for Video Technology, and the IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Dr. Mohanty is a life member of The Institution of Electronics and Telecommunication Engineers, New Delhi, India.

Pramod Kumar Meher (SM'03) received the B.Sc. and M.Sc. degrees in Physics and the Ph.D. degree in Science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively. He has a wide scientific and technical background covering physics, electronics, and computer engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in Electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in Physics with various Government colleges in India from 1981 to 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 170 technical papers to various reputed journals and conference proceedings. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and a Fellow of the Institution of Engineering and Technology, UK. He is serving as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society, and as an Associate Editor for the IEEE Transactions on Circuits and Systems-II: Express Briefs, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.
