Hsia 2013

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 4, APRIL 2013, p. 671

Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based Discrete Wavelet Transform
Chih-Hsien Hsia, Member, IEEE, Jen-Shiun Chiang, Member, IEEE, and Jing-Ming Guo, Senior Member, IEEE
Abstract—Memory requirements (for storing intermediate signals) and critical path are essential issues for 2-D (or multidimensional) transforms. This paper presents new algorithms and hardware architectures to address the above issues in 2-D dual-mode (supporting 5/3 lossless and 9/7 lossy coding) lifting-based discrete wavelet transform (LDWT). The proposed 2-D dual-mode LDWT architecture has the merits of low transpose memory (TM), low latency, and regular signal flow, making it suitable for very large-scale integration implementation. The TM requirements of the N×N 2-D 5/3 mode LDWT and 2-D 9/7 mode LDWT are 2N and 4N, respectively. Comparison results indicate that the proposed hardware architecture has a lower TM size requirement than previous lifting-based architectures. As a result, it can be applied to real-time visual operations such as JPEG2000, motion-JPEG2000, MPEG-4 still texture object decoding, and wavelet-based scalable video coding applications.

Index Terms—2-D dual-mode lifting-based discrete wavelet transform, JPEG2000, low transpose memory.

Manuscript received January 31, 2012; revised April 24, 2012 and June 11, 2012; accepted June 28, 2012. Date of publication August 6, 2012; date of current version April 1, 2013. This paper was recommended by Associate Editor G. Lafruit.
C.-H. Hsia and J.-M. Guo are with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: chhsia@ee.tku.edu.tw; jmguo@seed.net.tw).
J.-S. Chiang is with the Department of Electrical Engineering, Tamkang University, New Taipei 25137, Taiwan (e-mail: chiang@ee.tku.edu.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2211953
1051-8215/$31.00 © 2012 IEEE

I. Introduction

THE DISCRETE wavelet transform (DWT) has been adopted for a wide range of applications, including speech analysis, numerical analysis, signal analysis, image coding, pattern recognition, computer vision, and biometrics. It can be considered as a multiresolution decomposition of a signal into several components with different frequency bands. Performing 2-D DWT requires many computations and a large block of transpose memory (TM) with long latency time. Yet, real-time processing requires a low memory requirement and low computational complexity to improve the hardware utilization efficiency. Normally, implementations of 2-D DWT are classified as convolution-based operation [1], [14] and lifting-based operation [6]. Since the convolution-based implementations of DWT have high computational complexity and large memory requirements, lifting-based DWT has been presented to overcome these drawbacks [12]. The lifting-based scheme can provide low-complexity solutions for image or video compression applications, such as JPEG2000 [13], motion-JPEG2000 [19], MPEG-4 still image coding [20], and MC-EZBC [21]. Yet, the real-time 2-D DWT for multimedia application is still difficult to achieve, and thus, efficient transformation schemes for real-time application are highly demanded. Among the variety of DWT algorithms, lifting-based discrete wavelet transform (LDWT) is a new approach for constructing biorthogonal wavelet transforms, which can also efficiently calculate classical wavelet transforms [2]–[11], [13], [17], [18], [22]–[32]. Factoring the classical wavelet filter into lifting steps can reduce the computational complexity of the corresponding DWT by up to 50% [12]. The lifting steps are easy to implement, which is different from the direct finite impulse response (FIR) implementations of Mallat’s algorithm [12]. Diou et al. [2] presented an architecture that performs the LDWT with a 5/3 filter by an interleaving technique. Andra et al. [3] proposed a block-based simple four-processor architecture that computes several stages of the DWT simultaneously. Chen and Wu [4] proposed a folded and pipelined architecture for the 2-D LDWT implementation, with memory of size 2.5N for an N×N 2-D DWT. This lifting architecture for vertical filtering is divided into two parts, each consisting of one adder and one multiplier. Since both parts are activated in different cycles, they can share the same adder and multiplier to increase the hardware utilization and reduce the latency. However, this architecture also has high complexity due to the characteristics of the signal flow. Chen [5] proposed a flexible and folded architecture for three-level 1-D LDWT to increase the hardware utilization. Chiang and Hsia [6] proposed a 2-D DWT folded architecture to improve the hardware utilization. Jiang and Ortega [8] presented a parallel processing architecture that models the DWT computation as a finite state machine and efficiently computes the wavelet coefficients near the boundary of each segment of the input signal. Jung and Park [9] presented an efficient very large-scale integration (VLSI) architecture of the dual-mode LDWT used for JPEG2000. Chen [10] used a 1-D folded architecture to improve the hardware utilization of the 5/3 and 9/7 filters. The recursive architecture is a general scheme to implement any wavelet filter that is decomposed into lifting steps with less hardware complexity. In [14], an average computing time (CT) of N^2 can be achieved for all DWT levels. However, it uses many multipliers and adders. The architecture of [17] implements 2-D DWT
with only TM by using the recursive pyramid algorithm. Varshney et al. [22] presented energy-efficient single-processor and fully pipelined architectures for the 2-D 5/3 lifting-based JPEG2000. The single processor performs both row-wise and column-wise processing simultaneously to achieve the 2-D transform with 100% hardware utilization. Tan and Arslan [24] presented a shift-accumulator arithmetic logic unit architecture for the 2-D lifting-based JPEG2000 5/3 DWT. This architecture has an efficient memory organization, which uses a small amount of embedded memory for processing and buffering. Those architectures achieve multilevel decomposition using an interleaving scheme that reduces the size of memory and the number of memory accesses, yet slow throughput rates and inefficient hardware utilization are the key issues. Huang et al. [25] proposed a generic random access memory (RAM)-based architecture with high efficiency and feasibility for the 2-D DWT. Wu and Lin [26] presented a high-performance and low-memory architecture to implement the 2-D dual-mode LDWT. The pipelined signal path of their architecture is regular and practical. Lan et al. [27] proposed a scheme that can process two lines simultaneously by processing two pixels in a clock period. Wu and Chen [28] proposed an efficient VLSI architecture for the direct 2-D LDWT, in which the poly-phase decomposition and coefficient folding are adopted to improve the hardware utilization. Despite the above improvements to the existing architectures, further refinements in the algorithm and architecture are still needed. The parameters of the 9/7 filter are irrational; thus, it is very difficult to implement it in a simple manner. The approach proposed in [29] implemented the 9/7 filter with integer techniques using shifters and adders instead of multipliers for motion-JPEG2000. The integer wavelet transform can achieve nearly the same performance but requires much less hardware cost. Mohanty and Meher [30] proposed a scalable array architecture based on FPGA for the 2-D 9/7 LDWT (only 9/7 filter). This is a massively parallel processing architecture, which can scalably process the block size according to the size of the input image and generally involves TM of size 10N. The major advantage of this architecture is the saving of frame buffer. However, it has some inherent problems associated with the block size for an input image. Since the image quality will keep increasing in the future, the block size increases accordingly; thus, the full parallel architecture may induce very high hardware requirements, such as adders, multipliers, registers, and multiplexers (MUX)/demultiplexers (DeMUX). Zhang et al. [31] presented a pipelined architecture for the 2-D LDWT using a nonseparable approach to avoid the transposition memory and frame buffer. However, nonseparable 2-D DWT leads to a higher complexity control component, since it needs more computation time than the separable 2-D LDWT under the same throughput rate. In addition, it is difficult to design the desired 2-D LDWT. Thus, several VLSI architectures have been proposed for efficient implementation of the 2-D LDWT [2]–[15], [17], [18], [22]–[32]. Yet, these hardware architectures need large TM components or arithmetic components. A low TM requirement is the priority concern in spatial-frequency domain implementation. In general, raster scan signal flow operations are popular in N×N 2-D DWT, and the memory requirement ranges from 2N to N^2 for the 2-D 5/3 and 9/7 modes LDWT [2]–[6], [9], [13], [14], [17], [18], [25]–[32]. To reduce the TM, the memory access must be redirected. This paper presents a new approach, namely, the interlaced read scan algorithm (IRSA), which changes the signal reading order from row-wise only to mixed row- and column-wise, and thus reduces the TM. Given an image of size N×N, the internal memory sizes of the proposed 2-D LDWT for 5/3 and 9/7 filters are simply 2N and 4N, respectively. The proposed 2-D LDWT is based on parallel and pipelined schemes to increase the operation speed. For hardware implementation, it can replace the multipliers with shifters and adders to yield high hardware utilization. Consequently, this 2-D LDWT has the characteristics of high hardware utilization, low memory requirement, and regular signal flow. In this paper, the 256×256 2-D dual-mode LDWT was designed and simulated by VerilogHDL, and further synthesized by the Synopsys design compiler with TSMC 0.18-μm 1P6M CMOS process technology.

II. Lifting-Based DWT

DWT performs multiresolution decomposition of the input signals [1]. The original signals are first decomposed into two subspaces, called the low-frequency (low-pass) and high-frequency (high-pass) subbands. The classical DWT implements the decomposition (analysis) of a signal by a low-pass digital filter H and a high-pass digital filter G. Both of the digital filters are derived using the scaling function and the corresponding wavelets. The system downsamples the signal to decimate half of the filtered results in decomposition processing. The z-transfer functions of H(z) and G(z), based on the four-tap and nonrecursive FIR filters with length L, are

    H(z) = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3}    (1)

    G(z) = g_0 + g_1 z^{-1} + g_2 z^{-2} + g_3 z^{-3}.    (2)

The reconstruction (synthesis) process is implemented using an upsampling process. Mallat’s tree algorithm or pyramid algorithm [1] can be used to find the multiresolution decomposition DWT. The decomposition DWT coefficients at each resolution level can be calculated as

    X_H^{j}(n) = \sum_{i=0}^{k-1} G(z) X^{j-1} H(2n - i)    (3)

    X_L^{j}(n) = \sum_{i=0}^{k-1} H(z) X^{j-1} G(2n - i)    (4)

where j denotes the current resolution level, k denotes the number of the filter tap, X_H^{j}(n) denotes the nth high-pass DWT coefficient at the jth level, and X_L^{j}(n) denotes the nth low-pass DWT coefficient at the jth level.
The image is first analyzed horizontally to generate two subimages. The information is then sent into the second round 1-D DWT to perform the vertical analysis to generate four subbands, and each with a quarter of size of the original
image. Considering an image of size N×N, each band is subsampled by a factor of two, so that each wavelet frequency band contains N/2×N/2 samples. The four subbands can be integrated to generate an output image with the same number of samples as the original one. Most of the image compression applications can iteratively apply the above 2-D wavelet decomposition to the LL subimage, each time forming four new subband images, to minimize the energy in the lower frequency bands.

Fig. 1. Block diagram of the LDWT.
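The row-then-column decomposition described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `analyze_1d` uses a plain Haar average/difference pair as a stand-in for the 1-D analysis filters (it is not the 5/3 or 9/7 filter of this paper), and all function names are hypothetical. The first letter of each subband name refers to the first (horizontal) pass and the second letter to the vertical pass.

```python
def analyze_1d(x):
    """Stand-in two-band analysis (Haar averages/differences), NOT the paper's filters."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def analyze_2d(img):
    """One decomposition level: horizontal pass, then vertical pass on each half."""
    rows = [analyze_1d(r) for r in img]
    low_half = [lo for lo, _ in rows]    # N rows x N/2 columns
    high_half = [hi for _, hi in rows]   # N rows x N/2 columns

    def vertical(half):
        cols = [analyze_1d(c) for c in transpose(half)]
        return (transpose([lo for lo, _ in cols]),
                transpose([hi for _, hi in cols]))

    ll, lh = vertical(low_half)
    hl, hh = vertical(high_half)
    return ll, lh, hl, hh


def multilevel(img, levels):
    """Re-apply the 2-D decomposition to the LL subimage at each level."""
    details = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = analyze_2d(ll)
        details.append((lh, hl, hh))
    return ll, details
```

For an N×N input with N even, each of the four returned subbands has N/2×N/2 samples, so the total sample count is unchanged, as stated in the text.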
The lifting-based scheme proposed by Daubechies and Sweldens requires fewer computations than that of the traditional convolution-based approach [12]. The lifting-based scheme is an efficient implementation for DWT, in which integer operations are used to avoid the problems caused by the finite precision or rounding. The Euclidean algorithm can be used to factorize the poly-phase matrix of a DWT filter into a sequence of alternating upper and lower triangular matrices and a diagonal matrix. The variables h(z) and g(z) in (5) denote the low-pass and high-pass analysis filters, respectively, which can be divided into even and odd parts to generate the poly-phase matrix P(z) as shown in (6)

    g(z) = g_e(z^2) + z^{-1} g_o(z^2)
    h(z) = h_e(z^2) + z^{-1} h_o(z^2)    (5)

    P(z) = \begin{bmatrix} h_e(z) & g_e(z) \\ h_o(z) & g_o(z) \end{bmatrix}.    (6)

The Euclidean algorithm recursively finds the greatest common divisors of the even and odd parts of the original filters. Since h(z) and g(z) form a complementary filter pair, P(z) can be factorized as

    P(z) = \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \begin{bmatrix} k & 0 \\ 0 & 1/k \end{bmatrix}    (7)

where s_i(z) and t_i(z) (for 1 ≤ i ≤ m, where m is the even degree of the polynomial h(z)) are Laurent polynomials of lower orders. Computation of the upper triangular matrix is referred to in [12] as lifting the low-pass coefficient with the help of the high-pass coefficient. Similarly, the computation of the lower triangular matrix is lifting the high-pass coefficient with the aid of the low-pass coefficient [12]. This corresponds to the prediction and update steps, respectively, and k is a nonzero constant. Consequently, the filter bank can be factorized into three lifting steps.
As illustrated in Fig. 1, the lifting-based scheme involves the following four stages.
1) Split phase: The original signal is divided into two disjoint subsets. The variables Xe and Xo denote the sets of even and odd samples, respectively. This phase is called lazy wavelet transform because it does not decorrelate the data, but only subsamples the signal into even and odd samples.
2) Predict phase: The predicting operator α is applied to the subset Xo to obtain the wavelet coefficients d[n] as

    d[n] = X_o[n] + \alpha \times (X_e[n]).    (8)

3) Update phase: Xe[n] and d[n] are combined to obtain the scaling coefficients s[n] after an update operator β as

    s[n] = X_e[n] + \beta \times (d[n]).    (9)

4) Scaling: In the final step, the normalization factor is applied on s[n] and d[n] to obtain the wavelet coefficients. For example, (10) and (11) describe the implementation of the lifting analysis, and are used to calculate the odd (high-pass) and even (low-pass) coefficients, respectively

    H[n] = d[n] \times K_0    (10)

    L[n] = s[n] \times K_1.    (11)

Fig. 2. Block diagram flow of a traditional 2-D DWT.

Although the lifting-based scheme has low complexity, its long and irregular signal paths induce a major limitation for efficient hardware implementations. In addition, the increasing number of pipelined registers increases the internal memory size of the 2-D DWT architecture [13]. The 2-D LDWT uses vertical 1-D LDWT subband decomposition and a horizontal 1-D LDWT subband decomposition to yield the 2-D LDWT coefficients. Thus, the memory requirement dominates the hardware cost and architectural complexity of 2-D LDWT. Fig. 2 shows the block diagram flow of the traditional 2-D DWT operation. The default wavelet filters in JPEG2000 are dual-mode (5/3 and 9/7 modes) LDWT [13]. Figs. 3 and 4 show the lifting-based steps associated with the dual-mode wavelets. Assuming that the original signals are infinite in length, the first lifting stage is first applied to perform the DWT.
Fig. 3 shows the lifting-based step associated with the wavelet. The original signals including X0, X1, X2, X3, X4, X5, . . . are the original input pixel sequences. If the original signals are infinite in length, then the first-stage lifting is applied to predict the odd index data X1, X3, . . .. In (12), the parameters −1/2 and Hi denote the first-stage lifting parameter and outcome, respectively. Equations (12) and (13) show the operation of the 5/3 integer LDWT

    H_0 = [(X_0 + X_2) \times \alpha + X_1] \times K_0    (12)

    L_1 = [(H_0 + H_1) \times \beta + X_2] \times K_1    (13)

where α = −1/2, β = 1/4, and K0 = K1 = 1.
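The predict and update steps of (12) and (13) can be written as a short Python sketch. It is a minimal illustration, not the hardware data path: the function name `lift_53` is hypothetical, the arithmetic is left in floating point without the integer rounding of a real 5/3 implementation, and the mirror (symmetric) extension at both ends is an assumption made here so that a finite input yields as many coefficients as samples.

```python
def lift_53(x):
    """One-level 1-D 5/3 lifting of an even-length sequence, following (12) and (13)."""
    alpha, beta, k0, k1 = -0.5, 0.25, 1.0, 1.0
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expects an even-length input with at least two samples")
    half = n // 2

    # Predict: H[i] = [(X[2i] + X[2i+2]) * alpha + X[2i+1]] * K0.
    # Past the right edge, X[n] is mirrored to X[n-2] (assumed extension).
    high = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        high.append(((x[2 * i] + right) * alpha + x[2 * i + 1]) * k0)

    # Update: L[i] = [(H[i-1] + H[i]) * beta + X[2i]] * K1.
    # At the left edge, the mirrored H[-1] equals H[0] (assumed extension).
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        low.append(((left + high[i]) * beta + x[2 * i]) * k1)
    return low, high
```

On a constant input the high-pass outputs are all zero and the low-pass outputs reproduce the constant, which is a quick sanity check of the two steps.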
TABLE I
Frame Rate of the Conventional 2-D LDWT (Image Is of Size 640×480)

Level                       | Execution time (s/f) | Frame rate (f/s)
Level 1                     | 0.171                | 5.85
Level 1 + Level 2           | 0.188                | 5.32
Level 1 + Level 2 + Level 3 | 0.20                 | 4.93
Fig. 3. 5/3 LDWT algorithm.

Fig. 4. 9/7 LDWT algorithm.

Together with the high-frequency lifting parameter, α, and the input signal, we can find the first-stage high-frequency wavelet coefficients H0. Subsequently, Hi together with the low-frequency parameter, β, and the input signals of the second-stage low-frequency wavelet coefficients, L1, can be found as well in (13). The third and fourth lifting stages can be found in a similar manner. Moreover, the finite-length signal processed by the DWT leads to the boundary effect [17]. An appropriate signal extension is required to maintain the same number of the wavelet coefficients as the original signal. The embedded signal extension algorithm from [7] can be used to compute the boundary of the image.
Similar to the one-level 1-D 5/3 mode LDWT, the calculation of the one-level 1-D 9/7 mode LDWT is given as

    d_0 = (X_0 + X_2) \times \alpha + X_1
    s_1 = (d_0 + d_1) \times \beta + X_2
    H_0 = [(s_0 + s_1) \times \gamma + d_0] \times K_0
    L_1 = [(H_0 + H_1) \times \delta + s_1] \times K_1    (14)

where α = −1.586134142, β = −0.052980118, γ = 0.882911075, δ = +0.443506852, K0 = 1.230174104, and K1 = 1/K0. The calculation comprises four lifting steps and two scaling steps.
JPEG2000 has a better quality at a low bit rate operation and higher compression ratio compared to the widely used still image compression standard JPEG [13]. Moreover, JPEG2000 has distinctive functions such as progressive image transmission determined by quality or resolution. Fig. 5 shows the CT of various function blocks in JPEG2000 using the software profiling tool [35]. It is clear that the 2-D DWT requires the second highest computation time in the JPEG2000 system. Table I shows the frame rates of the JPEG2000 encoded using various levels of conventional 2-D LDWT. The experimental environment is set using an AMD 1.8-GHz CPU, 1-GB RAM, Microsoft Windows XP SP3, and Borland C++ Builder (BCB) 6.0. The frame rate is around 5 frames per second (f/s). Clearly, the 2-D LDWT implementation must be revised and enhanced to meet the real-time requirement.

Fig. 5. Profile of JPEG2000 encoding function blocks.

Fig. 6. System block diagram of the proposed 2-D DWT. (a) 2-D dual-mode LDWT. (b) Block diagram of the proposed system architecture.

III. Interlaced Read Scan Algorithm

In recent years, many 2-D LDWT architectures have been proposed to meet the requirements of on-chip memory for real-time processing. Yet, the hardware utilization of these architectures needs to be further improved. In DWT implementation, a 1-D DWT needs massive computation; thus, the computation unit dominates most of the hardware cost [2]–[6], [10]–[15], [22]–[31]. A 2-D DWT is composed of two 1-D DWTs and a block of TM. In the conventional approach, the size of the TM is identical to the size of the processed image signal. Fig. 6(a) shows the concept of the proposed dual-mode LDWT architecture that consists of the signal arrangement unit, processing element, memory unit, and control unit, as shown in Fig. 6(b). The outputs are fed to the 2-D LDWT four-subband coefficients, HH, HL, LH, and LL. The proposed architecture is described in detail in this section, and we focus on the 2-D dual-mode LDWT.
Considering the computation unit, the TM becomes the main overhead of the 2-D DWT. Fig. 2 shows the block diagram of the conventional 2-D DWT. Without loss of generality, the 2-D 5/3 mode LDWT is considered for explaining the 2-D LDWT. Given an image of size N×N, a large block of TM (order of N^2) is required to store the DWT coefficients after the computation of the first-stage 1-D DWT decomposition. Subsequently, the second-stage 1-D DWT uses the stored data to compute the 2-D DWT coefficients of the four subbands. The computation and the access of the memory take time, and therefore lengthen the latency. Since a memory of size N^2 is very large, the proposed IRSA is employed to reduce the required TM to 2N or 4N (5/3 or 9/7 mode).
Without loss of generality, let us consider an image of size 6×6 to describe the 2-D 5/3 mode LDWT operation and IRSA. Fig. 7 shows the operation diagram of the 2-D 5/3 mode LDWT operations of the above image. Herein, X(i, j), i = 0 to 5, and j = 0 to 5 denotes the original image. The left two rows in gray are the left boundary extension rows, and
the row in gray on the right is the right boundary extension row. The embedded signal extension algorithm [7] can be used to compute the boundary of the image. The signal extension is simply required at the beginning and at the end of the process. For JPEG2000, the standard data extension should be symmetrical. The left half of Fig. 7 shows the first-stage 1-D DWT operations. The right half of Fig. 7 shows the second-stage 1-D DWT operations for yielding the four subband coefficients, HH, HL, LH, and LL. In the first-stage 1-D DWT, three pixels are used to calculate the 1-D high-frequency coefficient. For example, X(0, 0), X(1, 0), and X(2, 0) are used to yield the high-frequency coefficient H(0, 0), H(0, 0) = −[X(0, 0) + X(2, 0)]/2 + X(1, 0). To calculate the next high-frequency coefficient H(1, 0), pixels X(2, 0), X(3, 0), and X(4, 0) are employed. The first pixel X(2, 0), which is called the overlapped pixel, is used to calculate both the H(0, 0) and the H(1, 0). The low-frequency coefficient is calculated using two consecutive high-frequency coefficients and the overlapped pixel. For example, H(0, 0) and H(1, 0) along with X(2, 0) are used to calculate the low-frequency coefficient L(1, 0) = [H(0, 0) + H(1, 0)]/4 + X(2, 0). The calculated high-frequency coefficients, H(i, j), and the low-frequency coefficients, L(i, j), are then used in the second-stage 1-D DWT to calculate the four subband coefficients, HH, HL, LH, and LL.

Fig. 7. Example of 2-D 5/3 mode LDWT operations. X(i, j): original image, i = 0 to 5 and j = 0 to 5. H(i, j): high-frequency wavelet coefficient of 1-D LDWT. L(i, j): low-frequency wavelet coefficient of the 1-D LDWT. HH(i, j): high–high frequency wavelet coefficient of the 2-D LDWT. HL(i, j): high–low frequency wavelet coefficient of the 2-D LDWT. LH(i, j): low–high frequency wavelet coefficient of the 2-D LDWT. LL(i, j): low–low frequency wavelet coefficient of the 2-D LDWT.

In the second-stage 1-D DWT of Fig. 7, the first HH coefficient, HH(0, 0), is obtained using H(0, 2), H(0, 1), and H(0, 0), in which HH(0, 0) = −[H(0, 0) + H(0, 2)]/2 + H(0, 1). The other HH coefficients can be computed likewise using three consecutive H(i, j) signals in a column. For the two consecutive columns HH coefficients, the additional overlapped H(i, j) signal is involved. For example, H(0, 2) is the overlapped signal for computing HH(0, 0) and HH(0, 1). To compute HL coefficients, it needs two consecutive columns HH coefficients and an overlapped H(i, j) signal. For example, HL(0, 1) is computed from HH(0, 0), HH(0, 1), and H(0, 2), HL(0, 1) = [HH(0, 0) + HH(0, 1)]/4 + H(0, 2). The LH coefficients are computed from the L(i, j) signal and each LH coefficient is obtained from three L(i, j) signals. For example, LH(0, 0) is computed from L(0, 0), L(0, 1), and L(0, 2), LH(0, 0) = −[L(0, 0) + L(0, 2)]/2 + L(0, 1). For
the two consecutive columns LH coefficients, the additional overlapped L(i, j) signal is involved. For example, L(0, 2) is the overlapped signal for computing LH(0, 0) and LH(0, 1). To compute LL coefficients, two consecutive columns LH coefficients and an overlapped L(i, j) signal are required. For example, LL(0, 1) is computed from LH(0, 0), LH(0, 1), and L(0, 2), LL(0, 1) = [LH(0, 0) + LH(0, 1)]/4 + L(0, 2).
From the descriptions of the operations of the 2-D 5/3 mode LDWT, we find that each 1-D high-frequency coefficient, H(i, j), is calculated from three image signals, and one of the image signals is overlapped with the previous H(i, j). The 1-D low-frequency coefficient, L(i, j), is calculated from the two consecutive rows in H(i, j) and an overlapped pixel. The HH, HL, LH, and LL coefficients are computed from H(i, j) and L(i, j). If we can change the scanning order of the first-stage 1-D LDWT and the output order of the second-stage 1-D LDWT, during the 2-D LDWT operation we simply need to store the H(i, j) to the TM [first-in first-out (FIFO) of size N] and the overlapped pixels to the internal memory (R4 + R9 of size N). For an image of size N×N, the size of the TM block can be reduced to 2N, as shown in Fig. 8. The proposed IRSA is based on this idea, which can significantly reduce the requirement of the TM. The block diagram of the IRSA is shown in Fig. 8, in which the numbers on the top and left represent the coordinate indexes of a 2-D image. To increase the operation speed, each time the IRSA scans two pixels in the consecutive rows. IN1 (initially at X(0, 0)) and IN2 (initially at X(0, 1)) are the scanning inputs at the beginning. At the first clock, the system scans two pixels, X(0, 0) and X(0, 1), from IN1 and IN2, respectively. At the second clock, IN1 and IN2 read pixels X(1, 0) and X(1, 1), respectively. At clock 3, IN1 and IN2 read pixels X(2, 0) and X(2, 1), respectively. As IN1 and IN2 have read three pixels, the DWT processor computes the two 1-D high-frequency coefficients, H(0, 0) and H(0, 1), which are then stored in the TM for the subsequent computation of the low-frequency coefficients. Pixels X(2, 0) and X(2, 1) are stored in the internal memory for the subsequent computation of the 1-D high-frequency coefficients.
At clock 4, the DWT processor scans pixels in rows 2 and 3, and IN1 and IN2 read pixels X(0, 2) and X(0, 3), respectively. At clock 5, IN1 and IN2 read pixels X(1, 2) and X(1, 3), respectively. At clock 6, IN1 and IN2 read pixels X(2, 2) and X(2, 3), respectively. At this moment, the DWT processor computes the two high-frequency coefficients, H(0, 2) and H(0, 3), upon pixels X(0, 2) to X(2, 2) and X(0, 3) to X(2, 3), respectively, and the two high-frequency coefficients are stored in the TM for the subsequent computation of the low-frequency coefficients. Pixels X(2, 2) and X(2, 3) are stored in the internal memory for the subsequent computation of the high-frequency coefficients. Then (at clock 7) the DWT processor scans the subsequent two rows to read three consecutive pixels in each row and compute the high-frequency coefficients. The coefficients are stored in the TM, and pixels X(2, 4) and X(2, 5) are stored in the internal memory. This procedure continues reading three pixels, computing the high-frequency coefficients, storing the coefficients to the TM, and storing pixels X(2, j) and X(2, j + 1) to the internal memory in each row until the last row.
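The read order walked through above can be reproduced with a small Python generator. This is a sketch of the scan sequence only, under assumptions stated here: the name `irsa_scan` is hypothetical, the number of rows is assumed even, the first pass reads three samples per row and every later pass reads two (one new odd sample plus the next overlapped even sample), and a final pass with a single remaining sample is read as is, with the boundary extension left out.

```python
def irsa_scan(n_samples, n_rows):
    """Yield (clock, in1, in2); in1 and in2 are the (i, j) coordinates read by IN1 and IN2."""
    clock = 0
    start = 0
    while start < n_samples:
        # First pass: three samples per row. Later passes: two samples per row.
        count = 3 if start == 0 else 2
        stop = min(start + count, n_samples)
        for j in range(0, n_rows, 2):          # IN1 takes row j, IN2 takes row j + 1
            for i in range(start, stop):
                clock += 1
                yield clock, (i, j), (i, j + 1)
        start = stop
```

For the 6×6 example, `list(irsa_scan(6, 6))` starts with X(0, 0)/X(0, 1), X(1, 0)/X(1, 1), X(2, 0)/X(2, 1), then moves to rows 2 and 3 at clock 4, and returns to rows 0 and 1 for X(3, 0)/X(3, 1) after the last row pair has been visited.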
Fig. 8. IRSA of the 2-D LDWT.

Subsequently, the DWT processor scans rows 0 and 1, and IN1 and IN2 read pixels X(3, 0) and X(3, 1), respectively. At the next clock, IN1 and IN2 read pixels X(4, 0) and X(4, 1), respectively. The DWT processor uses pixels X(3, 0), X(4, 0), and X(2, 0) that were stored previously to compute the high-frequency coefficient H(1, 0). Simultaneously, the DWT processor uses pixels X(3, 1), X(4, 1), and X(2, 1) that were stored previously to compute the high-frequency coefficient H(1, 1). As soon as H(1, 0) and H(1, 1) are found, H(0, 0), H(1, 0), and X(2, 0) are used to generate the low-frequency coefficient L(1, 0); H(0, 1), H(1, 1), and X(2, 1) are used to generate the low-frequency coefficient L(1, 1). The computed high-frequency coefficients are then stored in the TM, and pixels X(4, 0) and X(4, 1) replace pixels X(2, 0) and X(2, 1) for storing in the internal memory. IN1 and IN2 then read the data in rows 2 and 3 to go through the same operations until the last pixel. The detailed operations are shown in Fig. 9.
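The memory argument behind these steps can be checked with a behavioral Python sketch of the first-stage 1-D 5/3 computation. It is not the hardware architecture: the function name and the list-based buffers are hypothetical, rows are visited one at a time rather than in IN1/IN2 pairs, and symmetric extension at both ends is assumed. What it does show is that only one stored high-frequency coefficient per row (the TM FIFO) and one overlapped pixel per row (the internal memory) are carried between passes, which is 2N values for N rows.

```python
def first_stage_irsa(img):
    """img[j][i] holds X(i, j). Returns dicts H and L keyed by (n, j)."""
    n_rows, n_samples = len(img), len(img[0])
    tm = [None] * n_rows        # last high-frequency coefficient of each row (FIFO stand-in)
    overlap = [None] * n_rows   # last overlapped even pixel of each row
    H, L = {}, {}
    for n in range(n_samples // 2):        # one pass over all rows per coefficient index
        for j in range(n_rows):
            row = img[j]
            left = row[0] if n == 0 else overlap[j]      # X(2n, j), read from memory
            mid = row[2 * n + 1]                          # X(2n + 1, j)
            if 2 * n + 2 < n_samples:
                right = row[2 * n + 2]                    # X(2n + 2, j)
            else:
                right = row[n_samples - 2]                # mirrored sample (assumption)
            h = mid - (left + right) / 2                  # H(n, j)
            h_prev = h if n == 0 else tm[j]               # mirrored H(-1, j) = H(0, j)
            L[(n, j)] = left + (h_prev + h) / 4           # L(n, j)
            H[(n, j)] = h
            tm[j] = h                                     # keep H(n, j) for the next pass
            overlap[j] = right                            # keep the overlapped pixel
    return H, L
```

With this indexing, `L[(1, 0)]` equals [H(0, 0) + H(1, 0)]/4 + X(2, 0), matching the example given for the first-stage operations.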
The second-stage 1-D DWT works in a similar manner as the first-stage 1-D DWT. In the HH and HL operations, when three consecutive H(i, j) signals in a column are found in the first-stage 1-D DWT, an HH coefficient can be computed. As soon as the HH coefficients of two consecutive columns are found, the two HH coefficients and the overlapped H(i, j) can be combined to compute an HL coefficient. Similarly, when three consecutive L(i, j) signals in a column are found in the first-stage 1-D DWT, an LH coefficient can be computed. As soon as two LH coefficients are found, the two LH coefficients and the overlapped L(i, j) are used to compute an LL coefficient. Fig. 10 shows the detailed operations for the second-stage 1-D DWT.

Fig. 9. Detail operations of the first-stage 1-D DWT.

IV. VLSI Architecture and Implementation for the 2-D Dual-Mode LDWT

The IRSA has been discussed in the previous section and the architecture of the IRSA is described in this section. The control unit manages the reads from the off-chip memory. In IRSA, two pixels are scanned concurrently and the system needs two processing units. For the 2-D LDWT processing, the pixels are processed by the first-stage 1-D DWT. The outputs are then fed to the second-stage 1-D DWT to yield the four subband coefficients, HH, HL, LH, and LL. Two parts are involved in the architecture, namely, the first-stage 1-D DWT and the second-stage 1-D DWT. Herein, we focus on the 2-D 5/3 mode LDWT.

A. First-Stage 1-D LDWT

The first-stage 1-D LDWT architecture consists of the following units: signal arrangement, multiplication and accumulation cell (MAC), MUX, DeMUX, and FIFO register. Fig. 11 shows the corresponding block diagram.
The signal arrangement unit consists of three registers, R1, R2, and R3. The pixels are input to R1 first, and subsequently the content of R1 is transferred to R2 and then R3. In the meantime, R1 keeps reading the following pixels. The operation is like a shift register. As soon as R1, R2, and R3 receive signal data, MAC starts operating. Fig. 12 shows the signal arrangement unit, in which MAC operates at the clock with gray circles. Each first-stage 1-D DWT receives one input data at one internal clock cycle to generate an output with high-frequency and low-frequency coefficients alternately. The outputs of H and L are further fed into the second-stage
Fig. 10. Detailed operations of the second-stage 1-D DWT. (a) Operations for the H.F (HH and HL) part. (b) Operations for the L.F (LH and LL) part.
Fig. 13. Data flow. The FIFO latency is omitted here.
two functions: 1) storing the high-frequency coefficients for

the low-frequency coefficient calculation and 2) being used
as a signal buffer for MAC. MAC needs time to compute
the signal, and the output of MAC cannot directly feed to
Fig. 11. Architecture of the first-stage 1-D DWT. the output or the following operation may be incorrect due to
the synchronization problems. R5 acts as an output buffer for
MAC to prevent the error in the following operations. In the
5/3 integer lifting-based operations, MAC is used to find the
results of the high-frequency output, −(X1 + X3)/2 + X2, and
the low-frequency output, (X1+X3)/4+X2. To save hardware,
we can use shifters to implement the two multiplications, −1/2
and 1/4. Consequently, the MAC needs adders, complementer,
and shifters. Fig. 14 shows the MAC block diagram, in which
X1, X2, and X3 denote the inputs, “−” denotes the 2’s
complement converter, and “>>” denotes the right shifter.
B. Second-Stage 1-D LDWT

Fig. 12. Operation of the signal arrangement unit (e.g., IRAS signal in N1).
Similar to the first-stage 1-D DWT, the second-stage 1-D
DWT consists of the following units: signal arrangement,
MAC, and MUX, as shown in Fig. 15. Due to the parallel
1-D DWT. The second-stage 1-D DWT receives two inputs architecture, two outputs are generated concurrently from the
separately from H and L per clock cycle to generate two first-stage 1-D DWT, and the two outputs must be merged in
subband coefficients. Fig. 13 shows the detailed data flow of the second-stage 1-D DWT. The signal arrangement unit han-
Fig. 12 for one-level decomposition of a 4×4 block image. dles the signal merging and Fig. 16 shows the corresponding
For the low-frequency coefficients calculation, we need processing diagram. At the beginning, signals H0 and H1 are
two high-frequency coefficients and an original pixel. Internal from IN1 and IN2, and the two signals are stored in R3 and R4,
register R4 is used to store the original even pixel (N1) and respectively. At the next clock, H0 and H1 are moved to R1
internal register R9 is used to store the original odd pixel (N2). and R2, respectively, and concurrently new signals H3 and H4
We can simply shift the content of R3 to R4 after the MAC from IN1 and IN2 are stored to R3 and R4, respectively. The
operation. FIFO is used to store the high-frequency coefficients signal arrangement unit operates repeatedly to input signals
to calculate the low-frequency coefficients. Register R5 has for the second-stage 1-D DWT.
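The 5/3 MAC operations described for the first-stage 1-D DWT can be illustrated in software. The following is a minimal sketch, not the authors' hardware description: the function names, the mirrored boundary handling, and the use of a plain arithmetic right shift without a rounding offset are assumptions made for illustration, following the two expressions as printed above.

```python
def mac_high(x1, x2, x3):
    """High-frequency output -(X1 + X3)/2 + X2, with 1/2 realised as a right shift."""
    return x2 - ((x1 + x3) >> 1)


def mac_low(h1, x2, h3):
    """Low-frequency output (H1 + H3)/4 + X2, with 1/4 realised as a right shift.

    h1 and h3 are the two neighbouring high-frequency coefficients and
    x2 is the original even pixel.
    """
    return x2 + ((h1 + h3) >> 2)


def lift_53_1d(samples):
    """One-level 1-D 5/3 lifting of an even-length list of integers.

    Samples at the borders are mirrored; this edge rule is an assumption
    of the sketch and is not taken from the paper.
    """
    if len(samples) < 2 or len(samples) % 2:
        raise ValueError("expected an even number of samples")
    half = len(samples) // 2

    high = []
    for k in range(half):
        left = samples[2 * k]
        # Mirror the last even sample when there is no right neighbour.
        right = samples[2 * k + 2] if k + 1 < half else left
        high.append(mac_high(left, samples[2 * k + 1], right))

    low = []
    for k in range(half):
        # Mirror the first high-frequency coefficient at the left border.
        prev_high = high[k - 1] if k > 0 else high[0]
        low.append(mac_low(prev_high, samples[2 * k], high[k]))

    return low, high
```

For the ramp 1, 2, ..., 8 the sketch returns the low-frequency coefficients [1, 3, 5, 7] and the high-frequency coefficients [0, 0, 0, 1]; only the last high-frequency value is nonzero, and it comes from the mirrored border.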
Fig. 14. Block diagram of MAC.
Fig. 15. Block diagram of the second-stage 1-D LDWT.
Fig. 16. Signal merging process for the signal arrangement unit.
C. 2-D LDWT Architecture
In the proposed IRSA operation, IN1 and IN2 read signals of even row and odd row in a zigzag order, respectively, as shown in Fig. 17. Fig. 18 shows the complete architecture of the 2-D LDWT, in which four parts are involved: two sets of the first-stage 1-D DWT, two sets of the second-stage 1-D DWT, the control unit, and the MAC unit. The block diagram in Fig. 18 consists of two stages: the first-stage 1-D DWT and the second-stage 1-D DWT. This architecture needs only a small amount of TM.
Fig. 17. Input signal sequences. (a) IN1 read signal of even row in zigzag order. (b) IN2 read signal of odd row in zigzag order.
Fig. 18. System diagram of the 2-D LDWT.
Fig. 19. Processing procedures of 2-D dual-mode LDWTs under the same IRSA architecture.
According to (12) and (13), the proposed IRSA architecture can also be applied to the 9/7 mode LDWT. Fig. 19 illustrates the approach. According to Figs. 9 and 10, the original signals (denoted as black circles) for both the 5/3 and 9/7 mode LDWTs can be processed by the same IRSA for the first-stage 1-D DWT operation. The high-frequency signals (denoted as gray circles), the correlated low-frequency signals, and the results of the first stage are used to compute the second-stage 1-D DWT coefficients. Compared to the 9/7 mode LDWT computation, the 5/3 mode LDWT is much easier to compute, and the register arrangement shown in Figs. 11 and 15 is simple. To implement the 9/7 mode LDWT with the same system architecture as the 5/3 mode LDWT, the following modifications are required.
1) The control signals of the MUX in Figs. 11 and 15 must be modified by rearranging the registers of the MAC block to process the 9/7 parameters.
2) The wavelet coefficients of the dual-mode LDWT are different. The coefficients for the 5/3 mode LDWT are α = −1/2 and β = 1/4, but the coefficients are α = −1.586134142, β = −0.052980118, γ = +0.882911075, and δ = +0.443506852 for the 9/7 mode LDWT. For calculation simplicity, the integer approach proposed by Huang et al. [23] is employed. Similar to the multiplication implementation using shifters and adders in the 5/3 mode LDWT, the shifters approach proposed in [17] is employed to implement the 9/7 mode LDWT.
3) According to the characteristics of the 9/7 mode LDWT, the control unit in Fig. 18 must be modified accordingly.
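To make the idea of multiplier-free coefficient multiplication in item 2) concrete, the sketch below quantizes the lifting coefficients to fixed point and multiplies by them using only shifts and additions. It is a generic illustration under stated assumptions and does not reproduce the specific integer approach of [23] or the shifter decomposition of [17]: the precision `FRAC_BITS = 8` and the helper names `quantize` and `shift_add_multiply` are hypothetical.

```python
FRAC_BITS = 8  # hypothetical fixed-point precision chosen for this sketch

# 9/7 lifting coefficients as listed in item 2) above.
COEFFS_97 = {
    "alpha": -1.586134142,
    "beta": -0.052980118,
    "gamma": 0.882911075,
    "delta": 0.443506852,
}


def quantize(coeff, frac_bits=FRAC_BITS):
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return round(coeff * (1 << frac_bits))


def shift_add_multiply(x, q, frac_bits=FRAC_BITS):
    """Multiply integer x by the fixed-point constant q using shifts and adds only.

    Each set bit of |q| contributes one shifted copy of x; the sign is applied
    by negation, and the final right shift removes the fractional bits.
    """
    magnitude = abs(q)
    acc = 0
    shift = 0
    while magnitude:
        if magnitude & 1:
            acc += x << shift
        magnitude >>= 1
        shift += 1
    if q < 0:
        acc = -acc
    return acc >> frac_bits
```

With this precision, the 5/3 coefficients −1/2 and 1/4 quantize to single-bit constants, so each product reduces to one shift; the 9/7 coefficients need several shifted terms each, which is consistent with the statement above that the 5/3 mode is the easier of the two to compute.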
The multilevel DWT computation can be implemented in a similar manner by the high-performance one-level 2-D LDWT. For the multilevel computation, the architecture needs N²/4 off-chip memory. As illustrated in Fig. 20, the off-chip memory is used to temporarily store the LL subband coefficients for the next iteration of the computation. The second-level computation requires N/2 counters and N/2 FIFOs for the control unit. The third-level computation requires N/4 counters and N/4 FIFOs for the control unit. In general, N/2^(j−1) counters and N/2^(j−1) FIFOs are required for the jth level computation.
Fig. 20. Multilevel 2-D DWT architecture.
V. Experimental Results and Comparisons
A tradeoff exists between low TM and low complexity in the design of the VLSI 2-D dual-mode LDWT architecture. Tables II and III show the performance comparisons of the
TABLE II
Comparisons of One-Level 2-D Architectures for 5/3 LDWT
Architecture    TM (bytes)     Latency       Adders   Multipliers   CT
Ours            2N             (3/2)N + 3    8        0             (3/4)N² + (3/2)N + 7
[2]             3.5N           –             12       6             –
[3]             3.5N           2N + 5        8        4             N²/2 + N + 5
[4]             2.5N           3             6        4             N²
[5]             3N             –             5        0             N²/2 + N + 5
[6]             N²/4 + 5N      3             4        0             N²
[13]            N²             –             12       6             –
[18]            2N             –             8        0             N²/2 + N
[25]            3.5N           –             –        –             –
[26]            3.5N           –             –        6             10 + (4/3)N²[1 − (1/4)] + 2N[1 − (1/2)]
TM size is used to store frequency coefficients in the one-level 2-D DWT.
In a system, latency is often used to mean any delay or waiting time that increases real or perceived response time beyond the response time desired. Here, the 2-D DWT latency is the delay from the original image input to the first subband output.
In a system, CT represents the time used to compute an image of size N × N.
The image is assumed to be of size N × N.
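As a quick numerical check of the closed-form entries in the row for the proposed design, the sketch below evaluates the latency and CT expressions and converts the TM size to bytes. The function names are hypothetical, N is assumed to be a multiple of 4 so that the divisions are exact, and the 2-byte word is an assumption based on the 16-b registers stated in the paper.

```python
def proposed_latency(n):
    """Latency of the proposed design in clock cycles: (3/2)N + 3."""
    return 3 * n // 2 + 3


def proposed_ct(n):
    """CT of the proposed design in clock cycles: (3/4)N^2 + (3/2)N + 7."""
    return 3 * n * n // 4 + 3 * n // 2 + 7


def tm_bytes(n, mode, word_bytes=2):
    """TM size in bytes: 2N words for 5/3 mode, 4N words for 9/7 mode.

    word_bytes = 2 assumes one 16-b register per stored coefficient.
    """
    words_per_n = {"5/3": 2, "9/7": 4}[mode]
    return words_per_n * n * word_bytes


if __name__ == "__main__":
    n = 256
    print(proposed_latency(n), proposed_ct(n))
    print(tm_bytes(n, "5/3"), tm_bytes(n, "9/7"))
```

For N = 256 these expressions give 387 and 49 543 clock cycles and 1024 and 2048 bytes, which agree with the figures reported for the prototype chip in Table V.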
TABLE III
Comparisons of One-Level 2-D Architectures for 9/7 LDWT
Architecture    TM (bytes)        Latency       Adders   Multipliers   CT
Ours            4N                (3/2)N + 3    16       0             (3/4)N² + (3/2)N + 7
[3]             N²                –             8        4             4N²/3 + 2
[9]             12N               –             12       9             N²
[10]            N²/4 + LN + L     –             4L       4L            N²/2 ∼ (2/3)N
[14]            22N               –             36       36            N²
[17]            14N               –             16       12            –
[25]            5.5N              –             16       10            –
[26]            5.5N              –             8        6             22 + (4/3)N²[1 − (1/4)] + 6N[1 − (1/2)]
[27]            –                 –             32       20            –
[28]            N² + 4N + 4       –             16       16            2N²/3
TM size is used to store frequency coefficients in the one-level 2-D DWT.
In a system, latency is often used to mean any delay or waiting time that increases real or perceived response time beyond the response time desired. Here, the 2-D DWT latency is the delay from the original image input to the first subband output.
In a system, CT represents the time used to compute an image of size N × N.
L: the filter length.
TABLE IV
Hardware Cost and Performance Comparisons of Various 9/7 2-D LDWT High-Throughput Architectures
Method                    Multipliers   Adders   TM (bytes)   CP           Throughput (input/output)
Huang et al. [23]         10            16       5.5N         1Tm + 5Ta    2
Huang et al. [25]         10            16       5.5N         4Tm + 8Ta    2
Mohanty and Meher [30]    27            48       10N          1Tm + 2Ta    4
Ours                      0             16       4N           2Tm + 4Ta    2
TM size is used to store frequency coefficients in the one-level 2-D DWT.
In a system, CP represents the time used to compute an image of size N × N.
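The CP expressions in Table IV can be turned into delays in nanoseconds with the 16-b multiplier and adder delays quoted in the text (6.79 ns and 3.01 ns), and the TM sizes into gate counts with the quoted 5252 gates per 1N of static RAM. The sketch below is only a worked calculation; the dictionary layout and function names are hypothetical.

```python
T_MULT_NS = 6.79        # delay of a 16-b multiplier, as quoted in the text
T_ADD_NS = 3.01         # delay of a 16-b adder, as quoted in the text
TM_GATES_PER_N = 5252   # gate count of one 256 x 16 static RAM (1N)

# (number of multiplier delays, number of adder delays) on the critical path
CRITICAL_PATHS = {
    "Huang et al. [23]": (1, 5),
    "Huang et al. [25]": (4, 8),
    "Mohanty and Meher [30]": (1, 2),
    "Ours": (2, 4),
}


def cp_delay_ns(n_mult, n_add):
    """Critical-path delay n_mult * Tm + n_add * Ta, rounded to 0.01 ns."""
    return round(n_mult * T_MULT_NS + n_add * T_ADD_NS, 2)


def tm_gate_count(tm_in_units_of_n):
    """Gate count of a TM given as a whole-number multiple of N."""
    return tm_in_units_of_n * TM_GATES_PER_N


if __name__ == "__main__":
    for method, (n_mult, n_add) in CRITICAL_PATHS.items():
        print(method, cp_delay_ns(n_mult, n_add), "ns")
```

The four delays evaluate to 21.84, 51.24, 12.81, and 25.62 ns, matching the values given in the discussion of Table IV, and a TM of 10N corresponds to 52 520 gates.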
proposed architecture and former architectures in the literature. Comparison results indicate that the proposed VLSI architecture outperforms the previous works in terms of TM size, in particular requiring around 50% less memory than the JPEG2000 standard [13]. Moreover, the 2-D LDWT is frame based, and its implementation bottleneck is the huge TM. In this paper, fewer memory units are needed with the proposed architecture, and the latency is fixed at (3/2)N + 3 clock cycles. Chen and Wu [4] proposed a folded and pipelined architecture to compute the 2-D 5/3 LDWT, and the TM is of size 2.5N for the N×N 2-D DWT. This lifting architecture for vertical filtering with two adders and one multiplier is divided into two parts, and each part has one adder and one multiplier. Because both parts are activated in different cycles, the same adder and multiplier can be shared to increase the hardware utilization and reduce the latency. As mentioned above, according to the characteristics of the signal flow operation, it may increase the operation complexity. To reduce the TM requirement and shorten the critical path (CP), the IRSA is proposed to change the signal flow from row-wise to mixed row- and column-wise. The IRSA is parallel and pipelined, which can improve the processing speed significantly. In addition, the TM of an image of size N × N simply needs 2N or 4N for 5/3 or 9/7 schemes, respectively, with the new approach.
In the proposed high-throughput 2-D LDWT architecture, we have considered the tradeoffs between the TM and the CP in VLSI implementation. Table IV shows the hardware cost and typical performance comparisons of various 9/7 high-throughput 2-D DWTs in terms of the number of multipliers and adders, TM size, CP, and throughput. The length of the input signal is 8 b. The lengths of the adder, multiplier, and register are 16 b with 11 integer bits and 5 fractional bits [33]. For the proposed 2-D LDWT architecture, the TM is 4N and the CP is 2Tm + 4Ta (Tm and Ta denote the delay time of the multiplier and adder, respectively). The column
processor needs eight registers for storing the original pixels and coefficients of X(i, j), H(i, j), and L(i, j). Each of the row and column processors needs zero multipliers and eight adders.
The delays of a 16-b multiplier and adder are 6.79 and 3.01 ns, respectively. The gate count of a 16-b multiplier is 762 [34]. The gate count of the 256 × 16 static RAM (1N = 256 × 16) is 5252 [32]. According to Table IV, the flipping architecture [23] and the generic RAM-based architecture [25] require identical TM of 5.5N, and CPs of 1Tm + 5Ta (21.84 ns) and 4Tm + 8Ta (51.24 ns), respectively. The architecture proposed by Mohanty and Meher [30] requires TM of size 10N (52 520 gates), while the proposed design simply needs 4N. However, the CP of [30] is 1Tm + 2Ta (12.81 ns), which is much faster than the proposed architecture of 2Tm + 4Ta (25.62 ns). Yet, when designing the VLSI architecture of the 2-D dual-mode LDWT, we must consider the tradeoff between TM and computation complexity. Herein, 27 multipliers are used in [30], while simply zero multipliers are required with the proposed scheme. Our approach can thus save hardware cost significantly.
According to Table IV, the proposed 2-D LDWT architecture outperforms the former systems (high-throughput architectures) [23], [25], [30] in terms of TM size and CP. This 2-D DWT architecture is lifting based such that the reduction of the TM is significant. In addition, parallel and pipelining schemes are adopted to reduce the internal memory requirement.
A 256 × 256 2-D LDWT was designed and simulated with VerilogHDL and further synthesized by the Synopsys design compiler with TSMC 0.18-μm 1P6M CMOS standard process technology to verify the performance of the proposed hardware architecture. The detailed specifications of the 256 × 256 2-D LDWT are listed in Table V. There have been increasing concerns regarding the efficacy of wireload models as deep-submicrometer (0.18-μm process technology in this paper) parasitic interconnections. The TSMC 0.18− wl10 script is used for the wireload model to simulate the performance. This approach can also be extended to acquire power-efficient 2-D LDWT architectures.
TABLE V
Design Specification of the Proposed 2-D LDWT
Chip specification    N = 256, tile size = 256 × 256
Gate count            29 196 gates
Power supply          1.8 V
Technology            TSMC 0.18-μm 1P6M (CMOS)
Wireload model        TSMC 0.18− wl10
Latency               (3/2)N + 3 = 387 clock cycles
TM size               2-D 5/3 DWT: 1024 bytes
                      2-D 9/7 DWT: 2048 bytes
Power estimation      15.47 mW
CT                    (3/4)N² + (3/2)N + 7 = 49 543 clock cycles
Maximum clock rate    100 MHz
VI. Conclusion
This paper presented a new architecture to reduce the TM requirement of the 2-D LDWT. The proposed architecture had a mixed row- and column-wise signal flow, rather than purely row-wise as in traditional 2-D LDWT. Moreover, a new approach, namely, IRSA, was proposed to reduce the TM requirement for the 2-D dual-mode LDWT. The proposed 2-D architectures were more efficient than former architectures in trading off low TM requirement, output latency, control complexity, and regular memory access sequence. The proposed architecture reduced the TM significantly to a memory size of only 2N or 4N (5/3 or 9/7 mode), and reduced the latency to (3/2)N + 3. Due to the regularity and simplicity of the IRSA LDWT architecture, a dual mode (5/3 and 9/7) 256 × 256 2-D LDWT prototyping chip was designed by TSMC 0.18-μm 1P6M standard CMOS technology. The 5/3 and 9/7 filters with different lifting steps were realized by cascading the four modules (split, predict, update, and scaling phases). The prototyping chip took 29 196 gate counts and could operate at 100 MHz. The method was applicable to any DWT-based signal compression standard, such as JPEG2000, motion-JPEG2000, MPEG-4 still texture object decoding, and wavelet-based SVC.
References
[1] S. G. Mallat, “A theory for multi-resolution signal decomposition: The wavelet representation,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.
[2] C. Diou, L. Torres, and M. Robert, “An embedded core for the 2-D wavelet transform,” in Proc. IEEE Emerg. Technol. Factory Automat., Oct. 2001, pp. 179–186.
[3] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol. 50, no. 4, pp. 966–977, Apr. 2002.
[4] S.-C. Chen and C.-C. Wu, “An architecture of 2-D 3-level lifting-based discrete wavelet transform,” in Proc. VLSI Des./CAD Symp., Aug. 2002, pp. 351–354.
[5] P.-Y. Chen, “VLSI implementation of lifting discrete wavelet transform using the 5/3 filter,” IEICE Trans. Inform. Syst., vol. E85-D, no. 12, pp. 1893–1897, Dec. 2002.
[6] J.-S. Chiang and C.-H. Hsia, “An efficient VLSI architecture for 2-D DWT using lifting scheme,” in Proc. IEEE Int. Conf. Syst. Signals, Apr. 2005, pp. 528–531.
[7] K. C. B. Tan and T. Arslan, “Low power embedded extension algorithm for the lifting based discrete wavelet transform in JPEG2000,” IET Electron. Lett., vol. 37, no. 22, pp. 1328–1330, Oct. 2002.
[8] W. Jiang and A. Ortega, “Lifting factorization-based discrete wavelet transform based architecture design,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 5, pp. 651–657, May 2001.
[9] G.-C. Jung and S.-M. Park, “VLSI implement of lifting wavelet transform of JPEG2000 with efficient RPA (recursive pyramid algorithm) realization,” IEICE Trans. Fundamentals, vol. E88-A, no. 12, pp. 3508–3515, Dec. 2005.
[10] P.-Y. Chen, “VLSI implementation for one-dimensional multilevel lifting-based wavelet transform,” IEEE Trans. Comput., vol. 53, no. 4, pp. 386–398, Apr. 2004.
[11] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform,” IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1575–1586, Apr. 2005.
[12] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” J. Fourier Anal. Applicat., vol. 4, no. 3, pp. 247–269, 1998.
[13] Information Technology, JPEG 2000 Part 1. Final Committee Draft Version 1.0, document ISO/IEC 15444-1 JTC1/SC29 WG1, 2000.
[14] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI architecture for the discrete wavelet transform,” IEEE Trans. Circuits Syst. II, vol. 42, no. 5, pp. 305–316, May 1995.
[15] JPEG 2000 Verification Model 9.0, document ISO/IEC JTC1/SC29/WG1 Wgln 1684, 2000.
[16] S. G. Mallat, “Multi-frequency channel decompositions of images and wavelet models,” IEEE Trans. Acoust., Speech Signal Process., vol. ASSP-37, no. 12, pp. 2091–2110, Dec. 1989.
[17] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Efficient VLSI architecture of lifting-based discrete wavelet transform by systematic design method,” in Proc. IEEE Int. Symp. Circuits Syst., May 2002, pp. 26–29.
[18] K. Mei, N. Zheng, and H. van de Wetering, “High-speed and memory-efficient VLSI design of 2-D DWT for JPEG2000 applications,” IET Electron. Lett., vol. 42, no. 16, pp. 907–908, Aug. 2006.
[19] Information Technology, Motion JPEG2000, document ISO/IEC 15444-3, 2002.
[20] Information Technology, Coding of Moving Picture and Audio, document ISO/IEC JTC1/SC29 WG11, 2001.
[21] P. Chen and J. W. Woods, “Bidirectional MC-EZBC with lifting implementation,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 10, pp. 1183–1194, Oct. 2004.
[22] H. Varshney, M. Hasan, and S. Jain, “Energy efficient novel architectures for the lifting-based discrete wavelet transform,” IET Image Process., vol. 1, no. 3, pp. 305–310, Sep. 2007.
[23] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform,” IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1080–1089, Apr. 2004.
[24] K. C. B. Tan and T. Arslan, “Shift-accumulator ALU centric JPEG 2000 5/3 lifting based discrete wavelet transform architecture,” in Proc. IEEE Int. Symp. Circuits Syst., May 2003, pp. V161–V164.
[25] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based method,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 910–919, Jul. 2005.
[26] B.-F. Wu and C.-F. Lin, “A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1615–1628, Dec. 2005.
[27] X. Lan, N. Zheng, and Y. Liu, “Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Consumer Electron., vol. 51, no. 2, pp. 379–385, May 2005.
[28] P.-C. Wu and L.-G. Chen, “An efficient architecture for two-dimensional discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536–545, Apr. 2001.
[29] W.-M. Li, C.-H. Hsia, and J.-S. Chiang, “Memory-efficient architecture of 2-D dual-mode lifting scheme discrete wavelet transform for motion-JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. 750–753.
[30] B.-K. Mohanty and P. K. Meher, “Memory efficient modular VLSI architecture for highthroughput and low-latency implementation of multilevel lifting 2-D DWT,” IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2072–2084, May 2011.
[31] C. Zhang, C. Wang, and M. O. Ahmed, “A pipeline VLSI architecture for fast computation of the 2-D discrete wavelet transform,” IEEE Trans. Circuit Syst. I, vol. 59, no. 8, pp. 1775–1785, Aug. 2012.
[32] M. Maamoun, M. Neggazi, A. Meraghni, and D. Berkani, “VLSI design of 2-D discrete wavelet transform for area-efficient and high-speed image computing,” World Acad. Sci., Eng. Technol., vol. 35, pp. 538–543, 2008.
[33] B.-D. Choi, K.-S. Choi, M.-C. Hwang, J.-K. Cho, and S.-J. Ko, “Real-time DSP implementation of motion-JPEG2000 using overlapped block transferring and parallel-pass methods,” Real-Time Imag., vol. 10, no. 5, pp. 277–284, Oct. 2004.
[34] A. Bellaouar and M. I. Elmasry, Low-Power Digital VLSI Design Circuits and Systems. Norwell, MA: Kluwer, 1995.
[35] H. Muta, M. Doi, H. Nakano, and Y. Mori, “Multilevel parallelization on the cell/B.E. for a motion JPEG 2000 encoding server,” in Proc. ACM Workshops Multimedia, Sep. 2007, pp. 942–951.

Chih-Hsien Hsia (M’10) was born in Taipei, Taiwan, in 1979. He received the B.S. degree in computer science and information engineering from the Taipei Chengshih University of Science and Technology, Taipei, in 2003, and the M.S. degree in electrical engineering and the Ph.D. degree from Tamkang University, New Taipei, Taiwan, in 2005 and 2010, respectively.
From July 2007 to September 2007, he was a Visiting Scholar with Iowa State University, Ames. He is currently a Post-Doctoral Research Fellow with the Multimedia Signal Processing Laboratory, Graduate Institute of Electrical Engineering, National Taiwan University of Science and Technology, Taipei. He is an Adjunct Associate Professor with the Department of Electrical Engineering, Tamkang University. His current research interests include digital signal processing integrated circuit design, image and video processing, multimedia compression system design, multiresolution signal processing algorithms, and computer and robot vision processing.
Dr. Hsia is a member of the Phi Tau Phi Scholastic Honor Society.

Jen-Shiun Chiang (M’90) received the B.S. degree in electronics engineering from Tamkang University, New Taipei, Taiwan, in 1983, the M.S. degree in electrical engineering from the University of Idaho, Moscow, in 1988, and the Ph.D. degree in electrical engineering from Texas A&M University, College Station, in 1992.
In 1992, as an Associate Professor, he joined the faculty of the Department of Electrical Engineering, Tamkang University, where he is currently a Professor. His current research interests include digital signal processing for very large-scale integration architectures, architecture for image data compressing, system-on-chip design, analog-to-digital data conversion, and low-power circuit design.

Jing-Ming Guo (M’06–SM’10) was born in Kaohsiung, Taiwan, on November 19, 1972. He received the B.S.E.E. and M.S.E.E. degrees from National Central University, Taoyuan, Taiwan, in 1995 and 1997, respectively, and the Ph.D. degree from the Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, in 2004.
From 1998 to 1999, he was an Information Technique Officer in the Chinese Army. He is currently a Professor with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei. His current research interests include multimedia signal processing, multimedia security, computer vision, and digital halftoning.
Dr. Guo was a recipient of the Outstanding Youth Electrical Engineer Award from the Chinese Institute of Electrical Engineering in 2011, the Outstanding Young Investigator Award from the Institute of System Engineering in 2011, the Best Paper Award from the IEEE International Conference on System Science and Engineering in 2011, the Excellence in Teaching Award in 2009, the Research Excellence Award in 2008, the Acer Dragon Thesis Award in 2005, the Outstanding Paper Award from IPPR Computer Vision, Graphics, and Image Processing in 2005 and 2006, and the Outstanding Faculty Award in 2002 and 2003. From 2003 to 2004, he was granted the National Science Council Scholarship for advanced research from the Department of Electrical and Computer Engineering, University of California, Santa Barbara.
