Abstract—Memory requirements (for storing intermediate signals) and critical path are essential issues for 2-D (or multidimensional) transforms. This paper presents new algorithms and hardware architectures to address the above issues in 2-D dual-mode (supporting 5/3 lossless and 9/7 lossy coding) lifting-based discrete wavelet transform (LDWT). The proposed 2-D dual-mode LDWT architecture has the merits of low transpose memory (TM), low latency, and regular signal flow, making it suitable for very large-scale integration implementation. The TM requirements of the N×N 2-D 5/3 mode LDWT and 2-D 9/7 mode LDWT are 2N and 4N, respectively. Comparison results indicate that the proposed hardware architecture has a lower TM size requirement than previous lifting-based architectures. As a result, it can be applied to real-time visual operations such as JPEG2000, motion-JPEG2000, MPEG-4 still texture object decoding, and wavelet-based scalable video coding applications.

Index Terms—2-D dual-mode lifting-based discrete wavelet transform, JPEG2000, low transpose memory.

Manuscript received January 31, 2012; revised April 24, 2012 and June 11, 2012; accepted June 28, 2012. Date of publication August 6, 2012; date of current version April 1, 2013. This paper was recommended by Associate Editor G. Lafruit.
C.-H. Hsia and J.-M. Guo are with the Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan (e-mail: chhsia@ee.tku.edu.tw; jmguo@seed.net.tw).
J.-S. Chiang is with the Department of Electrical Engineering, Tamkang University, New Taipei 25137, Taiwan (e-mail: chiang@ee.tku.edu.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2211953
1051-8215/$31.00 © 2012 IEEE

I. Introduction

THE DISCRETE wavelet transform (DWT) has been adopted for a wide range of applications, including speech analysis, numerical analysis, signal analysis, image coding, pattern recognition, computer vision, and biometrics. It can be considered as a multiresolution decomposition of a signal into several components with different frequency bands. Performing 2-D DWT requires many computations and a large block of transpose memory (TM) with long latency time. Yet achieving real-time processing requires low memory requirements and low computational complexity to improve the hardware utilization efficiency. Normally, implementations of 2-D DWT are classified as convolution-based operation [1], [14] and lifting-based operation [6]. Since the convolution-based implementations of DWT have high computational complexity and large memory requirements, lifting-based DWT has been presented to overcome these drawbacks [12]. The lifting-based scheme can provide low-complexity solutions for image or video compression applications, such as JPEG2000 [13], motion-JPEG2000 [19], MPEG-4 still image coding [20], and MC-EZBC [21]. Yet, the real-time 2-D DWT for multimedia applications is still difficult to achieve, and thus, efficient transformation schemes for real-time applications are in high demand. Among the variety of DWT algorithms, lifting-based discrete wavelet transform (LDWT) is a new approach for constructing biorthogonal wavelet transforms, which can also efficiently calculate classical wavelet transforms [2]–[11], [13], [17], [18], [22]–[32]. Factoring the classical wavelet filter into lifting steps can reduce the computational complexity of the corresponding DWT by up to 50% [12]. The lifting steps are easy to implement, which is different from the direct finite impulse response (FIR) implementations of Mallat’s algorithm [12]. Diou et al. [2] presented an architecture that performs the LDWT with a 5/3 filter using an interleaving technique. Andra et al. [3] proposed a block-based simple four-processor architecture that computes several stages of the DWT simultaneously.

Chen and Wu [4] proposed a folded and pipelined architecture for the 2-D LDWT implementation, with memory of size 2.5N for an N×N 2-D DWT. This lifting architecture for vertical filtering is divided into two parts, each consisting of one adder and one multiplier. Since both parts are activated in different cycles, they can share the same adder and multiplier to increase the hardware utilization and reduce the latency. However, this architecture also has high complexity due to the characteristics of the signal flow. Chen [5] proposed a flexible and folded architecture for three-level 1-D LDWT to increase the hardware utilization. Chiang and Hsia [6] proposed a 2-D DWT folded architecture to improve the hardware utilization. Jiang and Ortega [8] presented a parallel processing architecture that models the DWT computation as a finite state machine and efficiently computes the wavelet coefficients near the boundary of each segment of the input signal. Jung and Park [9] presented an efficient very large-scale integration (VLSI) architecture of the dual-mode LDWT used for JPEG2000. Chen [10] used a 1-D folded architecture to improve the hardware utilization of the 5/3 and 9/7 filters. The recursive architecture is a general scheme to implement any wavelet filter that is decomposed into lifting steps with less hardware complexity. In [14], an average computing time (CT) of N² can be achieved for all DWT levels. However, it uses many multipliers and adders. The architecture of [17] implements 2-D DWT
672 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 4, APRIL 2013
with only TM by using the recursive pyramid algorithm. Varshney et al. [22] presented energy-efficient single-processor and fully pipelined architectures for the 2-D 5/3 lifting-based JPEG2000. The single processor performs both row-wise and column-wise processing simultaneously to achieve the 2-D transform with 100% hardware utilization. Tan and Arslan [24] presented a shift-accumulator arithmetic logic unit architecture for the 2-D lifting-based JPEG2000 5/3 DWT. This architecture has an efficient memory organization, which uses a small amount of embedded memory for processing and buffering. Those architectures achieve multilevel decomposition using an interleaving scheme that reduces the size of memory and the number of memory accesses, yet slow throughput rates and inefficient hardware utilization are the key issues. Huang et al. [25] proposed a generic random access memory (RAM)-based architecture with high efficiency and feasibility for the 2-D DWT. Wu and Lin [26] presented a high-performance and low-memory architecture to implement the 2-D dual-mode LDWT. The pipelined signal path of their architecture is regular and practical. Lan et al. [27] proposed a scheme that can process two lines simultaneously by processing two pixels in a clock period. Wu and Chen [28] proposed an efficient VLSI architecture for the direct 2-D LDWT, in which the poly-phase decomposition and coefficient folding are adopted to improve the hardware utilization. Despite the above improvements to the existing architectures, further refinements in the algorithm and architecture are still needed. The parameters of the 9/7 filter are irrational; thus, it is very difficult to implement it in a simple manner. The approach proposed in [29] implemented the 9/7 filter with integer techniques using shifters and adders instead of multipliers for motion-JPEG2000. The integer wavelet transform can achieve nearly the same performance but requires much less hardware cost. Mohanty and Meher [30] proposed a scalable array architecture based on FPGA for the 2-D 9/7 LDWT (only 9/7 filter). This is a massively parallel processing architecture, which can scalably process the block size according to the size of the input image and generally involves TM of size 10N. The major advantage of this architecture is the saving of frame buffer. However, it has some inherent problems associated with the block size for an input image. Since image quality will keep increasing in the future, the block size increases accordingly; thus, the full parallel architecture may induce very high hardware requirements, such as adders, multipliers, registers, and multiplexers (MUX)/demultiplexers (DeMUX). Zhang et al. [31] presented a pipelined architecture for the 2-D LDWT using a nonseparable approach to avoid the transposition memory and frame buffer. However, nonseparable 2-D DWT leads to a higher complexity control component, since it needs more computation time than the separable 2-D LDWT under the same throughput rate. In addition, it is difficult to design the desired 2-D LDWT. Thus, several VLSI architectures have been proposed for efficient implementation of the 2-D LDWT [2]–[15], [17], [18], [22]–[32]. Yet, these hardware architectures need large TM components or arithmetic components. A low TM requirement is the priority concern in spatial-frequency domain implementation. In general, raster scan signal flow operations are popular in N×N 2-D DWT, and the memory requirement ranges from 2N to N² for the 2-D 5/3 and 9/7 modes LDWT [2]–[6], [9], [13], [14], [17], [18], [25]–[32]. To reduce the TM, the memory access must be redirected. This paper presents a new approach, namely, the interlaced read scan algorithm (IRSA), which changes the signal reading order from row-wise only to mixed row- and column-wise, and thus reduces the TM. Given an image of size N×N, the internal memory sizes of the proposed 2-D LDWT for 5/3 and 9/7 filters are simply 2N and 4N, respectively. The proposed 2-D LDWT is based on parallel and pipelined schemes to increase the operation speed. For hardware implementation, it can replace the multipliers with shifters and adders to yield high hardware utilization. Consequently, this 2-D LDWT has the characteristics of high hardware utilization, low memory requirement, and regular signal flow. In this paper, the 256×256 2-D dual-mode LDWT was designed and simulated by VerilogHDL, and further synthesized by the Synopsys design compiler with TSMC 0.18-μm 1P6M CMOS process technology.

II. Lifting-Based DWT

DWT performs multiresolution decomposition of the input signals [1]. The original signals are first decomposed into two subspaces, called the low-frequency (low-pass) and high-frequency (high-pass) subbands. The classical DWT implements the decomposition (analysis) of a signal by a low-pass digital filter H and a high-pass digital filter G. Both of the digital filters are derived using the scaling function and the corresponding wavelets. The system downsamples the signal to decimate half of the filtered results in decomposition processing. The z-transfer functions of H(z) and G(z), based on the four-tap and nonrecursive FIR filters with length L, are

$$H(z) = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3} \tag{1}$$

$$G(z) = g_0 + g_1 z^{-1} + g_2 z^{-2} + g_3 z^{-3}. \tag{2}$$

The reconstruction (synthesis) process is implemented using an upsampling process. Mallat’s tree algorithm or pyramid algorithm [1] can be used to find the multiresolution decomposition DWT. The decomposition DWT coefficients at each resolution level can be calculated as

$$X_H^{j}(n) = \sum_{i=0}^{k-1} G(z)\, X^{j-1} H(2n - i) \tag{3}$$

$$X_L^{j}(n) = \sum_{i=0}^{k-1} H(z)\, X^{j-1} G(2n - i) \tag{4}$$

where j denotes the current resolution level, k denotes the number of filter taps, $X_H^{j}(n)$ denotes the nth high-pass DWT coefficient at the jth level, and $X_L^{j}(n)$ denotes the nth low-pass DWT coefficient at the jth level.

The image is first analyzed horizontally to generate two subimages. The information is then sent into the second round 1-D DWT to perform the vertical analysis to generate four subbands, and each with a quarter of size of the original
HSIA et al.: MEMORY-EFFICIENT HARDWARE ARCHITECTURE 673
TABLE I
Frame Rate of the Conventional 2-D LDWT (Image Is of Size 640×480)
Fig. 3. 5/3 LDWT algorithm.
profiling tool [35]. It is clear that the 2-D DWT requires the
second highest computation time in the JPEG2000 system.
Table I shows the frame rates of the JPEG2000 encoded using
various levels of conventional 2-D LDWT. The experimental
environment is set using an AMD 1.8-GHz CPU, 1-GB RAM,
Microsoft Windows XP SP3, and Borland C++ Builder (BCB)
6.0. The frame rate is around 5 frames per second (f/s).
Clearly, the 2-D LDWT implementation must be revised
and enhanced to meet the real-time requirement.
Fig. 6. System block diagram of the proposed 2-D DWT. (a) 2-D dual-mode LDWT. (b) Block diagram of the proposed system architecture.
the row in gray on the right is the right boundary extension row. The embedded signal extension algorithm [7] can be used to compute the boundary of the image. The signal extension is simply required at the beginning and at the end of the process. For JPEG2000, the standard data extension should be symmetrical. The left half of Fig. 7 shows the first-stage 1-D DWT operations. The right half of Fig. 7 shows the second-stage 1-D DWT operations for yielding the four subband coefficients, HH, HL, LH, and LL. In the first-stage 1-D DWT, three pixels are used to calculate the 1-D high-frequency coefficient. For example, X(0, 0), X(1, 0), and X(2, 0) are used to yield the high-frequency coefficient H(0, 0): H(0, 0) = −[X(0, 0) + X(2, 0)]/2 + X(1, 0). To calculate the next high-frequency coefficient H(1, 0), pixels X(2, 0), X(3, 0), and X(4, 0) are employed. The first pixel X(2, 0), which is called the overlapped pixel, is used to calculate both H(0, 0) and H(1, 0). The low-frequency coefficient is calculated using two consecutive high-frequency coefficients and the overlapped pixel. For example, H(0, 0) and H(1, 0) along with X(2, 0) are used to calculate the low-frequency coefficient L(1, 0) = [H(0, 0) + H(1, 0)]/4 + X(2, 0). The calculated high-frequency coefficients, H(i, j), and the low-frequency coefficients, L(i, j), are then used in the second-stage 1-D DWT to calculate the four subband coefficients, HH, HL, LH, and LL.

In the second-stage 1-D DWT of Fig. 7, the first HH coefficient, HH(0, 0), is obtained using H(0, 2), H(0, 1), and H(0, 0), in which HH(0, 0) = −[H(0, 0) + H(0, 2)]/2 + H(0, 1). The other HH coefficients can be computed likewise using three consecutive H(i, j) signals in a column. For the HH coefficients of two consecutive columns, an additional overlapped H(i, j) signal is involved. For example, H(0, 2) is the overlapped signal for computing HH(0, 0) and HH(0, 1). Computing an HL coefficient requires the HH coefficients of two consecutive columns and an overlapped H(i, j) signal. For example, HL(0, 1) is computed from HH(0, 0), HH(0, 1), and H(0, 2): HL(0, 1) = [HH(0, 0) + HH(0, 1)]/4 + H(0, 2). The LH coefficients are computed from the L(i, j) signal and each LH coefficient is obtained from three L(i, j) signals. For example, LH(0, 0) is computed from L(0, 0), L(0, 1), and L(0, 2): LH(0, 0) = −[L(0, 0) + L(0, 2)]/2 + L(0, 1). For the LH coefficients of two consecutive columns, an additional overlapped L(i, j) signal is involved. For example, L(0, 2) is the overlapped signal for computing LH(0, 0) and LH(0, 1). To compute LL coefficients, the LH coefficients of two consecutive columns and an overlapped L(i, j) signal are required. For example, LL(0, 1) is computed from LH(0, 0), LH(0, 1), and L(0, 2): LL(0, 1) = [LH(0, 0) + LH(0, 1)]/4 + L(0, 2).

Fig. 7. Example of 2-D 5/3 mode LDWT operations. X(i, j): original image, i = 0 to 5 and j = 0 to 5. H(i, j): high-frequency wavelet coefficient of the 1-D LDWT. L(i, j): low-frequency wavelet coefficient of the 1-D LDWT. HH(i, j): high–high frequency wavelet coefficient of the 2-D LDWT. HL(i, j): high–low frequency wavelet coefficient of the 2-D LDWT. LH(i, j): low–high frequency wavelet coefficient of the 2-D LDWT. LL(i, j): low–low frequency wavelet coefficient of the 2-D LDWT.

From the descriptions of the operations of the 2-D 5/3 mode LDWT, we find that each 1-D high-frequency coefficient, H(i, j), is calculated from three image signals, and one of the image signals is overlapped with the previous H(i, j). The 1-D low-frequency coefficient, L(i, j), is calculated from the two consecutive rows in H(i, j) and an overlapped pixel. The HH, HL, LH, and LL coefficients are computed from H(i, j) and L(i, j). If we can change the scanning order of the first-stage 1-D LDWT and the output order of the second-stage 1-D LDWT, during the 2-D LDWT operation we simply need to store the H(i, j) to the TM [first-in first-out (FIFO) of size] and the overlapped pixels to the internal memory (R4 + R9 of size N). For an image of size N×N, the size of the TM block can be reduced to 2N, as shown in Fig. 8. The proposed IRSA is based on this idea, which can significantly reduce the requirement of the TM. The block diagram of the IRSA is shown in Fig. 8, in which the numbers on the top and left represent the coordinate indexes of a 2-D image. To increase the operation speed, the IRSA scans two pixels in consecutive rows at a time. IN1 (initially at X(0, 0)) and IN2 (initially at X(0, 1)) are the scanning inputs at the beginning. At the first clock, the system scans two pixels, X(0, 0) and X(0, 1), from IN1 and IN2, respectively. At the second clock, IN1 and IN2 read pixels X(1, 0) and X(1, 1), respectively. At clock 3, IN1 and IN2 read pixels X(2, 0) and X(2, 1), respectively. As IN1 and IN2 have each read three pixels, the DWT processor computes the two 1-D high-frequency coefficients, H(0, 0) and H(0, 1), which are then stored in the TM for the subsequent computation of the low-frequency coefficients. Pixels X(2, 0) and X(2, 1) are stored in the internal memory for the subsequent computation of the 1-D high-frequency coefficients.

At clock 4, the DWT processor scans pixels in rows 2 and 3, and IN1 and IN2 read pixels X(0, 2) and X(0, 3), respectively. At clock 5, IN1 and IN2 read pixels X(1, 2) and X(1, 3), respectively. At clock 6, IN1 and IN2 read pixels X(2, 2)
Fig. 10. Detailed operations of the second-stage 1-D DWT. (a) Operations for the H.F (HH and HL) part. (b) Operations for the L.F (LH and LL) part.
Fig. 17. Input signal sequences. (a) IN1 read signal of even row in zigzag
order. (b) IN2 read signal of odd row in zigzag order.
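
The 5/3 lifting relations quoted in the walkthrough of Fig. 7, H = −[X(0) + X(2)]/2 + X(1) and L = [H(0) + H(1)]/4 + X(2), can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the authors' hardware design: the function names `lifting_53_1d` and `lifting_53_2d` are hypothetical, the arithmetic uses floating point rather than shifters and adders, the boundary is handled by a simple mirror extension (the paper only states that the extension should be symmetrical), the input is assumed to have even dimensions, and the first stage is assumed to run along each inner list of the image.

```python
def lifting_53_1d(x):
    """One level of 1-D 5/3 lifting on an even-length sequence (illustrative)."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expected an even-length input with at least 2 samples")
    half = n // 2

    def sample(i):
        # Assumed mirror extension at the right boundary: X(n) = X(n - 2).
        return x[i] if i < n else x[2 * (n - 1) - i]

    # High-frequency step: H(k) = -[X(2k) + X(2k+2)]/2 + X(2k+1)
    high = [-(sample(2 * k) + sample(2 * k + 2)) / 2 + x[2 * k + 1]
            for k in range(half)]

    # Low-frequency step: L(k) = [H(k-1) + H(k)]/4 + X(2k),
    # with H(-1) mirrored to H(0) at the left boundary (assumption).
    low = [(high[max(k - 1, 0)] + high[k]) / 4 + x[2 * k]
           for k in range(half)]
    return low, high


def lifting_53_2d(image):
    """Separable one-level 2-D transform: first stage per line, second stage across lines."""
    first = [lifting_53_1d(list(line)) for line in image]
    low_part = [pair[0] for pair in first]    # L(i, j)
    high_part = [pair[1] for pair in first]   # H(i, j)

    def second_stage(block):
        # Transpose, transform each transposed line, then transpose back.
        cols = [lifting_53_1d(list(col)) for col in zip(*block)]
        low = [list(row) for row in zip(*(c[0] for c in cols))]
        high = [list(row) for row in zip(*(c[1] for c in cols))]
        return low, high

    ll, lh = second_stage(low_part)    # low and high bands derived from L
    hl, hh = second_stage(high_part)   # low and high bands derived from H
    return ll, lh, hl, hh
```

For a constant image the three detail subbands come out as zero and LL reproduces the constant, which is a quick sanity check on the signs of the two lifting steps. The sketch processes whole lines at once, so it says nothing about the IRSA read order or the TM size; it only reproduces the arithmetic.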
Fig. 15. Block diagram of the second-stage 1-D LDWT.

Fig. 16. Signal merging process for the signal arrangement unit.

C. 2-D LDWT Architecture

In the proposed IRSA operation, IN1 and IN2 read signals of even row and odd row in a zigzag order, respectively, as shown in Fig. 17. Fig. 18 shows the complete architecture of the 2-D LDWT, in which four parts are involved: two sets of the first-stage 1-D DWT, two sets of the second-stage 1-D DWT, control unit, and the MAC unit. Fig. 18 shows the block diagram of the proposed 2-D LDWT that consists of the two stages: the first-stage 1-D DWT and the second-stage 1-D DWT. This architecture simply needs a small amount of TM. According to (12) and (13), the proposed IRSA architecture can also be applied to the 9/7 mode LDWT. Fig. 19 illustrates the approach. According to Figs. 9 and 10, the original signals (denoted as black circles) for both the 5/3 and 9/7 mode LDWTs can be processed by the same IRSA for the first-stage 1-D DWT operation. The high-frequency signals (denoted as gray circles), the correlated low-frequency signals, and the results of the first stage are used to compute the second-stage 1-D DWT coefficients. Compared to the 9/7 mode LDWT computation, the 5/3 mode LDWT is much easier to compute, and the register arrangement shown in Figs. 11 and 15 is simple. For 9/7 mode LDWT implementation with the same system architecture as the 5/3 mode LDWT, the following modifications are required.
1) The control signals of the MUX in Figs. 11 and 15 must be modified by rearranging the registers of the MAC block to process the 9/7 parameters.
2) The wavelet coefficients of the dual-mode LDWT are different. The coefficients for 5/3 mode LDWT are α = −1/2 and β = 1/4, but the coefficients are α = −1.586134142, β = −0.052980118, γ = +0.882911075, and δ = +0.443506852 for 9/7 mode LDWT. For calculation simplicity, the integer approach proposed by Huang et al. [23] is employed. Similar to the multiplication implementation using shifters and adders in the 5/3
Fig. 19. Processing procedures of 2-D dual-mode LDWTs under the same IRSA architecture.
TABLE II
Comparisons of One-Level 2-D Architectures for 5/3 LDWT

| Architecture | TM (bytes) | Latency | CT | Adders | Multipliers |
|---|---|---|---|---|---|
| Proposed (ours) | 2N | (3/2)N + 3 | (3/4)N² + (3/2)N + 7 | 8 | 0 |
| [2] | 3.5N | – | – | 12 | 6 |
| [3] | 3.5N | 2N + 5 | (N²/2) + N + 5 | 8 | 4 |
| [4] | 2.5N | 3 | N² | 6 | 4 |
| [5] | 3N | – | (N²/2) + N + 5 | 5 | 0 |
| [6] | N²/4 + 5N | 3 | N² | 4 | 0 |
| [13] | N² | – | – | 12 | 6 |
| [18] | 2N | – | (N²/2) + N | 8 | 0 |
| [25] | 3.5N | – | – | – | – |
| [26] | 3.5N | – | 10 + (4/3)N²[1 − (1/4)] + 2N[1 − (1/2)] | – | 6 |

TM size is used to store frequency coefficients in the one-level 2-D DWT.
Latency generally means any delay or waiting time that increases the real or perceived response time beyond the desired response time. Here, the 2-D DWT latency is measured from the original image input to the first subband output.
CT represents the time used to compute an image of size N × N.
The image is assumed to be of size N × N.
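
As a worked check, the cost expressions listed for the proposed design (TM of 2N or 4N, latency (3/2)N + 3, CT (3/4)N² + (3/2)N + 7) can be evaluated for a concrete image size. The sketch below is illustrative only: `proposed_cost` is a hypothetical helper name, N is assumed even so the integer divisions are exact, and the TM entry is treated as a count of stored words that are converted to bytes with an assumed 16-bit word (the register width stated in the paper).

```python
def proposed_cost(n, mode="5/3", word_bits=16):
    """Evaluate the proposed design's cost expressions for an N x N image.

    Assumptions (not from the table itself): n is even, and each TM entry
    is one word of `word_bits` bits.
    """
    if mode not in ("5/3", "9/7"):
        raise ValueError("mode must be '5/3' or '9/7'")
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")

    tm_words = (2 if mode == "5/3" else 4) * n       # 2N or 4N
    return {
        "tm_words": tm_words,
        "tm_bytes": tm_words * word_bits // 8,
        "latency_cycles": 3 * n // 2 + 3,             # (3/2)N + 3
        "ct_cycles": 3 * n * n // 4 + 3 * n // 2 + 7  # (3/4)N^2 + (3/2)N + 7
    }


if __name__ == "__main__":
    for mode in ("5/3", "9/7"):
        print(mode, proposed_cost(256, mode))
```

For N = 256 this gives a latency of 387 cycles, a CT of 49 543 cycles, and 1024 or 2048 bytes of TM, which agrees with the design specification reported in Table V.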
TABLE III
Comparisons of One-Level 2-D Architectures for 9/7 LDWT

| Architecture | TM (bytes) | Latency | CT | Adders | Multipliers |
|---|---|---|---|---|---|
| Proposed (ours) | 4N | (3/2)N + 3 | (3/4)N² + (3/2)N + 7 | 16 | 0 |
| [3] | N² | – | 4N²/3 + 2 | 8 | 4 |
| [9] | 12N | – | N² | 12 | 9 |
| [10] | N²/4 + LN + L | – | N²/2∼(2/3)N | 4L | 4L |
| [14] | 22N | – | N² | 36 | 36 |
| [17] | 14N | – | – | 16 | 12 |
| [25] | 5.5N | – | – | 16 | 10 |
| [26] | 5.5N | – | 22 + (4/3)N²[1 − (1/4)] + 6N[1 − (1/2)] | 8 | 6 |
| [27] | – | – | – | 32 | 20 |
| [28] | N² + 4N + 4 | – | 2N²/3 | 16 | 16 |

TM size is used to store frequency coefficients in the one-level 2-D DWT.
Latency generally means any delay or waiting time that increases the real or perceived response time beyond the desired response time. Here, the 2-D DWT latency is measured from the original image input to the first subband output.
CT represents the time used to compute an image of size N × N.
The image is assumed to be of size N × N.
L: the filter length.
TABLE IV
Hardware Cost and Performance Comparisons of Various 9/7 2-D LDWT High-Throughput Architectures
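
The critical-path (CP) figures that the text quotes for the architectures compared in Table IV follow directly from the reported 16-b multiplier and adder delays (6.79 ns and 3.01 ns) and the SRAM gate count per N (5252 gates for a 256 × 16 block). The following sketch simply redoes that arithmetic; the function and constant names are hypothetical and the numbers are the ones stated in the paper.

```python
T_M_NS = 6.79              # delay of a 16-b multiplier, as reported in the text
T_A_NS = 3.01              # delay of a 16-b adder, as reported in the text
SRAM_GATES_PER_N = 5252    # gate count of one 256 x 16 SRAM block (1N)


def critical_path_ns(multipliers, adders):
    """Critical path of a chain of `multipliers` Tm and `adders` Ta delays, in ns."""
    return round(multipliers * T_M_NS + adders * T_A_NS, 2)


def tm_gates(n_multiple):
    """Gate count of a transpose memory of size n_multiple * N, for N = 256."""
    return n_multiple * SRAM_GATES_PER_N


if __name__ == "__main__":
    print("proposed, 2Tm + 4Ta:", critical_path_ns(2, 4))   # quoted as 25.62 ns
    print("[23],     1Tm + 5Ta:", critical_path_ns(1, 5))   # quoted as 21.84 ns
    print("[25],     4Tm + 8Ta:", critical_path_ns(4, 8))   # quoted as 51.24 ns
    print("[30],     1Tm + 2Ta:", critical_path_ns(1, 2))   # quoted as 12.81 ns
    print("[30] TM of 10N, gates:", tm_gates(10))           # quoted as 52 520
```

Laying the numbers side by side makes the tradeoff explicit: the proposed design has a longer CP than [23] and [30] but needs less TM and no multipliers.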
proposed architecture and former architectures in the literature. Comparison results indicate that the proposed VLSI architecture outperforms the previous works in terms of TM size, in particular requiring around 50% less memory than the JPEG2000 standard [13]. Moreover, the 2-D LDWT is frame based, and its implementation bottleneck is the huge TM. In this paper, fewer memory units are needed with the proposed architecture, and the latency is fixed at (3/2)N + 3 clock cycles. Chen and Wu [4] proposed a folded and pipelined architecture to compute the 2-D 5/3 LDWT, and the TM is of size 2.5N for the N×N 2-D DWT. This lifting architecture for vertical filtering with two adders and one multiplier is divided into two parts, and each part has one adder and one multiplier. Because both parts are activated in different cycles, the same adder and multiplier can be shared to increase the hardware utilization and reduce the latency. As mentioned above, according to the characteristics of the signal flow operation, it may increase the operation complexity. To reduce the TM requirement and shorten the critical path (CP), the IRSA is proposed to change the signal flow from row-wise to mixed row- and column-wise. The IRSA is parallel and pipelined in orientation, which can improve the processing speed significantly. In addition, the TM of an image of size N × N simply needs 2N or 4N for 5/3 or 9/7 schemes, respectively, with the new approach.

In the proposed high-throughput 2-D LDWT architecture, we have considered the tradeoffs between the TM and the CP in VLSI implementation. Table IV shows the hardware cost and typical performance comparisons of various 9/7 high-throughput 2-D DWTs in terms of the number of multipliers and adders, TM size, CP, and throughput. The length of the input signal is 8 b. The lengths of the adder, multiplier, and register are 16 b with 11 integer bits and 5 fractional bits [33]. For the proposed 2-D LDWT architecture, the TM is 4N and the CP is 2Tm + 4Ta (Tm and Ta denote the delay time of the multiplier and adder, respectively). The column
TABLE V
Design Specification of the Proposed 2-D LDWT

| Item | Value |
|---|---|
| Chip specification | N = 256, tile size = 256 × 256 |
| Gate count | 29 196 gates |
| Power supply | 1.8 V |
| Technology | TSMC 0.18-μm 1P6M (CMOS) |
| Wireload model | TSMC 0.18− wl10 |
| Latency | (3/2)N + 3 = 387 clock cycles |
| TM size | 2-D 5/3 DWT: 1024 bytes; 2-D 9/7 DWT: 2048 bytes |
| Power estimation | 15.47 mW |
| CT | (3/4)N² + (3/2)N + 7 = 49 543 clock cycles |
| Maximum clock rate | 100 MHz |

processor needs eight registers for storing the original pixels and coefficients of X(i, j), H(i, j), and L(i, j). Each of the row and column processors needs zero multipliers and eight adders.

The delays of a 16-b multiplier and adder are 6.79 and 3.01 ns, respectively. The gate count of a 16-b multiplier is 762 [34]. The gate count of the 256 × 16 static RAM (1N = 256 × 16) is 5252 [32]. According to Table IV, the flipping architecture [23] and the generic RAM-based architecture [25] require identical TM of 5.5N, and CPs of 1Tm + 5Ta (21.84 ns) and 4Tm + 8Ta (51.24 ns), respectively. The architecture proposed by Mohanty and Meher [30] requires TM of size 10N (52 520 gates), while the proposed design simply needs 4N. However, the CP of [30] is 1Tm + 2Ta (12.81 ns), which is much shorter than the 2Tm + 4Ta (25.62 ns) of the proposed architecture. Yet, when designing the VLSI architecture of the 2-D dual-mode LDWT, we must consider the tradeoff between TM and computation complexity. Herein, 27 multipliers are used in [30], while no multipliers are required with the proposed scheme. Our approach can thus save hardware cost significantly.

According to Table IV, the proposed 2-D LDWT architecture outperforms the former systems (high-throughput architectures) [23], [25], [30] in terms of TM size and CP. This 2-D DWT architecture is lifting based such that the reduction of the TM is significant. In addition, parallel and pipelining schemes are adopted to reduce the internal memory requirement.

A 256 × 256 2-D LDWT was designed and simulated with VerilogHDL and further synthesized by the Synopsys design compiler with TSMC 0.18-μm 1P6M CMOS standard process technology to verify the performance of the proposed hardware architecture. The detailed specifications of the 256 × 256 2-D LDWT are listed in Table V. There have been increasing concerns regarding the efficacy of wireload models for deep-submicrometer processes (0.18-μm process technology in this paper) because of parasitic interconnections. The TSMC 0.18− wl10 script is used for the wireload model to simulate the performance. This approach can also be extended to acquire power-efficient 2-D LDWT architectures.

VI. Conclusion

This paper presented a new architecture to reduce the TM requirement of the 2-D LDWT. The proposed architecture had a mixed row- and column-wise signal flow, rather than purely row-wise as in traditional 2-D LDWT. Moreover, a new approach, namely, IRSA, was proposed to reduce the TM requirement for the 2-D dual-mode LDWT. The proposed 2-D architectures were more efficient than former architectures in trading off low TM requirement, output latency, control complexity, and regular memory access sequence. The proposed architecture reduced the TM significantly to a memory size of only 2N or 4N (5/3 or 9/7 mode), and reduced the latency to (3/2)N + 3. Due to the regularity and simplicity of the IRSA LDWT architecture, a dual-mode (5/3 and 9/7) 256 × 256 2-D LDWT prototyping chip was designed by TSMC 0.18-μm 1P6M standard CMOS technology. The 5/3 and 9/7 filters with different lifting steps were realized by cascading the four modules (split, predict, update, and scaling phases). The prototyping chip took 29 196 gate counts and could operate at 100 MHz. The method was applicable to any DWT-based signal compression standard, such as JPEG2000, motion-JPEG2000, MPEG-4 still texture object decoding, and wavelet-based SVC.

References

[1] S. G. Mallat, “A theory for multi-resolution signal decomposition: The wavelet representation,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.
[2] C. Diou, L. Torres, and M. Robert, “An embedded core for the 2-D wavelet transform,” in Proc. IEEE Emerg. Technol. Factory Automat., Oct. 2001, pp. 179–186.
[3] K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol. 50, no. 4, pp. 966–977, Apr. 2002.
[4] S.-C. Chen and C.-C. Wu, “An architecture of 2-D 3-level lifting-based discrete wavelet transform,” in Proc. VLSI Des./CAD Symp., Aug. 2002, pp. 351–354.
[5] P.-Y. Chen, “VLSI implementation of lifting discrete wavelet transform using the 5/3 filter,” IEICE Trans. Inform. Syst., vol. E85-D, no. 12, pp. 1893–1897, Dec. 2002.
[6] J.-S. Chiang and C.-H. Hsia, “An efficient VLSI architecture for 2-D DWT using lifting scheme,” in Proc. IEEE Int. Conf. Syst. Signals, Apr. 2005, pp. 528–531.
[7] K. C. B. Tan and T. Arslan, “Low power embedded extension algorithm for the lifting based discrete wavelet transform in JPEG2000,” IET Electron. Lett., vol. 37, no. 22, pp. 1328–1330, Oct. 2002.
[8] W. Jiang and A. Ortega, “Lifting factorization-based discrete wavelet transform based architecture design,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 5, pp. 651–657, May 2001.
[9] G.-C. Jung and S.-M. Park, “VLSI implement of lifting wavelet transform of JPEG2000 with efficient RPA (recursive pyramid algorithm) realization,” IEICE Trans. Fundamentals, vol. E88-A, no. 12, pp. 3508–3515, Dec. 2005.
[10] P.-Y. Chen, “VLSI implementation for one-dimensional multilevel lifting-based wavelet transform,” IEEE Trans. Comput., vol. 53, no. 4, pp. 386–398, Apr. 2004.
[11] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform,” IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1575–1586, Apr. 2005.
[12] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” J. Fourier Anal. Applicat., vol. 4, no. 3, pp. 247–269, 1998.
[13] Information Technology, JPEG 2000 Part 1. Final Committee Draft Version 1.0, document ISO/IEC 15444-1 JTC1/SC29 WG1, 2000.
[14] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI architecture for the discrete wavelet transform,” IEEE Trans. Circuits Syst. II, vol. 42, no. 5, pp. 305–316, May 1995.
[15] JPEG 2000 Verification Model 9.0, document ISO/IEC JTC1/SC29/WG1 Wgln 1684, 2000.
[16] S. G. Mallat, “Multi-frequency channel decompositions of images and wavelet models,” IEEE Trans. Acoust., Speech Signal Process., vol. ASSP-37, no. 12, pp. 2091–2110, Dec. 1989.
[17] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Efficient VLSI architecture of lifting-based discrete wavelet transform by systematic design method,” in Proc. IEEE Int. Symp. Circuits Syst., May 2002, pp. 26–29.
[18] K. Mei, N. Zheng, and H. van de Wetering, “High-speed and memory-efficient VLSI design of 2-D DWT for JPEG2000 applications,” IET Electron. Lett., vol. 42, no. 16, pp. 907–908, Aug. 2006.
[19] Information Technology, Motion JPEG2000, document ISO/IEC 15444-3, 2002.
[20] Information Technology, Coding of Moving Picture and Audio, document ISO/IEC JTC1/SC29 WG11, 2001.
[21] P. Chen and J. W. Woods, “Bidirectional MC-EZBC with lifting implementation,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 10, pp. 1183–1194, Oct. 2004.
[22] H. Varshney, M. Hasan, and S. Jain, “Energy efficient novel architectures for the lifting-based discrete wavelet transform,” IET Image Process., vol. 1, no. 3, pp. 305–310, Sep. 2007.
[23] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform,” IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1080–1089, Apr. 2004.
[24] K. C. B. Tan and T. Arslan, “Shift-accumulator ALU centric JPEG 2000 5/3 lifting based discrete wavelet transform architecture,” in Proc. IEEE Int. Symp. Circuits Syst., May 2003, pp. V161–V164.
[25] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based

Chih-Hsien Hsia (M’10) was born in Taipei, Taiwan, in 1979. He received the B.S. degree in computer science and information engineering from the Taipei Chengshih University of Science and Technology, Taipei, in 2003, and the M.S. degree in electrical engineering and the Ph.D. degree from Tamkang University, New Taipei, Taiwan, in 2005 and 2010, respectively.
From July 2007 to September 2007, he was a Visiting Scholar with Iowa State University, Ames. He is currently a Post-Doctoral Research Fellow with the Multimedia Signal Processing Laboratory, Graduate Institute of Electrical Engineering, National Taiwan University of Science and Technology, Taipei. He is an Adjunct Associate Professor with the Department of Electrical Engineering, Tamkang University. His current research interests include digital signal processing integrated circuit design, image and video processing, multimedia compression system design, multiresolution signal processing algorithms, and computer and robot vision processing.
Dr. Hsia is a member of the Phi Tau Phi Scholastic Honor Society.

Jen-Shiun Chiang (M’90) received the B.S. degree in electronics engineering from Tamkang University, New Taipei, Taiwan, in 1983, the M.S. degree in electrical engineering from the University of Idaho,
method,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. Moscow, in 1988, and the Ph.D. degree in electrical
910–919, Jul. 2005. engineering from Texas A&M University, College
[26] B.-F. Wu and C.-F. Lin, “A high-performance and memory-efficient Station, in 1992.
pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of In 1992, as an Associate Professor, he joined the
JPEG2000 codec,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, faculty of the Department of Electrical Engineering,
no. 12, pp. 1615–1628, Dec. 2005. Tamkang University, where he is currently a Pro-
[27] X. Lan, N. Zheng, and Y. Liu, “Low-power and high-speed VLSI fessor. His current research interests include digital
architecture for lifting-based forward and inverse wavelet transform,” signal processing for very large-scale integration architectures, architecture
IEEE Trans. Consumer Electron., vol. 51, no. 2, pp. 379–385, May for image data compressing, system-on-chip design, analog-to-digital data
2005. conversion, and low-power circuit design.
[28] P.-C. Wu and L.-G Chen, “An efficient architecture for two-dimensional
discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 11, no. 4, pp. 536–545, Apr. 2001.
[29] W.-M. Li, C.-H. Hsia, and J.-S. Chiang, “Memory-efficient architecture Jing-Ming Guo (M’06–SM’10) was born in Kaoh-
of 2-D dual-mode lifting scheme discrete wavelet transform for motion- siung, Taiwan, on November 19, 1972. He received
JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. the B.S.E.E. and M.S.E.E. degrees from National
750–753. Central University, Taoyuan, Taiwan, in 1995 and
[30] B.-K. Mohanty and P. K. Meher, “Memory efficient modular VLSI archi- 1997, respectively, and the Ph.D. degree from the
tecture for highthroughput and low-latency implementation of multilevel Institute of Communication Engineering, National
lifting 2-D DWT,” IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2072– Taiwan University, Taipei, Taiwan, in 2004.
2084, May 2011. From 1998 to 1999, he was an Information Tech-
[31] C. Zhang, C. Wang, and M. O. Ahmed, “A pipeline VLSI architecture nique Officer in the Chinese Army. He is currently
for fast computation of the 2-D discrete wavelet transform,” IEEE Trans. a Professor with the Department of Electrical Engi-
Circuit Syst. I, vol. 59, no. 8, pp. 1775–1785, Aug. 2012. neering, National Taiwan University of Science and
[32] M. Maamoun, M. Neggazi, A. Meraghni, and D. Berkani, “VLSI design Technology, Taipei. His current research interests include multimedia signal
of 2-D discrete wavelet transform for area-efficient and high-speed processing, multimedia security, computer vision, and digital halftoning.
image computing,” World Acad. Sci., Eng. Technol., vol. 35, pp. 538– Dr. Guo was a recipient of the Outstanding Youth Electrical Engineer Award
543, 2008. from the Chinese Institute of Electrical Engineering in 2011, the Outstanding
[33] B.-D. Choi, K.-S. Choi, M.-C. Hwang, J.-K. Cho, and S.-J. Ko, “Real- Young Investigator Award from the Institute of System Engineering in 2011,
time DSP implementation of motion-JPEG2000 using overlapped block the Best Paper Award from the IEEE International Conference on System
transferring and parallel-pass methods,” Real-Time Imag., vol. 10, no. Science and Engineering in 2011, the Excellence in Teaching Award in 2009,
5, pp. 277–284, Oct. 2004. the Research Excellence Award in 2008, the Acer Dragon Thesis Award in
[34] A. Bellaouar and M. I. Elmasry, Low-Power Digital VLSI Design 2005, the Outstanding Paper Award from IPPR Computer Vision, Graphics,
Circuits and Systems. Norwell, MA: Kluwer, 1995. and Image Processing in 2005 and 2006, and the Outstanding Faculty Award
[35] H. Muta, M. Doi, H. Nakano, and Y. Mori, “Multilevel parallelization on in 2002 and 2003. From 2003 to 2004, he was granted the National Science
the cell/B.E. for a motion JPEG 2000 encoding server,” in Proc. ACM Council Scholarship for advanced research from the Department of Electrical
Workshops Multimedia, Sep. 2007, pp. 942–951. and Computer Engineering, University of California, Santa Barbara.