
Unified VLSI architecture for photo core transform used in JPEG XR


Shuiping Zhang, Xin Tian, Chengyi Xiong and Jinwe Tian
A unified very large-scale integration (VLSI) architecture with butterflies that can perform the photo core transform (PCT) in JPEG XR image compression is presented. The proposed architecture supports the three elemental operations of PCT in a single unified design and offers lower hardware cost, a shorter critical path, lower power consumption, more efficient hardware utilisation and a regular structure suited to VLSI implementation. Finally, an implementation on Altera field programmable gate array (FPGA) devices validates the effectiveness of the design.

Introduction: JPEG XR [1, 2], an emerging image coding standard from the JPEG committee, is growing in popularity owing to its high image quality and compression capability at a low computation cost. It is designed to support high dynamic range and high-definition formats. Given these properties, it is very suitable for real-time embedded applications [3, 4], especially digital cameras, and the demand for very large-scale integration (VLSI) implementations has therefore grown. JPEG XR uses a hierarchical two-stage lapped biorthogonal transform to convert image data to the frequency domain. The transform is based on two key operations: the photo core transform (PCT) and the photo overlap transform (POT). Owing to its lifting structure, it requires only a small number of integer processing operations. At both transform stages, the POT is optional, whereas the PCT is always performed.
Owing to its wide application, high-performance hardware implementation of the PCT has become a research focus. In most of these works, the transform is implemented by lifting-based architectures [5, 6], and a co-design architecture [7] has been put forward to further increase the processing rate. However, these implementations rely on three parallel hardware architectures and cannot achieve complete hardware utilisation. Moreover, these designs do not achieve a unified architecture, so the hardware cost cannot be efficiently reduced.
In this Letter, a unified architecture that is capable of implementing the three basic operations of PCT is proposed for the first time. We attempt to design an efficient architecture with lower area cost, lower power consumption and shorter path latency. We also aim to achieve VLSI realisation with simple additions and multiplexers.
Algorithm description: The PCT is built using three operators [2]: the two-dimensional (2D) 2 × 2 Hadamard transform T_h, the 1D rotation T_odd and the 2D rotation T_odd_odd. According to the lifting operations, T_h can be viewed as two steps, that is

$$
\begin{cases}
a_1 = a + d \\
b_1 = b - c \\
c_1 = b + c \\
d_1 = a - d
\end{cases}
\;\rightarrow\;
\begin{cases}
a_3 = (a_1 + c_1)/2 \\
b_3 = (d_1 + b_1)/2 \\
c_3 = (d_1 - b_1)/2 \\
d_3 = (a_1 - c_1)/2
\end{cases}
\tag{1}
$$
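As an illustration of the two steps in (1), the following minimal Python sketch evaluates them with exact rational arithmetic. The function name `hadamard_2x2` and the use of `Fraction` are assumptions made for this example only; a hardware realisation would use integer adders and right shifts, and no rounding offset is modelled here.

```python
from fractions import Fraction


def hadamard_2x2(a, b, c, d):
    """Hypothetical helper: the two steps of Eq. (1) in exact arithmetic."""
    # Step 1: first layer of butterflies (additions and subtractions only).
    a1 = a + d
    b1 = b - c
    c1 = b + c
    d1 = a - d
    # Step 2: second layer of butterflies, each followed by a division by 2.
    a3 = Fraction(a1 + c1, 2)
    b3 = Fraction(d1 + b1, 2)
    c3 = Fraction(d1 - b1, 2)
    d3 = Fraction(a1 - c1, 2)
    return a3, b3, c3, d3


if __name__ == "__main__":
    a, b, c, d = 1, 2, 3, 4
    out = hadamard_2x2(a, b, c, d)
    # The outputs are the four +/-1 combinations of the inputs, scaled by 1/2.
    assert out[0] == Fraction(a + b + c + d, 2)
    assert out[1] == Fraction(a + b - c - d, 2)
    assert out[2] == Fraction(a - b + c - d, 2)
    assert out[3] == Fraction(a - b - c + d, 2)
    print(out)
```

Expanding the two steps symbolically gives the same four sign patterns, which is what the assertions in the usage block check.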

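Similarly, T_odd_odd can be written as given below:

$$
\begin{cases}
a_1 = (a - d)/2 \\
b_1 = (b + c)/2 \\
c_1 = b - c \\
d_1 = a + d
\end{cases}
\;\rightarrow\;
\begin{cases}
a_2 = \tfrac{23}{32}a_1 + \tfrac{156}{256}b_1 \\
b_2 = -\tfrac{3}{4}a_1 + \tfrac{23}{32}b_1 \\
c_2 = c_1 \\
d_2 = d_1
\end{cases}
\;\rightarrow\;
\begin{cases}
a_3 = a_2 + d_2/2 \\
b_3 = b_2 - c_2/2 \\
c_3 = b_2 + c_2/2 \\
d_3 = d_2/2 - a_2
\end{cases}
\tag{3}
$$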

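By observing the operations, we find that they share similar computation steps. Thus, we can obtain a unified algorithm for these operations by cascading three modules, arriving at

$$
\begin{cases}
a_1 = (a - d)/2 \\
b_1 = (b + c)/2 \\
c_1 = b - c \\
d_1 = a + d
\end{cases}
\xrightarrow{\text{step 1}}
\begin{cases}
a_2 = x_{11}a_1 + x_{12}b_1 \\
b_2 = x_{21}a_1 + x_{22}b_1 \\
c_2 = x_{31}c_1 + x_{32}d_1 \\
d_2 = x_{41}c_1 + x_{42}d_1
\end{cases}
\xrightarrow{\text{step 2}}
\begin{cases}
a_3 = a_2 + d_2/2 \\
b_3 = b_2 - c_2/2 \\
c_3 = b_2 + c_2/2 \\
d_3 = d_2/2 - a_2
\end{cases}
\tag{4}
$$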
where x_ij are the corresponding coefficients of the second module, i ∈ {1, 2, 3, 4}, j ∈ {1, 2}. Therefore, we can describe the algorithm as three modes, each with its own coefficient matrix. The coefficient matrices of the Hadamard transform, 1D rotation and 2D rotation are given as

$$
X_{h} =
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix},
\quad
X_{odd} =
\begin{bmatrix}
55/64 & 3/8 \\
-3/8 & 1 \\
55/64 & 3/8 \\
-3/8 & 1
\end{bmatrix},
\quad
X_{odd\_odd} =
\begin{bmatrix}
23/32 & 156/256 \\
-3/4 & 23/32 \\
1 & 0 \\
0 & 1
\end{bmatrix}
\tag{5}
$$
For different modes, the sequence of data flow should be changed. In step 1, a1 and d1 exchange their values, as do b1 and c1, for the 1D rotation mode. In step 2, a2, b2, c2 and d2 correspond to d2, c2, a2 and b2, respectively, for the Hadamard mode and the 1D rotation mode.
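The three cascaded modules of the unified algorithm can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the function name `unified_pct` is hypothetical, exact rationals stand in for the fixed-point datapath, the mode-dependent data-flow exchanges described above are not modelled, and the second row of the 2D rotation matrix is taken here as (-3/4, 23/32).

```python
from fractions import Fraction as F

# Coefficient matrix for the 2D rotation mode, one (x_i1, x_i2) pair per row.
# Assumption: row 2 is read as (-3/4, 23/32); 156/256 is copied as printed.
X_ODD_ODD = [
    (F(23, 32), F(156, 256)),
    (F(-3, 4), F(23, 32)),
    (F(1), F(0)),
    (F(0), F(1)),
]


def unified_pct(a, b, c, d, x):
    """Hypothetical model of the three cascaded modules with coefficients x."""
    # Module 1: input butterflies.
    a1 = F(a - d) / 2
    b1 = F(b + c) / 2
    c1 = F(b - c)
    d1 = F(a + d)
    # Module 2: mode-dependent coefficient stage (the x_ij of the unified form).
    a2 = x[0][0] * a1 + x[0][1] * b1
    b2 = x[1][0] * a1 + x[1][1] * b1
    c2 = x[2][0] * c1 + x[2][1] * d1
    d2 = x[3][0] * c1 + x[3][1] * d1
    # Module 3: output butterflies.
    a3 = a2 + d2 / 2
    b3 = b2 - c2 / 2
    c3 = b2 + c2 / 2
    d3 = d2 / 2 - a2
    return a3, b3, c3, d3


if __name__ == "__main__":
    print(unified_pct(64, 0, 0, 0, X_ODD_ODD))
```

Passing the matrix as an argument mirrors the idea that only the second module changes between modes, while the first and third modules are shared.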
VLSI architecture: The computational attractiveness of the previous operations is that they require only additions, right shifts and/or multiplications. However, in existing designs these operators are implemented by three separate architectures. Here, we consider that these operations belong to the same category. According to the above algorithm, a unified VLSI architecture with three pipeline stages, separated by dotted lines, is presented in Fig. 1. Simple multiplexers are applied to the exchanges of data flow.

The rounding factor is set to zero in the Hadamard transform. Let [a b c d] and [a3 b3 c3 d3] denote the input and output data vectors of the operations; [a1 b1 c1 d1] and [a2 b2 c2 d2] are the intermediate vectors. Moreover, T_odd can be modified into the following form with three steps:

$$
\begin{cases}
a_1 = a + d \\
b_1 = b - c \\
c_1 = (b + c)/2 \\
d_1 = (a - d)/2
\end{cases}
\;\rightarrow\;
\begin{cases}
a_2 = \tfrac{55}{64}a_1 + \tfrac{3}{8}b_1 \\
b_2 = b_1 - \tfrac{3}{8}a_1 \\
c_2 = \tfrac{55}{64}c_1 + \tfrac{3}{8}d_1 \\
d_2 = d_1 - \tfrac{3}{8}c_1
\end{cases}
\;\rightarrow\;
\begin{cases}
a_3 = a_2/2 + c_2 \\
b_3 = b_2/2 - d_2 \\
c_3 = c_2 - a_2/2 \\
d_3 = d_2 - b_2/2
\end{cases}
\tag{2}
$$

[Diagram: three-stage pipeline with coefficient blocks x11 to x42, adders and >>1 shifters]

Fig. 1 Unified architecture for PCT

ELECTRONICS LETTERS 16th April 2015 Vol. 51 No. 8 pp. 628–630

Fig. 1 represents a three-stage butterfly structure with only trivial adders and shifters. The module in the dotted box can be implemented by shifters and adders or by multipliers. Guided by the coefficient matrices, our effort towards a reduced-complexity implementation focuses on minimising the number of adders and the critical path; a part of the coefficient calculation is given in Fig. 2. For this purpose, the module in the dotted box can be realised using 16 adders and 4 multiplexers. The critical path is therefore 2Ta with three stages of pipeline, where Ta is the latency of the addition operation. Moreover, the lifting-based architecture [6] can be utilised to implement the module. We note that the lifting-based architecture saves two adders, but prolongs the critical path latency to 3Ta with the same number of pipeline stages.
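To illustrate how the constant coefficients can be realised with shifters and adders instead of multipliers, the sketch below decomposes each coefficient into a few shifted terms. The particular decompositions and the function names are assumptions chosen for this example and are not taken from Fig. 2; right shifts truncate, so the results are exact only when the input is a multiple of the largest power of two involved.

```python
from fractions import Fraction


def mul_3_8(x):
    # 3/8 = 1/4 + 1/8: two shifts, one adder.
    return (x >> 2) + (x >> 3)


def mul_3_4(x):
    # 3/4 = 1 - 1/4: one shift, one subtractor.
    return x - (x >> 2)


def mul_55_64(x):
    # 55/64 = 1 - 1/8 - 1/64: two shifts, two subtractors.
    return x - (x >> 3) - (x >> 6)


def mul_23_32(x):
    # 23/32 = 1 - 1/4 - 1/32: two shifts, two subtractors.
    return x - (x >> 2) - (x >> 5)


def mul_156_256(x):
    # 156/256 = 39/64 = 1/2 + 1/8 - 1/64: three shifts, two adders.
    return (x >> 1) + (x >> 3) - (x >> 6)


SHIFT_ADD = {
    Fraction(3, 8): mul_3_8,
    Fraction(3, 4): mul_3_4,
    Fraction(55, 64): mul_55_64,
    Fraction(23, 32): mul_23_32,
    Fraction(156, 256): mul_156_256,
}

if __name__ == "__main__":
    x = 256  # a multiple of 64, so every shift above is exact
    for coeff, fn in SHIFT_ADD.items():
        assert fn(x) == coeff * x
        print(coeff, fn(x))
```

Because the shifts are hard-wired, each product in this sketch costs at most two additions, in line with the two-adder critical path quoted in the text.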


Conclusion: In this Letter, a unified architecture with butterflies for PCT has been proposed. The architecture has a simple controller and is implemented with only adders and multiplexers, which makes it very favourable for VLSI implementation. Performance evaluation indicates that the proposed architecture has the benefits of low cost, low power consumption and efficient hardware utilisation. We conclude that the unified architecture is well suited to the implementation of PCT.

Fig. 2 Block diagram of coefficient calculation (input a1; shifters >>1 and >>2 with adders; outputs x11a1 and x21a1)

Results and discussion: To evaluate the performance of the proposed unified architecture, different PCT architectures are compared in terms of hardware complexity, hardware utilisation, computing complexity, critical path and system power consumption. The shift operations are hard-wired, so the number of adders and multipliers is the main measure of hardware complexity. Compared with the previous architectures, which have about 50% hardware utilisation for the 1D and 2D rotation modules, the hardware utilisation of the proposed architecture can reach 100%. The computing time is normalised as intra-clock cycles, and can be easily calculated as 13 for a 4 × 4 PCT, which is two cycles more than the previous designs with the same three stages of pipeline. However, the number of registers required and the critical path of the proposed architecture are reduced efficiently. The overall performance comparison is listed in Table 1, where Tm denotes the latency of the multiplication operation.

Table 1: Comparisons with previous designs

Architectures    [5, 6]    [7]    Proposed
Adders           40        28     24
Multipliers      0         20     0
Critical path    4Ta       Tm     2Ta

To verify the proposed architecture, an Altera EP2S30F672I4 field programmable gate array (FPGA) target was used. The experimental results are summarised in Table 2, which reports the consumption of logic cells and registers. The architecture uses less area owing to its capability of supporting the three basic operations and the effective reuse of the coefficients in the unified mapping. The critical path and dynamic power consumption are also listed. The dynamic power consumption of the proposed design is 5.7 mW when operating at 50 MHz. The synthesised results show that the proposed unified architecture not only shortens the critical path, but also significantly reduces hardware cost and power consumption.

Table 2: Experimental results among previous designs

Acknowledgment: This work was supported by the National Natural Science Foundation of China under grants 61273279, 61102064 and 61471400.
© The Institution of Engineering and Technology 2015
29 October 2014
doi: 10.1049/el.2014.3861
Shuiping Zhang and Jinwe Tian (School of Automation, Huazhong University of Science and Technology, No. 1037 Luoyu Road, Wuhan 430074, Hubei, People's Republic of China)
E-mail: times_zhang@hust.edu.cn
Xin Tian (School of Electronic Information, Wuhan University, No. 299 Ba Yi Road, Wuhan 430072, Hubei, People's Republic of China)
Chengyi Xiong (School of Electronic and Information Engineering, South-Central University for Nationalities, No. 182 Minyuan Road, Wuhan 430074, Hubei, People's Republic of China)
References
1 Dufaux, F., Sullivan, G.J., and Ebrahimi, T.: 'The JPEG XR image coding standard [standards in a nutshell]', IEEE Signal Process. Mag., 2009, 26, (6), pp. 195–199, 204, doi: 10.1109/MSP.2009.934187
2 ISO/IEC 29199-2: 'Information technology – JPEG XR image coding system – Part 2: Image coding specification', 2012
3 Perra, C.: 'Re-encoding JPEG images for smart phone applications'. Proc. IEEE 21st Telecommunications Forum, Belgrade, Yugoslavia, November 2013, pp. 955–958, doi: 10.1109/TELFOR.2013.6716389
4 Pan, C.H., Chien, C.Y., Chao, W.M., Huang, S.C., and Chen, L.G.: 'Architecture design of full HD JPEG XR encoder for digital photography applications', IEEE Trans. Consum. Electron., 2008, 54, (3), pp. 963–971, doi: 10.1109/TCE.2008.4637574
5 Maalouf, A., and Larabi, M.C.: 'Low-complexity enhanced lapped transform for image coding in JPEG XR/HD Photo'. 16th IEEE Int. Conf. on Image Processing, Cairo, Egypt, November 2009, pp. 5–8, doi: 10.1109/ICIP.2009.5413933
6 Tu, C.J., Srinivasan, S., Sullivan, G.J., Regunathan, S., and Malvar, H.S.: 'Low-complexity hierarchical lapped transform for lossy-to-lossless image coding in JPEG XR/HD Photo'. Proc. SPIE Applications of Digital Image Processing XXXI, San Diego, USA, August 2008, pp. 70730C-1–70730C-12, doi: 10.1117/12.797097
7 Tseng, C.-F., and Lai, Y.-T.: 'Hardware-software co-design architecture for joint photo expert graphic XR encoder', IET Image Process., 2012, 6, (9), pp. 1284–1292, doi: 10.1049/iet-ipr.2011.0491

Architectures    Logic cells    Registers    Critical path (ns)    Power (mW)
[5, 6]           1153           817          8.002                 9.38
[7]              1269           794          5.337                 11.08
Proposed         766            385          4.702                 5.7
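The relative savings implied by the Table 2 figures can be worked out with a few lines of Python. The dictionary layout, key names and the helper `relative_saving` are assumptions introduced for this illustration; the numbers are copied from the table and nothing else is measured or derived.

```python
# Table 2 figures, keyed by architecture (key names are illustrative only).
TABLE2 = {
    "[5, 6]": {"logic_cells": 1153, "registers": 817,
               "critical_path_ns": 8.002, "power_mw": 9.38},
    "[7]": {"logic_cells": 1269, "registers": 794,
            "critical_path_ns": 5.337, "power_mw": 11.08},
    "Proposed": {"logic_cells": 766, "registers": 385,
                 "critical_path_ns": 4.702, "power_mw": 5.7},
}


def relative_saving(baseline, metric):
    """Fraction by which the proposed design undercuts `baseline` on `metric`."""
    return 1 - TABLE2["Proposed"][metric] / TABLE2[baseline][metric]


if __name__ == "__main__":
    for baseline in ("[5, 6]", "[7]"):
        for metric in TABLE2["Proposed"]:
            saving = relative_saving(baseline, metric)
            print(f"vs {baseline:7s} {metric:17s} {saving:6.1%}")
```

Every metric in the table is lower-is-better, so a positive value from `relative_saving` indicates an improvement of the proposed design over the chosen baseline.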
