
Region-of-Interest based Image Resolution Adaptation for MPEG-21 Digital Item

Yiqun Hu, Liang-Tien Chia, Deepu Rajan
Center for Multimedia and Network Technology
School of Computer Engineering
Nanyang Technological University, Singapore 639798
Y030070@ntu.edu.sg, asltchia@ntu.edu.sg, asdrajan@ntu.edu.sg

ABSTRACT
The upcoming MPEG-21 standard proposes a general framework for the augmented use of multimedia services across different network environments, for various users with various terminal devices. In the context of image adaptation, terminals with different screen size limitations require the multimedia adaptation engine to adapt image resources intelligently. Saliency map based visual attention analysis provides a degree of intelligence for finding the attention area within an image. In this paper, we improve the standard MPEG-21 metadata driven adaptation engine with an enhanced saliency map based visual attention model, which provides a means to intelligently adapt JPEG2000 image resolution for terminal devices with varying screen sizes according to human visual attention.

Categories and Subject Descriptors
H.3.5 [Online Information Services]: Data Sharing

General Terms
Standardization

Keywords
MPEG-21, Image Adaptation, Saliency Map, Intelligent Resolution

1. INTRODUCTION
With the development of multimedia and network technology, multimedia resources can be accessed by different terminal devices in different network situations. The most vital limitation of a terminal device is its screen size. Direct resolution reduction of large images to fit the terminal screen size is not always an optimal adaptation because, according to information theory, information within an image will be lost if the image size is reduced below a certain percentage. Ideal image adaptation in this context should provide a better user experience by displaying only the area that most captures human visual attention, so that the available screen size is optimally used for the most attentive information. MPEG-21 Standard Part 7, Digital Item Adaptation (DIA) [6], describes a standardized framework to adapt format-dependent and format-independent multimedia resources according to terminal capability. For image resolution adaptation, the standard adaptation engine only supports direct resolution reduction, although it provides the description tool for region-of-interest (ROI) information. The engine itself does not provide a method to automatically detect the ROI of the requested image. In this paper, we improve the standard MPEG-21 image adaptation engine to automatically detect the ROI of an image using an enhanced visual attention model. The engine also auto-generates the adaptation decision based on the image ROI and the terminal capability information, and finally adapts the resolution of the image in accordance with human visual attention. Using our image adaptation engine, the time-consuming work of manually outlining image ROIs in an image database is automated in real time, and users of mobile devices with extremely small screens can achieve a better user experience by viewing the most attentive area within the limited screen size.

The rest of this paper is organized as follows. We begin in Section 2 by briefly introducing the improved saliency map based visual attention model. In Section 3, we present the MPEG-21 framework and the architecture of the BSD adaptation engine. Our improved image adaptation engine using the enhanced visual attention model is then described in Section 4. Experimental evaluations are given in Section 5 and we conclude the paper in Section 6.

2. VISUAL ATTENTION MODEL
Attempts to understand the human visual system have resulted in visual attention models such as the saliency map based Visual Attention Model [2] and the Contrast-based Attention Model [3]. In our system, we utilize the saliency map based visual attention model [2] because of its biological plausibility. While the objective in [2] is to track several visually attentive objects dynamically using a "winner-take-all" neural network, our aim is to extract a single region which captures all the attention objects. The saliency map is a topographical combined feature map that indicates the possible attention area of an image. In the next two subsections, we briefly review the process of generating the saliency map, as described in [2].
2.1 Feature Extraction
The first step in generating the saliency map is low level feature extraction. Three low level features which represent the characteristics of an image are used: intensity, color and orientation. Each feature is subsampled and filtered into Gaussian pyramids. The difference between fine and coarse scales is implemented by a "center-surround" difference operator ⊖, which interpolates a coarse scale map to a fine scale and then performs point-by-point subtraction. The center is a pixel at scale c ∈ {2, 3, 4} and the surround is the corresponding pixel at scale s = c + δ, δ ∈ {3, 4}. For intensity, an intensity image I is calculated as I = (r + g + b)/3, where r, g and b represent the red, green and blue channels of the input image respectively. This intensity image is used to create a Gaussian pyramid I(σ), where σ ∈ [0..8] is the scale, and six maps are calculated as I(c, s) = |I(c) ⊖ I(s)|. For color, four broadly-tuned color channels are created as R = r − (g + b)/2 for red, G = g − (r + b)/2 for green, B = b − (r + g)/2 for blue and Y = (r + g)/2 − |r − g|/2 − b for yellow, from which four Gaussian pyramids R(σ), G(σ), B(σ) and Y(σ) are created. Maps RG(c, s) (1) and BY(c, s) (2) simultaneously account for red/green, green/red double opponency and blue/yellow, yellow/blue double opponency:

RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|   (1)

BY(c, s) = |(B(c) − Y(c)) ⊖ (Y(s) − B(s))|   (2)

For orientation, Gabor filters are used to extract orientation information. The orientation maps are obtained from I using oriented Gabor pyramids O(σ, θ), where θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. Orientation features O(c, s, θ) are calculated for each orientation θ as a group of six maps (3):

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|   (3)

In all, 42 feature maps are created: 6 for intensity, 12 for color and 24 for orientation.
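As an illustration of this feature extraction step, the following Python fragment (a minimal sketch assuming OpenCV and NumPy; building the pyramid with pyrDown and upsampling bilinearly are implementation choices not fixed by the paper) computes the intensity pyramid and its six center-surround maps. The color and orientation maps follow the same pattern, using the color channels and Gabor-filtered pyramids defined above.

import cv2
import numpy as np

def gaussian_pyramid(img, levels=9):
    """Scales 0..8: repeatedly blur and downsample."""
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """|P(c) (-) P(s)|: interpolate the coarse scale s up to the
    fine scale c, then subtract point by point."""
    h, w = pyr[c].shape[:2]
    coarse = cv2.resize(pyr[s], (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(pyr[c] - coarse)

def intensity_feature_maps(bgr):
    """The six maps I(c, s), c in {2, 3, 4}, s = c + delta, delta in {3, 4}."""
    b, g, r = cv2.split(bgr.astype(np.float32))
    pyr = gaussian_pyramid((r + g + b) / 3.0)
    return [center_surround(pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]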
2.2 Saliency Map Generation
The saliency map is generated from the 42 feature maps. A normalization operator N(·) [2] is used to globally enhance maps with a few strong contrast peaks and to suppress maps with numerous comparable small peaks. It consists of three steps: 1) normalize the map to a fixed range [1..M]; 2) find the global maximum M and compute the average m̄ of all the other local maxima; 3) multiply the map by (M − m̄)². Comparing the global maximum with the average of the local maxima measures how strongly the most active location stands out: a large difference means the location is salient and the map is promoted, while a small difference means the map contains nothing unique and is suppressed. The feature maps are combined into three "conspicuity maps", one per feature type: Ī for intensity (4), C̄ for color (5) and Ō for orientation (6). They are calculated through across-scale addition ⊕, which is implemented by reducing each map to scale four and adding point by point:

Ī = ⊕_{c=2..4} ⊕_{s=c+3..c+4} N(I(c, s))   (4)

C̄ = ⊕_{c=2..4} ⊕_{s=c+3..c+4} [N(RG(c, s)) + N(BY(c, s))]   (5)

For orientation, maps with the same preferred orientation are first combined into groups of six, and the four groups are then combined into a single orientation conspicuity map:

Ō = Σ_{θ ∈ {0°, 45°, 90°, 135°}} N(⊕_{c=2..4} ⊕_{s=c+3..c+4} N(O(c, s, θ)))   (6)

Finally, the three conspicuity maps are normalized and summed into the final saliency map S = (1/3)(N(Ī) + N(C̄) + N(Ō)).
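A sketch of the normalization operator and the final combination follows; detecting local maxima with a 3×3 dilation is an implementation choice of this sketch, since [2] does not prescribe how the local maxima are found.

import cv2
import numpy as np

def normalize_map(fmap, M=1.0):
    """N(.): promote maps with one strong peak, suppress maps whose
    peaks are all of comparable strength."""
    fmap = fmap.astype(np.float32)
    fmap -= fmap.min()
    if fmap.max() > 0:
        fmap *= M / fmap.max()                      # step 1: scale to [0..M]
    # step 2: average of local maxima other than the global maximum
    is_peak = fmap == cv2.dilate(fmap, np.ones((3, 3), np.float32))
    peaks = fmap[is_peak]
    peaks = peaks[peaks < M]
    m_bar = peaks.mean() if peaks.size else 0.0
    return fmap * (M - m_bar) ** 2                  # step 3

def combine_saliency(i_bar, c_bar, o_bar):
    """S = (N(I) + N(C) + N(O)) / 3, maps already at a common scale."""
    return (normalize_map(i_bar) + normalize_map(c_bar)
            + normalize_map(o_bar)) / 3.0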
2.3 Enhancement of the Visual Attention Model
We assume that a small object at the edge of an image is unlikely to be the main attention region, and that attention regions closer to the center of the image are perceptually more important in human vision. We therefore assign a weight to each pixel of the image. Without additional restrictions, we assume the weight surface follows a Gaussian distribution along both the horizontal and the vertical direction ((7), (8)), and the total weight is the arithmetic mean of the two directions:

N(µ_x, σ_x²) = (1 / (√(2π) σ_x)) exp[−(1/2)((x − µ_x)/σ_x)²]   (7)

N(µ_y, σ_y²) = (1 / (√(2π) σ_y)) exp[−(1/2)((y − µ_y)/σ_y)²]   (8)

Both Gaussian curves are centered on the image by setting µ_x to half the width (Width/2) and µ_y to half the height (Height/2). σ_x and σ_y are fixed to 10 so that the Gaussian curves stay smooth, avoiding a sharp peak that would favor only a small central region of the image. These weights modify the saliency map as in (9):

S̄(x, y) = S(x, y) · (N(µ_x, σ_x²) + N(µ_y, σ_y²)) / 2   (9)

S̄(x, y) is the weighted value of the saliency map at location (x, y). By weighting the saliency map according to position, tiny attention points at the edges of the image are skipped and the focus is kept on the most important attention region. Our experimental results show that this simple factor is effective for noise reduction.
The modified saliency map thus assigns a different value to each point according to its position-weighted attention. In our image adaptation model, a simple region growing algorithm, whose similarity threshold is defined as 30% of the gray level range of the saliency map, is used to generate the smallest bounding rectangle that includes the identified attention area(s). First, we take the pixels with the maximum value (one or several) as seeds and execute the region growing algorithm. In each growing step, the 4-neighbours of the current seed are examined; if the difference between a neighbour and the current seed is smaller than the threshold, the neighbour is added to the seed queue to be grown later. The algorithm continues until the seed queue is empty. Finally, the output is one or several separate regions, and we generate the smallest rectangle that includes them all.
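A minimal sketch of this region growing step is given below, assuming a single-channel float saliency map; it returns the bounding rectangle in pixel coordinates.

from collections import deque
import numpy as np

def attention_rectangle(sal):
    """Grow from the saliency maxima with a 4-neighbourhood and a
    threshold of 30% of the gray level range; return (x0, y0, x1, y1)."""
    sal = sal.astype(np.float32)
    threshold = 0.3 * (sal.max() - sal.min())
    grown = np.zeros(sal.shape, dtype=bool)
    queue = deque(zip(*np.where(sal == sal.max())))   # one or several seeds
    for y, x in queue:
        grown[y, x] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < sal.shape[0] and 0 <= nx < sal.shape[1]
                    and not grown[ny, nx]
                    and abs(sal[ny, nx] - sal[y, x]) < threshold):
                grown[ny, nx] = True
                queue.append((ny, nx))
    ys, xs = np.where(grown)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())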

3. MPEG-21 DIGITAL ITEM ADAPTATION
The upcoming MPEG-21 framework considers heterogeneous network environments, different terminal devices and the personal characteristics of users to provide an open framework for all participants in the multimedia consumption chain [1]. The multimedia resource is combined with metadata describing the network environment, terminal capability and user characteristics into the fundamental unit of distribution and transaction called the Digital Item. The MPEG-21 multimedia standard defines the technology needed to support Users in exchanging, accessing, consuming, trading and otherwise manipulating Digital Items in an efficient, transparent and interoperable way [1].

Digital Item Adaptation is an important part of the MPEG-21 standard (Part 7) [6]. It aims to achieve interoperable transparent access to (distributed) advanced multimedia content by shielding Users from network and terminal installation, management and implementation issues. In the Final Committee Draft of Digital Item Adaptation [6], an adaptation engine architecture using Bitstream Syntax Description (BSD) [5] is proposed. When an image is input into the adaptation engine, the BinToBSD engine analyses the multimedia resource and generates a format-dependent BSD or a format-independent generic BSD (gBSD) indicating the high-level structure of the corresponding image bitstream. The BSD (gBSD) description of the Digital Item is then adapted with the help of an XML Stylesheet Transformation (XSLT) according to all related information, including the network situation, terminal capability and user preferences. Finally, BSDToBin creates the new adapted multimedia resource according to the transformed BSD (gBSD) description. More details can be found in [5].
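The XSLT transformation step of this chain can be sketched with lxml, which provides XSLT support in Python. The BinToBSD and BSDToBin engines belong to the MPEG-21 reference software and are not reproduced here, and the stylesheet parameter names (screenWidth, screenHeight) are hypothetical examples rather than names defined by the standard.

from lxml import etree

def adapt_bsd(bsd_path, stylesheet_path, screen_w, screen_h):
    """Transform a (g)BSD description according to terminal constraints.
    The adapted description would then be fed to BSDToBin to regenerate
    the bitstream."""
    bsd = etree.parse(bsd_path)
    transform = etree.XSLT(etree.parse(stylesheet_path))
    # XSLT parameters are passed as XPath expressions, hence strparam().
    return transform(bsd,
                     screenWidth=etree.XSLT.strparam(str(screen_w)),
                     screenHeight=etree.XSLT.strparam(str(screen_h)))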
[Figure 1: MPEG-21 Digital Item Adaptation Architecture]

4. INTELLIGENT RESOLUTION ADAPTATION ENGINE
Current image adaptation frameworks either do not provide ROI based resolution adaptation or require users to manually outline the ROI. Neither case is always convenient and feasible; on mobile devices, for example, it is difficult to accurately outline an ROI. In our work, by combining the MPEG-21 standard framework with the enhanced visual attention model, we improve the standard image adaptation engine to automatically detect the visual attention region and decide the adaptation operation in a standardized, dynamic and intelligent way. The advantage of our intelligent resolution adaptation engine is that it preserves, as much as possible, the most attentive (important) information of the original image while satisfying the terminal screen constraints.

The engine utilizes the Structured Scalable Meta-formats (SSM) for Fully Content Agnostic Adaptation [4], proposed as an MPEG-21 reference software module by HP Research Labs, whose architecture is shown in Figure 1. The SSM module adapts the resolution of JPEG2000 images according to their ROIs and the terminal screen constraints of the viewers. The BSD description of the JPEG2000 image is generated by the BSDL module [5]. The ROI is automatically detected using our enhanced visual attention model, and the adaptation operation is dynamically decided by considering both the ROI and the terminal screen size constraint. We change the resolution of a JPEG2000 image by directly adapting the JPEG2000 bitstream in the compressed domain. The whole adaptation procedure is as follows. The BSD description and the ROI information are combined with the image itself as a Digital Item. When a user requests the image, the terminal constraint is sent to the server as a context description (XDI). Then, combining the XDI, the BSD description and the ROI information, the Adaptation Decision-Taking Engine decides on the adaptation process for the image [4]. Finally, the new adapted image and its corresponding BSD description are generated by the BSD Resource Adaptation Engine [5]. The description can be updated to support multiple-step adaptation. A snapshot of BSD Digital Item adaptation is shown in Figure 2.

[Figure 2: Example of Digital Item BSD Adaptation; (a) Adaptation Decision Description; (b) JPEG2000 BSD Adaptation (Green - Original BSD, Blue - Adapted BSD).]

The intelligent ROI adaptation is decided according to the relationship between the image size (I_size), the ROI size (R_size) and the terminal screen size (C_size); a sketch of this decision logic follows the list.

• If C_size > I_size: no adaptation; the original image is sent to the user directly.

• If R_size < C_size < I_size: crop the ROI according to the result of the visual attention analysis, removing the non-attention areas.

• If C_size < R_size: crop the attention region first, then reduce the region resolution to the terminal screen size (a further adaptation step can be performed by the adaptation engine).
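These rules can be summarized in a few lines; since the paper states the comparisons as scalar inequalities, interpreting "smaller" as "fits within both dimensions" is an assumption of this sketch.

def decide_adaptation(i_size, r_size, c_size):
    """Choose the adaptation for (width, height) tuples i_size (image),
    r_size (ROI) and c_size (terminal screen)."""
    def fits(inner, outer):
        return inner[0] <= outer[0] and inner[1] <= outer[1]

    if fits(i_size, c_size):
        return "none"                  # C_size > I_size: send original image
    if fits(r_size, c_size):
        return "crop_roi"              # R_size < C_size < I_size
    return "crop_and_downscale"        # C_size < R_size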
5. EXPERIMENTAL EVALUATION
600 test images were selected from different categories of the standard Corel Photo Library. Several output examples of our intelligent visual attention based adaptation are shown in Figure 3 and Figure 4. Due to the subjectivity of visual attention, we applied the user study methodology of [3] to test the effectiveness of the proposed algorithm. 8 human subjects were invited to evaluate 40 adapted images for each of 4 categories. The users were asked to grade the adapted images from 1 (failed) to 5 (good).

[Figure 3: Example of good intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.]

[Figure 4: Example of bad and failed intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.]

Table 1: User study evaluation - fraction of images in each category

  Category   Failed   Bad     Acceptable   Medium   Good
  Animal     0.02     0.09    0.22         0.33     0.34
  People     0.01     0.11    0.22         0.30     0.36
  Scenery    0.03     0.13    0.22         0.40     0.22
  Others     0.01     0.10    0.26         0.41     0.22
  Average    0.017    0.108   0.23         0.38     0.29

From the evaluation results shown in Table 1, we found that across the different categories of images, close to 87% of the cases on average are acceptable, including 67% that are better than acceptable. Only 10% are bad and 1% failed. Bad results arise mainly because the whole visual object is not included in the cropped image (e.g., the head of an animal is cut off), and the 1% failure rate is due either to the wrong visual object being identified as the attention region, or to images such as scenery shots where there may be no specific visual object. The framework works reasonably well for a general set of natural images.

6. CONCLUSION
Saliency map based visual attention analysis provides an intelligent content understanding mechanism for multimedia adaptation. In this paper, we designed an improved MPEG-21 image adaptation engine for JPEG2000 that uses the improved visual attention model to provide intelligent ROI based image resolution adaptation for different terminal devices according to human visual attention. The advantages of this engine over others are its capability for automatic ROI detection and its dynamic adaptation decisions combining image ROIs with the terminal screen size constraint. This work can be extended to provide Universal Multimedia Access (UMA) services compatible with the MPEG-21 standard.

7. REFERENCES
[1] J. Bormans and K. Hill. MPEG-21 Overview V.5. ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[2] L. Itti, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1998.
[3] Y. Ma and H. Zhang. Contrast-based Image Attention Analysis by using Fuzzy Growing. In Proc. ACM Multimedia, Berkeley, CA, USA, November 2003.
[4] D. Mukherjee, G. Kuo, S. Liu, and G. Beretta. Motivation and Use cases for Decision-wise BSDLink, and a proposal for Usage Environment Descriptor-AdaptationQoS Linking. ISO/IEC JTC 1/SC 29/WG 11, Hewlett-Packard Laboratories, April 2003.
[5] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, and M. Amielh. Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21. Signal Processing: Image Communication, EURASIP, 18(8), 2003.
[6] A. Vetro and C. Timmerer. ISO/IEC 21000-7 FCD - Part 7: Digital Item Adaptation. ISO/IEC JTC 1/SC 29/WG 11/N5845, July 2003.
