network, our aim is to extract a single region which captures all the attention objects. The saliency map is a topographically combined feature map that indicates the possible attention areas of an image. In the next two subsections, we briefly review the process of generating the saliency map, as described in [2].

2.1 Feature extraction
The first step in generating the saliency map is low-level feature extraction. Three low-level features that represent the characteristics of an image are used: intensity, color and orientation. Each feature is subsampled and filtered by Gaussian pyramids. The difference between fine and coarse scales is implemented by a "center-surround" difference operator, denoted $\ominus$, which interpolates a coarse scale to a fine scale and then carries out point-by-point subtraction. Here the center is a pixel at scale c ∈ {2, 3, 4} and the surround is the corresponding pixel at scale s = c + δ, δ ∈ {3, 4}. For intensity, an intensity image I is calculated as I = (r + g + b)/3, where r, g and b represent the red, green and blue channels of the input image, respectively. This intensity image I is used to create a Gaussian pyramid I(σ), where σ ∈ [0..8] is the scale. Six maps are calculated as I(c, s) = |I(c) ⊖ I(s)|. For color, four broadly-tuned color channels are created as R = r - (g + b)/2 for red, G = g - (r + b)/2 for green, B = b - (r + g)/2 for blue, and Y = (r + g)/2 - |r - g|/2 - b for yellow, from which four Gaussian pyramids R(σ), G(σ), B(σ) and Y(σ), respectively, are created. Maps RG(c, s) (1) and BY(c, s) (2) simultaneously account for red/green, green/red double opponency and blue/yellow, yellow/blue double opponency:

RG(c, s) = |(R(c) - G(c)) \ominus (G(s) - R(s))|    (1)

BY(c, s) = |(B(c) - Y(c)) \ominus (Y(s) - B(s))|    (2)

For orientation, Gabor filters are used to extract the orientation information. The orientation maps are obtained from I using oriented Gabor pyramids O(σ, θ), where θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. Orientation features O(c, s, θ) are calculated for each orientation θ as a group of six maps (3):

O(c, s, \theta) = |O(c, \theta) \ominus O(s, \theta)|    (3)

In all, 42 feature maps are created: 6 for intensity, 12 for color and 24 for orientation.

2.2 Saliency Map Generation
The saliency map is generated from the 42 feature maps. A normalization operator N(·) [2] is used to globally enhance maps with a few strong contrast peaks and suppress maps with numerous small peaks. It consists of three steps: 1) normalize the map into a fixed range [1..M]; 2) find the location of the global maximum M and compute the average m̄ of all the other local maxima; 3) multiply the map by (M - m̄)². The difference between the maximum activity and the average is measured by comparing the global maximum activity to the average over all activities. A large difference indicates that the most active location stands out; on the contrary, a small difference indicates that the map contains nothing unique and is suppressed. The feature maps are combined into three "conspicuity maps" corresponding to the three types of feature: Ī for intensity (4), C̄ for color (5) and Ō for orientation (6). They are calculated through across-scale addition, "⊕", which is implemented by reduction of each map to scale four and point-by-point addition:

\bar{I} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(I(c, s))    (4)

\bar{C} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \left[ N(RG(c, s)) + N(BY(c, s)) \right]    (5)

For orientation, maps with the same orientation are first combined to form six groups. The groups are then combined again into a single orientation conspicuity map:

\bar{O} = \sum_{\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}} N\left( \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(O(c, s, \theta)) \right)    (6)

Finally, the three conspicuity maps are normalized and summed into a final saliency map as $S = \frac{1}{3}\left( N(\bar{I}) + N(\bar{C}) + N(\bar{O}) \right)$.

2.3 Enhancement of Visual Attention Model
It is assumed that a small object at the edge of an image is unlikely to be the main attention region, and that an attention region closer to the center of the image is perceptually more important in human vision. We therefore assign a weight to each pixel in the image. Without additional restriction, we assume that the surface of the weights satisfies a Gaussian distribution along both the horizontal and vertical directions ((7), (8)), and that the total weight is the arithmetic mean of the two directions:

N(\mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu_x}{\sigma_x} \right)^2 \right]    (7)

N(\mu_y, \sigma_y^2) = \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left[ -\frac{1}{2} \left( \frac{y - \mu_y}{\sigma_y} \right)^2 \right]    (8)

Both Gaussian curves are centered at the center point of the image by setting μx to half of the width (Width/2) and μy to half of the height (Height/2). σx and σy are fixed to 10 so that the Gaussian curve stays smooth, avoiding a sharp peak that would consider only a small central region of the image. These weights are used to modify the saliency map as in (9):

\bar{S}_{x,y} = S_{x,y} \cdot \frac{N(\mu_x, \sigma_x^2) + N(\mu_y, \sigma_y^2)}{2}    (9)

S̄x,y is the weighted value of the saliency map at location (x, y). By weighting the saliency map differently according to position in the image, tiny attention points at the edges of the image are skipped and the focus stays on the most important attention region. Our experimental results show that this simple factor has a good effect on noise reduction. The modified saliency map now assigns a different value to each point according to its topological attention. In our image adaptation model, a simple region growing algorithm, whose similarity threshold is defined as 30% of the gray-level range of the saliency map, is used to generate the smallest bounding rectangle that includes the identified attention area(s). First, we take the pixels with the maximum value (one or multiple) as seeds and execute the region growing algorithm. In each growing step, the 4-neighbour points are examined; if the difference between a point and the current seed is smaller than the threshold (30% of the gray-level range), the point is added to the seed queue and grown later. The algorithm continues until the seed queue is empty. Finally, the output is one or several separate regions, and we generate the smallest rectangle that includes these regions.
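Similarly, a rough sketch of the enhancement step of Section 2.3: the centre-weighting of equations (7)-(9) and the seeded region growing that yields the attention bounding rectangle. The array layout, helper names and the (top, left, bottom, right) box convention are our assumptions, not the paper's implementation.

```python
# Centre-weighting of the saliency map and seeded region growing.
import numpy as np
from collections import deque

def center_weighted(saliency, sigma=10.0):
    """Equation (9): multiply the saliency map by the mean of two 1-D
    Gaussians centred at Width/2 and Height/2 (sigma fixed to 10)."""
    h, w = saliency.shape
    y, x = np.mgrid[0:h, 0:w]
    gx = np.exp(-0.5 * ((x - w / 2.0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    gy = np.exp(-0.5 * ((y - h / 2.0) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return saliency * (gx + gy) / 2.0

def attention_bbox(saliency, threshold_ratio=0.30):
    """Grow from the maximum-valued pixels; accept 4-neighbours whose value
    differs by less than 30% of the grey-level range; return the smallest
    enclosing rectangle as (top, left, bottom, right)."""
    thr = threshold_ratio * (saliency.max() - saliency.min())
    seeds = list(zip(*np.where(saliency == saliency.max())))
    grown = set(seeds)
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < saliency.shape[0] and 0 <= nc < saliency.shape[1]
                    and (nr, nc) not in grown
                    and abs(saliency[nr, nc] - saliency[r, c]) < thr):
                grown.add((nr, nc))
                queue.append((nr, nc))
    rows = [p[0] for p in grown]
    cols = [p[1] for p in grown]
    return min(rows), min(cols), max(rows), max(cols)
```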
3. MPEG-21 DIGITAL ITEM ADAPTATION
The upcoming MPEG-21 framework considers heterogeneous network environments, different terminal devices and the personal characteristics of users to provide an open framework for all the participants in the multimedia consumption chain [1]. The multimedia resource is combined with metadata describing the network environment, terminal capability and user characteristics to form the fundamental unit of distribution and transaction called the Digital Item. The MPEG-21 multimedia standard defines the technology needed to support Users to exchange, access, consume, trade and otherwise manipulate Digital Items in an efficient, transparent and interoperable way [1].

Digital Item Adaptation is an important part of the MPEG-21 standard (Part 7) [6]. It aims to achieve interoperable transparent access to (distributed) advanced multimedia content by shielding Users from network and terminal installation, management and implementation. In the Final Committee Draft of Digital Item Adaptation [6], an adaptation engine architecture using Bitstream Syntax Description (BSD) [5] is proposed. When an image is input into the adaptation engine, the BinToBSD engine analyses the multimedia resource and generates a format-dependent BSD or a format-independent generic BSD (gBSD) indicating the high-level structure of the corresponding image bitstream. The BSD (gBSD) description of the Digital Item is then adapted with the help of an XML Stylesheet Transformation (XSLT) according to all related information, including network situation, terminal capability and user preference. Finally, BSDToBin creates a new adapted multimedia resource according to the transformed BSD (gBSD) description. More details can be found in [5].
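As an illustration of this pipeline (not the MPEG-21 reference software), the sketch below applies an XSLT stylesheet to a bitstream description with lxml and then performs a BSDToBin-style reassembly by copying the byte ranges that the transformed description keeps. The toy description element `unit` with `start`/`length` attributes is a hypothetical stand-in for the real (g)BSD schema, and all file paths are placeholders.

```python
# Minimal BSD-driven adaptation loop: parse description, transform, reassemble.
from lxml import etree

def adapt_resource(bitstream_path, bsd_path, stylesheet_path, out_path):
    # 1) Parse the (g)BSD-like description produced by a BinToBSD-style analyser.
    bsd = etree.parse(bsd_path)

    # 2) Transform the description with an XSLT stylesheet that encodes the
    #    adaptation decision (e.g. dropping resolution levels).
    transform = etree.XSLT(etree.parse(stylesheet_path))
    adapted_bsd = transform(bsd)

    # 3) BSDToBin step: copy only the byte ranges the adapted description keeps.
    with open(bitstream_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as out:
        for unit in adapted_bsd.iter("unit"):   # hypothetical element name
            start = int(unit.get("start"))
            length = int(unit.get("length"))
            out.write(data[start:start + length])
    return adapted_bsd
```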
Figure 1: Architecture of the digital item adaptation engine (Usage Environment Description and constraints not covered by it, Context Digital Item and XDI extraction, AdaptationQoS and (g)BSD Link, Adaptation Decision-Taking Engine, (g)BSD-based Resource Adaptation Engine, and packetization of the adapted Content Digital Item).

4. INTELLIGENT RESOLUTION ADAPTATION ENGINE
Current image adaptation frameworks either do not provide ROI-based resolution adaptation or require users to manually outline the ROI. Neither case is always convenient or feasible, for example on mobile devices where it is difficult to accurately outline an ROI. In our work, by combining the MPEG-21 standard framework and the enhanced visual attention model, we improve the standard image adaptation engine to automatically detect the visual attention region and decide the adaptation operation in a standardized, dynamic and intelligent way. The advantage of our intelligent resolution adaptation engine is that it preserves, as much as possible, the most attentive (important) information of the original image while satisfying terminal screen constraints.

The engine utilizes the Structured Scalable Meta-formats (SSM) for Fully Content Agnostic Adaptation [4], proposed as an MPEG-21 reference software module by HP Research Labs, whose architecture is shown in Figure 1. The SSM module adapts the resolution of JPEG2000 images according to their ROIs and the terminal screen constraints of the viewers. The BSD description of the JPEG2000 image is generated by the BSDL module [5]. The ROI is automatically detected using our enhanced visual attention model, and the adaptation operation is dynamically decided by considering both the ROI and the terminal screen size constraint. We change the resolution of the JPEG2000 image by directly adapting the JPEG2000 bitstream in the compressed domain. The whole adaptation procedure is as follows. The BSD description and the ROI information are combined with the image itself as a Digital Item. When the user requests the image, its terminal constraint is sent to the server as a context description (XDI). Then, combining the XDI, the BSD description and the ROI information, the Adaptation Decision-Taking Engine decides on the adaptation process for the image [4]. Finally, the new adapted image and its corresponding BSD description are generated by the BSD Resource Adaptation Engine [5]. The description can be updated to support multiple-step adaptation. A snapshot of BSD digital item adaptation is shown in Figure 2.

Figure 2: Example of Digital Item BSD Adaptation; (a) Adaptation Decision Description; (b) JPEG2000 BSD Adaptation (Green - Original BSD, Blue - Adapted BSD).
• If Rsize < Csize < Isize: Crop the ROI according to the result of the visual attention analysis, removing non-attention areas.
• If Csize < Rsize: Crop the attention region first and then reduce the region resolution to the terminal screen size (another adaptation can be performed by the adaptation engine); a minimal decision sketch follows the list.
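The sketch below shows one way these rules might be encoded. The names (image_size, roi_box, screen_size) and the no-adaptation case when the screen is at least as large as the image are our own assumptions; this illustrates the decision logic above, not the engine's actual interface.

```python
# Toy adaptation decision: sizes are (width, height), roi_box is (x0, y0, x1, y1).
def decide_adaptation(image_size, roi_box, screen_size):
    iw, ih = image_size
    rx0, ry0, rx1, ry1 = roi_box            # ROI from the visual attention model
    rw, rh = rx1 - rx0, ry1 - ry0
    cw, ch = screen_size
    ops = []
    if cw >= iw and ch >= ih:
        return ops                          # screen at least as large as image: no change
    if rw <= cw and rh <= ch:
        ops.append(("crop", roi_box))       # Rsize < Csize < Isize: crop to the ROI
    else:
        ops.append(("crop", roi_box))       # Csize < Rsize: crop the ROI first,
        ops.append(("scale", (cw, ch)))     # then reduce it to the screen size
    return ops
```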
5. EXPERIMENT EVALUATIONS
600 test images were selected from different categories of the standard Corel Photo Library. Several output examples of our intelligent visual attention based adaptation are shown in Figure 3 and Figure 4. Due to the subjectivity of visual attention, we applied the user study experiment of [3] to test the effectiveness of the proposed algorithm. 8 human subjects were invited to evaluate 40 adapted images for each of the 4 categories. The users were asked to grade the adapted images from 1 (failed) to 5 (good).

Figure 3: Example of good intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.

Figure 4: Example of bad and failed intelligent adaptation; (a) Original Image; (b) Saliency Map; (c) Cropped Image.

Table 1: User Study Evaluation - percentage of images in each category

Category   Failed   Bad     Acceptable   Medium   Good
Animal     0.02     0.09    0.22         0.33     0.34
People     0.01     0.11    0.22         0.30     0.36
Scenery    0.03     0.13    0.22         0.40     0.22
Others     0.01     0.10    0.26         0.41     0.22
Average    0.017    0.108   0.23         0.38     0.29

From the evaluation results shown in Table 1, we found that 87% of the cases are acceptable or better, including 67% that are better than acceptable. Only 10% are bad and 1% failed. Bad results are mainly caused by the cropped image not including the whole visual object (e.g. the head of an animal), and the 1% failure rate is due either to a wrong visual object being identified as the attention region or to images such as scenery shots where there may not be specific visual objects. The framework works reasonably well for a general set of natural images.
6. CONCLUSION
Saliency map based visual attention analysis provides an
intelligent content understanding mechanism for multimedia
adaptation. In this paper, we designed an improved MPEG-21 image adaptation engine for JPEG2000 that uses the enhanced visual attention model to provide intelligent ROI-based image resolution adaptation for different terminal devices according to human visual attention. The advantages of this engine over others are its automatic ROI detection and its dynamic adaptation decisions, which combine the image ROIs with the terminal screen size constraint. This work can be
extended to provide Universal Multimedia Access (UMA)
services compatible with MPEG-21 standard.
7. REFERENCES
[1] J. Bormans and K. Hill. MPEG-21 Overview V.5. In ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[2] L. Itti, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(11), 1998.
[3] Y. Ma and H. Zhang. Contrast-based Image Attention Analysis by using Fuzzy Growing. In Proc. ACM Multimedia, Berkeley, CA, USA, November 2003.
[4] D. Mukherjee, G. Kuo, S. Liu, and G. Beretta. Motivation and Use Cases for Decision-wise BSDLink, and a Proposal for Usage Environment Descriptor-AdaptationQoS Linking. In ISO/IEC JTC 1/SC 29/WG 11, Hewlett Packard Laboratories, April 2003.
[5] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, and M. Amielh. Bitstream Syntax Description: A Tool for Multimedia Resource Adaptation within MPEG-21. Signal Processing: Image Communication, EURASIP, 18(8), 2003.
[6] A. Vetro and C. Timmerer. ISO/IEC 21000-7 FCD - Part 7: Digital Item Adaptation. In ISO/IEC JTC 1/SC 29/WG 11/N5845, July 2003.