
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 6, JUNE 2011, ISSN 2151-9617

Feature Selection in Top-Down Visual Attention Model
Amudha J. and Soman K.P.

Abstract: A top-down visual attention model for signboard recognition is presented to reduce computational complexity and to enhance the quality of recognition. The approach is based on a biologically motivated attention system that detects regions of interest in images using concepts of the human visual system. A top-down guided visual search module identifies the most discriminant features of the previously learned target object and uses those features to recognize it. This enables a significantly faster classification rate with less computation, as illustrated by identifying signboards in a road scene environment.

Keywords: Computational attention system, Decision tree, Human perception, Saliency, Visual attention.




1. INTRODUCTION
Visual attention models have become remarkably good at interpreting complex visual scenes. However, the inherent nature of the human visual attention system prevents humans from grasping an entire scene at once; instead, a sequence of eye movements, called saccades, is performed to take in the scene. Visual attention models select a subset of the visual input as the goal of a saccade, identified as the focus of attention. From the focus of attention the model then shifts to other neighboring regions through an inhibition-of-return mechanism adopted in the algorithm. In general, bottom-up attention systems direct the gaze to salient regions based on primary visual information such as color, intensity and orientation, whereas top-down attention systems identify salient regions using prior knowledge of the target objects and build a goal-directed visual search system suitable for object recognition.
Detecting regions of interest with visual attention
is an important mechanism in human visual
perception. However, what is of interest depends on
the situation. Top-down influences like knowledge,
motivations, and emotions play an important role in
human visual attention. For example, drivers perceive sign cues in a road scene in greater detail than any other available cues. In human behavior, bottom-up and top-down attention are always intertwined and cannot be considered separately, although one may outweigh the other in certain situations. Even in a pure bottom-up approach, each person has his or her own preferences, resulting in individual scan-paths for the same scene. On the other hand, even when the search for a target is highly focused, the bottom-up pop-out effect cannot be suppressed; this effect, called attentional capture [11], plays a major role. Despite its importance in the human visual system, top-down influences are rarely considered in computational attention systems. One reason is that the neuro-biological foundations are not yet completely understood. Nevertheless, extending an attention system with top-down mechanisms is unavoidable if regions of interest are to be detected depending on a task. An important model developed by Simone Frintrop [6], [7], [8], [9], based on Itti's work [4], [5], is able to incorporate top-down cues. In this model's learning phase, the system learns target-relevant features from a training image, considering the properties of the target as well as its surroundings. In search mode, the system determines which information has to be excited or inhibited and, based on that, computes a target-dependent top-down saliency map.
In this paper, a novel detection system using a visual attention model for detecting signboard objects in real-world images is presented. First, regions of interest are focused by an attention

- Amudha J. is with the Department of Computer Science, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bangalore 560035, India.
- Soman K.P. is with the Center for Excellence in Networking, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore 641105, India.


model using pure bottom-up attention [1], [2], and then features are extracted within the focus-of-attention area. The extracted features are fed to a decision tree to identify the discriminant features required for the classifier model. After this preprocessing, the top-down visual attention system automatically directs the attention model to detect the learned object in any environment. The performance of the top-down visual attention system was tested for signboard detection in a road scene environment. In the proposed model a decision tree is used in the learning phase to identify the most discriminant features, and those features are used to identify the target, which in turn reduces the computational complexity and increases the efficiency of the system.
We briefly review the details of the model in Section 2, with extensions in the same formal framework. In Section 3 the performance of the model is analyzed on various cases and compared with the bottom-up attention model. In Section 4 conclusions and further enhancements are discussed.
2. SIGNBOARD DETECTION SYSTEM
The block diagram in Fig. 1 describes the system, which has three major parts: the Visual Attention Model (VAM), which identifies the most attended region [5]; a feature extraction stage, which extracts features from the VAM for signboard detection; and a feature selection module, which identifies the relevant features required for signboard identification and uses them for further classification. A high-level sketch of this flow is given below; the following sections present the algorithm in detail.
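
As a roadmap, the three modules can be chained as in the following Python-flavored sketch. This is purely illustrative: the helper names (visual_attention_model, extract_foa_features) are placeholders for the stages detailed in Sections 2.1-2.3, not the authors' implementation.

```python
def signboard_system(image, tree):
    """Three-stage flow of Fig. 1 (illustrative sketch; the helpers
    are placeholders for the modules described in Sections 2.1-2.3)."""
    saliency, maps = visual_attention_model(image)    # Section 2.1: VAM
    features = extract_foa_features(maps, saliency)   # Section 2.2: features at the FOA
    return tree.predict([features])[0]                # Section 2.3: decision tree verdict
```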

2.1 Visual Attention Model
The input image I is sub-sampled into a Gaussian pyramid on four different scales, and each pyramid level is decomposed into channels for red (R), green (G), blue (B), yellow (Y) and intensity (I) using (1), (2), (3), (4) and (5). The orientation channels are obtained from I using oriented Gabor pyramids O(σ, θ), where σ ∈ [0..5] is the scale and θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation.

$I = (r + g + b)/3$  (1)
$R = r - (g + b)/2$  (2)
$G = g - (r + b)/2$  (3)
$B = b - (r + g)/2$  (4)
$Y = (r + g)/2 - |r - g|/2 - b$  (5)
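
For concreteness, here is a minimal NumPy sketch of Eqs. (1)-(5), assuming an RGB input with components r, g, b scaled to [0, 1]; clipping negative channel responses to zero follows Itti et al. [5] and is an assumption, since the paper does not restate that step.

```python
import numpy as np

def color_channels(img):
    """Intensity and broadly tuned color channels of Eqs. (1)-(5).
    img: float array of shape (H, W, 3) holding r, g, b in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    I = (r + g + b) / 3.0                        # Eq. (1)
    R = r - (g + b) / 2.0                        # Eq. (2)
    G = g - (r + b) / 2.0                        # Eq. (3)
    B = b - (r + g) / 2.0                        # Eq. (4)
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b  # Eq. (5)
    # Negative responses carry no channel evidence and are clipped.
    R, G, B, Y = (np.clip(c, 0.0, None) for c in (R, G, B, Y))
    return I, R, G, B, Y
```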

Each feature is computed in a center-surround structure as the difference between a fine and a coarse scale. The center of the receptive field corresponds to a pixel at level c ∈ {2, 3} of the pyramid, and the surround corresponds to the pixel at level s = c + 1. Hence three feature types are computed: one encodes on/off image intensity contrast, and two encode the red/green and blue/yellow double-opponent channels. The intensity feature type encodes the modulus of image luminance contrast, that is, the absolute value of the difference between the intensity at the center and the intensity in the surround, as given in (6):

$I(c, s) = N(|I(c) \ominus I(s)|)$  (6)

The across-scale difference between two maps, denoted $\ominus$, is obtained by interpolating the coarser map to the finer scale and subtracting point by point; a minimal sketch of this operator is given below. The quantities corresponding to the double-opponency cells in the primary visual cortex are then computed by center-surround differences across the normalized color channels.
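
The following sketch shows one way to build the pyramids and the across-scale operator; using OpenCV's dyadic pyramid and bilinear interpolation is our assumption, as the paper does not name an implementation.

```python
import cv2

def gaussian_pyramid(channel, levels=6):
    """Dyadic Gaussian pyramid; level 0 is the full-resolution channel."""
    pyr = [channel]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """Across-scale difference: interpolate the coarser surround level s
    up to the finer center level c, then subtract point by point.
    Eq. (6) takes the absolute value of this difference."""
    h, w = pyr[c].shape[:2]
    surround = cv2.resize(pyr[s], (w, h), interpolation=cv2.INTER_LINEAR)
    return pyr[c] - surround
```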




Fig. 1 Visual Attention Model with decision tree classifier

Each of the red/green feature maps is created by first computing (red - green) at the center, then subtracting (green - red) in the surround, and finally
taking the absolute value. Accordingly, maps RG(c,s) are created in the model to simultaneously account for red/green and green/red double opponency, BY(c,s) for blue/yellow and yellow/blue double opponency, and orientation maps analogously, using (7), (8) and (9):

$RG(c, s) = N(|(R(c) - G(c)) \ominus (G(s) - R(s))|)$  (7)
$BY(c, s) = N(|(B(c) - Y(c)) \ominus (Y(s) - B(s))|)$  (8)
$O(c, s, \theta) = N(|O(c, \theta) \ominus O(s, \theta)|)$  (9)
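
Building on the previous sketches (channel pyramids R, G, B, Y and a dict O of Gabor responses keyed by (level, theta)), Eqs. (7)-(9) can be sketched as below; the normalization N(.) is omitted for brevity.

```python
import cv2
import numpy as np

def _upsample_to(coarse, fine):
    """Interpolate a coarse map to the resolution of a fine map."""
    h, w = fine.shape[:2]
    return cv2.resize(coarse, (w, h), interpolation=cv2.INTER_LINEAR)

def rg_map(R, G, c, s):
    """RG(c,s), Eq. (7): (R-G) at the center minus (G-R) in the surround."""
    return np.abs((R[c] - G[c]) - _upsample_to(G[s] - R[s], R[c]))

def by_map(B, Y, c, s):
    """BY(c,s), Eq. (8): the analogous blue/yellow double opponency."""
    return np.abs((B[c] - Y[c]) - _upsample_to(Y[s] - B[s], B[c]))

def orientation_map(O, c, s, theta):
    """O(c,s,theta), Eq. (9): center-surround contrast per orientation."""
    return np.abs(O[(c, theta)] - _upsample_to(O[(s, theta)], O[(c, theta)]))
```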

The feature maps are then combined into three conspicuity maps, intensity $\bar{I}$ (10), color $\bar{C}$ (11) and orientation $\bar{O}$ (12), at the scale of the saliency map (σ = 4). These maps are computed through across-scale addition, denoted $\oplus$, where each map is reduced to scale four and added point by point:

$\bar{I} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(I(c, s))$  (10)

$\bar{C} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} [N(RG(c, s)) + N(BY(c, s))]$  (11)

To compute the orientation conspicuity map, four
intermediary maps are created by combining the six
feature maps. These intermediary maps are then
combined into a single orientation conspicuity map
using (12).

$\bar{O} = \sum_{\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}} N\Big(\bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(O(c, s, \theta))\Big)$  (12)

The three conspicuity maps are then normalized and summed into the final input S to the saliency map:

$S = N(\bar{I}) + N(\bar{C}) + N(\bar{O})$  (13)

where N(.) is the non-linear normalization operator. From the saliency map, the most attended regions are identified in order of decreasing saliency, based on the selective tuning model [2].
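
Eqs. (10)-(13) amount to resizing, normalizing and summing the maps. A sketch follows, with the normalization operator N passed in as a plug-in, since the paper does not restate its exact form (Itti's iterative normalization is one common choice):

```python
import cv2

def across_scale_add(maps, shape):
    """The circled-plus of Eqs. (10)-(12): reduce every map to the
    saliency-map scale and add point by point."""
    h, w = shape
    return sum(cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR)
               for m in maps)

def saliency_input(I_maps, RG_maps, BY_maps, O_maps, shape, N):
    """Conspicuity maps and the final input S, Eqs. (10)-(13).
    I_maps, RG_maps, BY_maps: lists over the (c, s) pairs;
    O_maps: dict mapping each orientation theta to such a list."""
    I_bar = across_scale_add([N(m) for m in I_maps], shape)                 # (10)
    C_bar = across_scale_add([N(rg) + N(by)
                              for rg, by in zip(RG_maps, BY_maps)], shape)  # (11)
    O_bar = sum(N(across_scale_add([N(m) for m in maps], shape))
                for maps in O_maps.values())                                # (12)
    return N(I_bar) + N(C_bar) + N(O_bar)                                   # (13)
```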

2.2 Feature Extraction
Since we are only interested in the saliency of the target object, here a signboard, the most salient region is determined. From the focus of attention, an 11-dimensional feature vector is determined. The features are obtained from the values of the 7 feature maps (RG, BY, intensity, and orientations 0°, 45°, 90°, 135°), the 3 conspicuity maps (color, intensity, orientation) and the saliency map. Each value describes how much the corresponding feature contributes to the FOA. The classifier then identifies whether the feature vector obtained from the target region matches the actual target.
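
A sketch of this descriptor follows; the paper does not specify how the map values are aggregated inside the FOA, so mean activation inside a boolean FOA mask is an assumption used purely for illustration.

```python
import numpy as np

def foa_feature_vector(maps_11, foa_mask):
    """11-D descriptor of the focus of attention: one value per map
    (7 feature maps, 3 conspicuity maps, 1 saliency map), each map
    resized to the mask resolution beforehand."""
    return np.array([m[foa_mask].mean() for m in maps_11])
```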

2.3 Feature Selection & Classification
To identify whether the salient region is the actual target, a decision tree classifier is used. The decision tree construction is as follows. The tree starts as a single node representing the training samples. If the samples are all of the same class, the node becomes a leaf and is labeled with that class. Otherwise the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the attribute that best separates the samples into individual classes. A branch is created for each known value of the test attribute and the samples are partitioned accordingly. The algorithm applies the same process recursively to form a decision tree for the samples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants, i.e., an attribute is used only once.
The recursive partitioning stops when any one of the following conditions is true: (i) all samples for a given node belong to the same class; (ii) there are no remaining attributes on which the samples may be further partitioned, in which case majority voting is employed, converting the node into a leaf labeled with the majority class among its samples; or (iii) there is no branch test-attribute left, i.e., all attributes have been used, in which case a leaf is created with the majority class. The major features contributing to the target classification are shown in Fig. 4. After the feature vector for the target object has been computed, it can be used to search for the target in other images. A minimal training sketch is given below.
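
The construction above corresponds to an entropy-criterion decision tree; a minimal scikit-learn sketch follows. The data here are random stand-ins for the 338 11-D training vectors of Table 2, not the paper's samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 338 11-D FOA feature vectors with class labels.
X = np.random.rand(338, 11)
y = np.random.choice(["pedestrian", "crossing", "bike", "background"], 338)

# criterion="entropy" splits by information gain, as described above.
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)

# The importances expose the discriminant hierarchy of the features
# (cf. Fig. 4 and Table 4).
ranking = np.argsort(tree.feature_importances_)[::-1]
```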
Bottom-up attention first yields the initial pop-out region, and the classifier checks whether it matches the target feature vector. If there is a match the process ends; otherwise the classifier checks the next level of attention. This process is repeated for three levels, after which the samples from all other regions are tested. Hence the computation time involved in identifying the hierarchy of attention levels is reduced and mapped to the categorization of the signboard. A sketch of this search loop is given below.
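
The level-by-level matching can be sketched as follows; most_salient_region and inhibit are assumed helpers (winner-take-all selection and inhibition of return), and tree is the classifier trained above.

```python
def search_target(saliency, maps_11, tree, max_levels=3):
    """Test the most salient region first, inhibit it, move on to the
    next focus of attention, and give up on the FOA hierarchy after
    three levels (remaining regions are then tested exhaustively)."""
    for level in range(max_levels):
        foa = most_salient_region(saliency)           # assumed helper
        vec = foa_feature_vector(maps_11, foa)
        if tree.predict([vec])[0] != "background":
            return foa, level                         # target found
        saliency = inhibit(saliency, foa)             # inhibition of return
    return None, max_levels
```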
3. RESULTS AND ANALYSIS
Fig. 2 shows a sample of the signboards considered to test the system; it includes pedestrian, crossing and bike signs. The analysis is done at two levels: 1) a survey of signboard images using the visual attention model, and 2) an analysis of feature selection.




Fig. 2 Sample test images of signboards: pedestrian, crossing and bike signs

3.1 Survey using Visual Attention Model
First, a survey was done to analyze the performance of the bottom-up visual attention model on the signboard dataset using Itti's Saliency Toolbox [12].



Fig. 3 Attention area identified by VAM in various cases (a)
First level (b) Fourth level (c) Failed

As shown in Fig. 3, the VAM identifies the signboard in some cases at the first level of iteration, in others at later levels, and in a few cases it fails.

TABLE 1
RESULTS OF BOTTOM-UP VISUAL ATTENTION MODEL

Image Type | Total Images | First Level | Other Levels | Failed to Identify
Pedestrian |      16      |      7      |       9      |          0
Bike       |      16      |      5      |       9      |          2
Crossing   |      16      |      2      |      13      |          1

Table 1 gives statistics on the number of cases in which the signboard was identified as the first focus of attention, the cases in which it was identified only at the second or later levels of iteration, and the number of failed cases. In the dataset considered, the pedestrian image set has 7 images in which the attention model automatically pops out the signboard region. In the other 9 images the signboard did not have a pop-out nature, so other regions were attended first and the signboard was identified at a later level, i.e., in a subsequent focus of attention. For the bike and crossing sets, first-level attention was obtained in only a few cases, and there were cases where the bottom-up attention model was unable to identify the target region even after eight levels of iteration.

3.2 Feature Selection
The features obtained from the bottom-up model are used as the training set for the classifier. To train the decision tree classifier, a set of eight samples of each type was considered. The visual attention model was computed up to the saliency map.

TABLE 2
TRAINING AND TESTING SAMPLES FOR SIGNBOARD DETECTION

Image      | Training samples | Testing samples
Pedestrian |        88        |       30
Crossing   |        75        |       20
Bike       |       175        |       40
Total      |       338        |       90

The eleven features, from the 7 feature maps (RG, BY, intensity, and orientations 0°, 45°, 90°, 135°), the 3 conspicuity maps (color, intensity, orientation) and the saliency map, were extracted, and together with the class to which each sample belongs they form the input data for training the classifier. Table 2 shows the number of samples used for training and testing. The samples include both target-region information and non-class information, that is, background information.

The samples also include partially occluded regions. Fig. 4 shows the decision tree classifier, in which the most discriminant features for identifying the class appear; from it the hierarchy of the eleven features can be identified.

Table 3 shows the classification rate for various combinations of features. The classification rate does not change significantly when the number of features chosen is reduced. Hence the signboard object achieves the same classification rate with just two discriminant features, namely the color conspicuity map and the 90° orientation feature map, at the first focus of attention itself.



Fig. 4 Decision Tree Classifier

There is no significant change in the classification rate when the feature vector size is reduced from 11 to 2. Experiments show that the efficiency of the system is 88%. From Table 3 it is explicit that color and orientation features are sufficient to achieve the maximum efficiency for the application considered; the efficiency does not improve as the number of features increases. Hence only two features are required to discriminate the road sign from the rest of the scene. The cases where the model failed to identify the target at the first level of search are mostly those in which the target is partially occluded, or in which the image contains only a small region of interest and mostly background information. A sketch of this reduced-feature evaluation follows.
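
Reusing X and y from the training sketch in Section 2.3, and assuming the columns of X follow the feature numbering of Table 4 (feature 3 = color conspicuity map, feature 8 = 90° orientation feature map, i.e. zero-based columns 2 and 7), the reduced-feature classification rate can be estimated like this:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)

# Keep only the two discriminant features: color CM and 90-degree
# orientation FM (columns 2 and 7 under the assumed layout).
tree2 = DecisionTreeClassifier(criterion="entropy")
tree2.fit(X_tr[:, [2, 7]], y_tr)
rate = accuracy_score(y_te, tree2.predict(X_te[:, [2, 7]]))
```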

3.3 Advantages of the System
The feature selection adopted here helps reduce the search time involved in identifying the target object. An added advantage is that the classifier was able to detect signboards in cases where the bottom-up visual attention system failed, as shown in Fig. 5.



Fig. 5 Cases where the attention model failed to identify the signboard in the background, whereas our system was able to identify the target information

TABLE 3
CLASSIFICATION RATES FOR VARIOUS FEATURE SETS IDENTIFIED AT THE FIRST LEVEL OF SEARCH

Number of Features | Features Selected       | Classification rate
        11         | 3,8,7,2,4,6,9,1,5,10,11 | 97%
         8         | 3,8,7,2,4,6,9,1         | 97%
         7         | 3,8,7,2,4,6,9           | 97%
         5         | 3,8,7,2,4               | 97%
         4         | 3,8,7,2                 | 98%
         3         | 3,8,7                   | 96%
         2         | 3,8                     | 98%

The feature selection applied before recognition reduces the search time as well as giving a better recognition rate. The feature selection module followed by the decision tree classifier was able to identify signboards that the visual attention model failed to identify. Hence using a feature selection module, instead of feeding all the features of the visual attention model to the classifier, improves both the time and the quality of recognition.
Table 4 shows the features that have inhibition and excitation effects on the region of interest. The excitation values of features such as the color CM and the 90° orientation FM can be used to construct a top-down map that serves as a mask to identify the region of interest, as discussed in VOCUS [7], [10]. However, the classifier used in this system was able to achieve a better detection rate at the first level of attention itself, as shown in Table 5.

TABLE 4
THE HIERARCHY OF FEATURES WITH THEIR INHIBITION AND EXCITATION VALUES FOR RECOGNIZING THE TARGET

No | Feature             | Level | Inhibition Value | Excitation Value
 1 | Color RG FM         |   7   | -                | 0.13
 2 | Color BY FM         |   6   | 0.9, <0.26       | <0.09
 3 | Color CM            |   1   | =0               | 0-53.32
 4 | Intensity FM        |   5   | <0.38, 0.38      | <0.54, 0.54
 5 | Intensity CM        |   8   | -                | 0
 6 | Orientation 0° FM   |   6   | 0.43, <0.89      | <0.43
 7 | Orientation 45° FM  |   3   | 0.27-0.53        | <0.27, 0.54
 8 | Orientation 90° FM  |   2   | 0.27-0.53        | >0.53, <0.27
 9 | Orientation 135° FM |   6   | <0.35            | 0.35-<0.73
10 | Orientation CM      |   9   | -                | 0
11 | Saliency Map        |  10   | -                | 0

TABLE 5
RESULTS OF TOP-DOWN VISUAL ATTENTION MODEL

Image Type | Total Images | First Level | Other Levels | Failed to Identify
Pedestrian |      16      |     16      |       0      |          0
Bike       |      16      |     15      |       1      |          0
Crossing   |      16      |     14      |       2      |          0

The advantage of this system over the existing VOCUS is four-fold. First and foremost, it doesn't require any training algorithm for the selection of training samples. Secondly, there is no need to fix the weight values of the various features. Thirdly, there is no additional computation to build a top-down saliency map. Finally, the testing samples considered here include partially occluded information, which was not studied by the previous system.


4. CONCLUSIONS

This paper proposes a novel approach for object detection using an attention system with a decision tree for feature selection and classification. The model decreases the computational complexity and increases the quality of the search for objects. The model has also been tested on a vehicle tracking application, and the results are encouraging [3]. Further research can be done on increasing the number of visual stimuli, such as motion, as prominent visual features in the image, and on the weights that can be allocated to the features.

REFERENCES
[1] J. Amudha, K.P. Soman, and K. Vasanth, "Video Annotation Using Saliency," Proc. Int'l Conf. on Image Processing, Computer Vision and Pattern Recognition, Vol. 1, pp. 191-195, 2008.
[2] J. Amudha and K.P. Soman, "Selective Tuning Visual Attention Model," International Journal of Recent Trends in Engineering, Vol. 2, No. 2, pp. 117-119, 2009.
[3] J. Amudha and K.P. Soman, "Saliency Based Visual Tracking of Vehicles," International Journal of Recent Trends in Engineering, Vol. 2, No. 2, pp. 114-116, 2009.
[4] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, Vol. 4, pp. 219-227, 1985.
[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259, 1998.
[6] S. Frintrop and E. Rome, "Simulating Visual Attention for Object Recognition," Fraunhofer Institute for Autonomous Intelligent Systems (AIS), Schloss Birlinghoven, D-53754 Sankt Augustin, Germany.
[7] S. Frintrop, "VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search," Ph.D. dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany, July 2005; published 2006 in Lecture Notes in Artificial Intelligence (LNAI), Vol. 3899, Springer, Berlin/Heidelberg.
[8] S. Frintrop, P. Jensfelt, and H. Christensen, "Pay Attention When Selecting Features," Proc. Int'l Conf. on Pattern Recognition (ICPR 2006), 2006.
[9] S. Frintrop, P. Jensfelt, and H. Christensen, "Attentional Robot Localization and Mapping," ICVS Workshop on Computational Attention & Applications (WCAA), 2007.
[10] S. Frintrop, M. Klodt, and E. Rome, "A Real-Time Visual Attention System Using Integral Images," Proc. Int'l Conf. on Computer Vision Systems (ICVS), 2007.
[11] J. Theeuwes, "Top-down search strategies cannot override attentional capture," Psychonomic Bulletin & Review, Vol. 11, pp. 65-70, 2004.
[12] Saliency Toolbox. [Online]. Available: http://www.saliencytoolbox.net/




Author Biographies
Amudha J. is an Assistant Professor at Amrita Vishwa Vidyapeetham, Bangalore. Her qualifications include a B.E. in Electrical and Electronics Engineering from Bharathiyar University, Coimbatore, India, and an M.E. in Computer Science and Engineering from Anna University, Chennai, India. Her research interests include image processing, computer vision and soft computing.

Soman K.P. is the Head of CEN, Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore 641105. His qualifications include a B.Sc. (Engg.) in Electrical Engineering from REC, Calicut; a P.M. Diploma in SQC and OR from ISI, Calcutta; an M.Tech in Reliability Engineering from IIT Kharagpur; and a PhD in Reliability Engineering from IIT Kharagpur. Dr. Soman held the first rank and the institute silver medal for his M.Tech at IIT Kharagpur. His areas of research include optimization, data mining, signal and image processing, neural networks, support vector machines, cryptography and bio-informatics. He has over 55 papers in national and international journals and proceedings. He has conducted various technical workshops in India and abroad.
