Handling Significant Scale Difference for Object Retrieval in a Supermarket
Yuhang Zhang, Lei Wang, Richard Hartley and Hongdong Li
The Australian National University and NICTA
Email: {yuhang.zhang, lei.wang, richard.hartley, hongdong.li}@anu.edu.au
Abstract—We propose an object retrieval application which can retrieve user-specified objects from a large supermarket. The significant and unpredictable scale difference between the query and the database image is the major obstacle encountered. The widely used local invariant features show their deficiency on such an occasion. To improve the situation, we first design a new weighting scheme which assesses the repeatability of local features against scale variance. We also develop a second method which deals with scale difference by retrieving a query under multiple scales. Our methods have been tested on a real image database collected from a local supermarket and outperform existing image retrieval approaches based on local invariant features. A new spatial check method is also briefly discussed.
Keywords: image retrieval; local feature; scale invariance.
I. INTRODUCTION
Recent years have seen quite a few successful content-based image retrieval systems which can find objects in videos or image sets [1], [2], [3]. One salient general characteristic of these applications is that they describe images with local invariant features. Local invariant features hold many appealing advantages. One of them is scale invariance, which has demonstrated excellent performance in existing image retrieval systems. In this work we conduct object retrieval in a supermarket. In contrast to earlier work, we show the deficiency of local invariant features in handling significant scale difference in image retrieval, and introduce our multi-scale methods to improve the situation. As shown in Figure 1, our query image contains the query object at a generally very large scale. In contrast, the corresponding database image contains the queried object at a much smaller scale together with much clutter. Moreover, since the query is submitted by a user, the retrieval system cannot know the exact scale of queries in advance and thus cannot down-sample each query image with a predefined ratio.
We choose a supermarket as our object retrieval context since it provides a comprehensive test of the capability of an object recognition system. Nowadays a typical supermarket carries around 50,000 different products [4]. This diversity provides a critical challenge to the discriminative capability of an object recognition system.
We thank the Coles store located in Woden Plaza, ACT. Their understanding and support made this work possible and are highly appreciated.
Moreover, in a supermarket each individual object presents itself among strong clutter. The illumination on different objects varies as they are located on different levels of the shelves. There is also a large variety in the viewing directions towards different objects. Occlusion and scale difference are also possible.
We analyze the deficiency of local invariant features and review the general framework of visual-word-based image retrieval in the literature review section, together with a brief introduction to our previous work. We then introduce our newly proposed methods and compare their performance with existing methods in experiments. The proposed methods tackle significant scale difference with two tactics. The first is to adaptively assign the weight of local features (visual words) according to their repeatability over different scales. The second is to describe a query under different scales and retrieve each version of the query independently; the final retrieval result is then constructed by combining the retrieval results obtained under multiple scales. A new spatial check mechanism will also be briefly introduced before the conclusion.

II. LITERATURE REVIEW
To handle scale variance, the Harris-Laplace detector [5], [6] works in two steps. First, a scale-space representation of the Harris function is created and the initial interest points are detected by selecting the local maxima in each 3 × 3 co-scale neighborhood under each scale. Second, the extrema of the LoG (Laplacian of Gaussian) are searched for each initial interest point in the scale space within a variance range of [0.7, 1.4], namely the identification of the characteristic scale. The location of the extrema of the LoG, i.e. the characteristic scale of the initial interest point, is determined by the image pattern rather than the scale at which the image pattern is presented. Thus identical features can be extracted from an image even though the image is presented at different scales. This scale invariance mechanism is inherited by the Harris-Affine and Hessian-Affine detectors [5]. However, as mentioned in their work, the repeatability of Harris-Laplace features is no more than 70% even after a scale change of merely 1.4. That is because many points, as well as the information they carry, disappear after a reduction in resolution.

Figure 1. Query and database image example: the top is the query image containing the query object at a very large scale; the bottom is one of the database images containing the queried object at a much smaller scale together with much clutter.

For the SIFT (Scale Invariant Feature Transform) feature [7], the image is first
down-sampled into sequential octaves. Then within each octave the image is convolved with a Gaussian kernel to build the scale-space representation, from which the local extrema of the DoG (Difference of Gaussian) are detected. By resampling the image at each octave, features are extracted under all possible resolutions and a large range of scales can be covered. However, SIFT still bears the problem of dropping features as the scale of the original image declines. Figure 2 gives an intuitive illustration of the above discussion. The original image (top) is down-sampled at the rates of 2:1 (middle) and 4:1 (bottom) respectively. Harris-Affine features are then detected under each scale. Let us now assume that the top image is the query and the images at the middle and the bottom are to be retrieved from the database. Obviously, the number of features extracted from the query is much larger than that extracted from the two relevant images in the database. Does it hurt? The extra features from the query will not only find no true match but also generate a considerable number of false matches with irrelevant images (in the framework of visual-word-based image retrieval, through clustering, a feature is always matched to some others). Hence an irrelevant image may accumulate a higher score than a relevant image does. In our experiments this problem forms the bottleneck of retrieval performance.

Figure 2. Local feature extraction under different scales: from top to bottom are the original image, the 2:1 down-sampled image and the 4:1 down-sampled image. Hessian-Affine features, marked as yellow ellipses, are extracted from each of the three images individually. As the scale of the image goes down, the number of extracted features declines sharply.
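To make this degradation concrete, the following minimal Python sketch (an illustration only, not the authors' pipeline; it assumes OpenCV is installed, uses SIFT as a stand-in for the affine-covariant detectors used in the paper, and refers to a hypothetical image file "shelf.jpg") counts the keypoints detected as an image is down-sampled:

# Illustrative sketch: count local features as an image is down-sampled.
# Assumes OpenCV (cv2) is available; "shelf.jpg" is a placeholder image path.
import cv2

image = cv2.imread("shelf.jpg", cv2.IMREAD_GRAYSCALE)
detector = cv2.SIFT_create()  # stand-in for the detectors discussed in the paper

for ratio in (1, 2, 4):  # original, 2:1 and 4:1 down-sampling, as in Figure 2
    resized = cv2.resize(image, None, fx=1.0 / ratio, fy=1.0 / ratio,
                         interpolation=cv2.INTER_AREA)
    keypoints = detector.detect(resized, None)
    print(f"{ratio}:1 down-sampled image -> {len(keypoints)} keypoints")

The number of detected keypoints typically drops sharply with each halving of resolution, which is exactly the imbalance between query and database features discussed above.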
The general framework of visual-word-based image retrieval [2] is as follows. First, local invariant features are extracted from each image. The extracted features are then clustered to generate visual words [8]. Features in the same cluster are treated as matched and are assigned the same label, namely the same visual word ID. Each image is then described by the visual words that it contains, in the form of a high-dimensional vector. Each dimension of the vector corresponds to one type of visual word. To achieve high retrieval accuracy, each dimension of the vector can be assigned a tf-idf (term frequency-inverse document frequency) weight [2] as shown in Equation 1, where $w_{ij}$ is the weight of visual word $i$ in image $j$, $n_j$ is the number of visual words in image $j$, $n_{ij}$ is the number of occurrences of visual word $i$ in image $j$, $N_i$ is the number of images in the database that contain visual word $i$, and $N$ is the number of images in the database. The more a visual word appears in the database, the less important it is in describing an image; the more it appears in an individual image, the more important it is in describing that image. During retrieval, the similarity between two images is computed as the $L_p$-norm distance between their corresponding vectors. A larger distance indicates a smaller similarity. As shown in Equation 2, $d_{mn}$ is the $L_p$-norm distance between images $m$ and $n$, and $v_m$ and $v_n$ are the vectors corresponding to the two images. To achieve high retrieval speed, an inverted file is used to index the images in the database. That is, for each visual word the inverted file records all the images that contain this visual word. Given a query, only images in the database sharing visual words with the query are assessed. Our work generally follows the above framework but develops new mechanisms to handle large scale difference, which will be discussed in the next section.

$$w_{ij} = \frac{n_{ij}}{n_j} \log \frac{N}{N_i} \quad (1)$$
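As a concrete illustration of Equation 1 (not part of the original paper; the image names and word IDs below are made up), a toy Python sketch of the tf-idf weighting could look like this:

# Sketch of the tf-idf weighting of Equation 1 over a toy visual-word database.
# Each image is represented by the list of visual word IDs extracted from it.
import math
from collections import Counter

database = {
    "img_a": [3, 3, 7, 12, 7],
    "img_b": [7, 12, 12, 12, 5],
    "img_c": [3, 5, 5, 9],
}

N = len(database)                      # N: number of images in the database
doc_freq = Counter()                   # N_i: number of images containing word i
for words in database.values():
    doc_freq.update(set(words))

def tfidf_vector(words):
    """w_ij = (n_ij / n_j) * log(N / N_i), following Equation 1."""
    n_j = len(words)                   # number of visual words in the image
    counts = Counter(words)            # n_ij for each visual word i
    return {i: (n_ij / n_j) * math.log(N / doc_freq[i])
            for i, n_ij in counts.items()}

vectors = {name: tfidf_vector(words) for name, words in database.items()}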
 
$$d_{mn} = \left\| \frac{v_m}{\| v_m \|_p} - \frac{v_n}{\| v_n \|_p} \right\|_p \quad (2)$$
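Under the reconstructed form of Equation 2, both vectors are normalized by their $L_p$ norm before the difference is taken. A minimal sketch over the sparse word-to-weight dictionaries built above (again purely illustrative) is:

# Sketch of the normalized L_p distance of Equation 2 over sparse vectors.
def lp_distance(v_m, v_n, p=1):
    norm_m = sum(abs(w) ** p for w in v_m.values()) ** (1.0 / p)
    norm_n = sum(abs(w) ** p for w in v_n.values()) ** (1.0 / p)
    words = set(v_m) | set(v_n)
    return sum(abs(v_m.get(i, 0.0) / norm_m - v_n.get(i, 0.0) / norm_n) ** p
               for i in words) ** (1.0 / p)

# Example, reusing the toy tf-idf vectors from the previous sketch:
# lp_distance(vectors["img_a"], vectors["img_b"], p=1)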
The work in this paper is an extension of our previous work [9]. In [9] we showed that the $L_p$-norm distance does not perform well when retrieving objects from strong background clutter. Instead, the inner product in Equation 3 was proposed in [9] to measure the similarity between the query and each database image. Note that the similarity between two images is proportional to the inner product between their corresponding vectors; that is to say, instead of measuring the difference between the two images, we assess their similarity directly. Moreover, Equation 1 was modified to Equation 4 in [9]. According to Equation 1, the weight of each visual word is affected not only by its own count but also by the number of other visual words in the same image. In such a manner the existence of irrelevant clutter affects the matching score between identical visual words. Besides, in a supermarket multiple copies of one product are often placed together on a shelf, and according to Equation 1 the existence of multiple copies will also change the weight of visual words. In this way, a relevant image containing a single true match in strong clutter may be beaten by an irrelevant image containing multiple copies of a false match in weak clutter. Therefore, Equation 1 was replaced by Equation 4, in which the weight of a visual word is determined only by its occurrence frequency in the database.
$$d_{mn} = \langle v_m, v_n \rangle \quad (3)$$

$$w_{ij} = \begin{cases} \log \frac{N}{N_i}, & \text{if } n_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

In [9], to avoid the problem of large scale difference, each query was manually down-sampled to a scale similar to that of its true matches in the database. This drawback has to be removed in order to handle practical retrieval tasks, so in this work we avoid the manual down-sampling and propose new methods to handle the large scale difference.
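The change from Equations 1 and 2 to Equations 3 and 4 can be sketched as follows (illustrative only; it reuses the toy database, doc_freq and N from the earlier sketch):

# Sketch of Equations 3 and 4: binary idf weights for database images and an
# inner-product similarity, so clutter and repeated copies of a product do not
# dilute the score contributed by true matches.
import math

def binary_idf_vector(words, doc_freq, N):
    """w_ij = log(N / N_i) if visual word i occurs in image j, else 0 (Equation 4)."""
    return {i: math.log(N / doc_freq[i]) for i in set(words)}

def inner_product(v_m, v_n):
    """d_mn = <v_m, v_n> (Equation 3): a larger value means a higher similarity."""
    return sum(v_m[i] * v_n[i] for i in set(v_m) & set(v_n))

# Example:
# db_vectors = {name: binary_idf_vector(w, doc_freq, N) for name, w in database.items()}
# inner_product(db_vectors["img_a"], db_vectors["img_b"])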
III. MULTI-SCALE IMAGE RETRIEVAL WITH LOCAL INVARIANT FEATURES
As stated above, scale difference leads to an imbalance in feature numbers between the query and its true matches in the database. The extra features from the larger scale generate mismatches and cause retrieval to fail. We handle this deficiency in the following two ways.
A. Adaptively Assigning Weights before Retrieval
The first way is to adaptively assign weights to different visual words. Weighting visual words is not a new mechanism in the literature; however, previous works do not pay sufficient attention to the repeatability of visual words over scale variance [2], [3]. Based on our previous discussion, a proper weighting scheme should depress the weight of extra words produced by the larger scale and increase the weight of consistent words that can be found under a sequence of scales. To check the repeatability of a visual word with regard to a query object, we resize the query image into multiple scales and then describe each scale of the query with the same vocabulary. As we can imagine, some of the visual words survive multiple scales whereas some do not. We then assign a weight to each visual word through Equation 5, where $w_{ij}$ is the weight of visual word $i$ in query image $j$, $q_{ijk}$ is the number of occurrences of visual word $i$ in image $j$ under scale $k$, and $q_{jk}$ is the total number of visual words in query image $j$ under scale $k$. Through Equation 5, a visual word surviving more scales becomes more important for a query. Note that the usage of $q_{jk}$ implies that a visual word appearing under a scale where few visual words can be found receives a higher weight. This is reasonable because such visual words are more consistent against scale change.
$$w_{ij} = \sum_k \frac{q_{ijk}}{q_{jk}} \log \frac{N}{N_i} \quad (5)$$

Note that we only resize the query image, not the database images, because resizing all images in the database and extracting features under all possible scales would not only demand enormous computation time but also require huge storage space. For the database images, we still weight each visual word according to Equation 4, which can effectively handle background clutter and multiple copies of the same object. Besides assessing the repeatability of each visual word, Equation 5 also gives each query a more comprehensive description because it brings features extracted from multiple scales into one bag. Building an expanded description of the query image is not a new idea in the literature. Whereas the work in [10] expands the query description based on the images in the database, our expansion focuses on the scale aspect and does not require additional images. In later sections we will refer to this method as A.
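The multi-scale query weighting of Equation 5 can be sketched as follows (the per-scale word lists are toy data; in practice they would come from quantizing the features extracted from each resized version of the query):

# Sketch of Equation 5: weight query visual words by their repeatability over scales.
# scales_words[k] holds the visual word IDs found in the query under scale k.
import math
from collections import Counter

scales_words = [
    [3, 3, 7, 12, 9, 5],   # original scale: many words
    [3, 7, 12],            # 2:1 down-sampled query: fewer words survive
    [3, 12],               # 4:1 down-sampled query: fewer still
]

def multiscale_query_weights(scales_words, doc_freq, N):
    """w_ij = sum_k (q_ijk / q_jk) * log(N / N_i), following Equation 5."""
    weights = Counter()
    for words in scales_words:
        q_jk = len(words)              # total number of visual words at this scale
        counts = Counter(words)        # q_ijk for each visual word at this scale
        for i, q_ijk in counts.items():
            weights[i] += (q_ijk / q_jk) * math.log(N / doc_freq.get(i, 1))
    return dict(weights)

# Words that survive more scales accumulate larger weights:
# multiscale_query_weights(scales_words, doc_freq, N)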
B. Combining Matching Scores after Retrieval
In Section III-A we assign weights to visual words according to their repeatability over different scales. Parallel to that method, we can simply treat the query under different scales as independent queries. After performing retrieval with each of them individually, we combine the results from the different scales to form a final result for the original query. Both the method in the current section and that in the next section follow this idea.
Denote the matching score obtained by the query $m$ under scale $k$ with database image $n$ as $d_{mkn}$. Since the combination is implemented over multiple scales, we cannot compute the matching score $d_{mkn}$ simply using Equation 3. That is
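The exact combination rule for the per-scale scores is not given in this excerpt, so the sketch below only illustrates the general idea of Section III-B: retrieve with the query at each scale independently and merge the per-scale scores into a single ranking. The plain summation used here is an assumed placeholder, and all names are hypothetical.

# Sketch of the idea of Section III-B: combine per-scale matching scores.
# The actual combination rule is not shown in this excerpt; a plain sum is assumed.
def retrieve_multiscale(query_vectors_per_scale, db_vectors, similarity):
    combined = {}
    for v_query in query_vectors_per_scale:        # one weighted query vector per scale k
        for name, v_db in db_vectors.items():
            combined[name] = combined.get(name, 0.0) + similarity(v_query, v_db)
    # Rank database images by the combined score, highest first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)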
