Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003) 2-Volume Set
0-7695-1950-4/03 $17.00 © 2003 IEEE
recognition and learning. In Section 3 we introduce in detail the method used in our experiments. In Section 4 we discuss experimental results on four real-world categories of different objects (Figure 1).

[Figure 1: rows of sample images in five columns labeled Faces, Motorbikes, Airplanes, Spotted cats, Background.]

Figure 1: Some sample images from our datasets. The first four columns contain example images from the four object categories used in our experiments. The fifth column shows examples from the background dataset. This dataset is obtained by collecting images through the Google image search engine (www.google.com). The keyword "things" is used to obtain hundreds of random images. Note that only grayscale information is used in our system. Complete datasets can be found at http://vision.caltech.edu/datasets.

2. Approach

Our goal is to learn a model for a new object class with very few training examples (in the experiments we use just 1 to 5) in an unsupervised manner. We postulate that this may be done if we can exploit information obtained from previously learnt classes. Three questions have to be addressed in order to pursue this idea: How do we represent a category? How do we represent "general" information coming from the categories we have learnt? How do we incorporate a few observations in order to obtain a new representation? In this section we address these questions in a Bayesian setting. Note that we will refer to our current algorithm as the Bayesian One-Shot algorithm.

2.1. Recognition

The model is best explained by first considering recognition. We will introduce a generative model of the object category, built upon the constellation model for object representation [6, 9, 10].

2.1.1 Bayesian framework

We start with a learnt object class model and its corresponding model distribution $p(\theta)$, where $\theta$ is a set of model parameters for the distribution. We are then presented with a new image and we must decide if it contains an instance of our object class or not. In this query image we have identified $N$ interesting features with locations $\mathcal{X}$ and appearances $\mathcal{A}$. We now make a Bayesian decision, $R$. For clarity, we explicitly express training images through the detected feature locations $\mathcal{X}_t$ and appearances $\mathcal{A}_t$.

  $R = \dfrac{p(\mathrm{Object} \mid \mathcal{X}, \mathcal{A}, \mathcal{X}_t, \mathcal{A}_t)}{p(\mathrm{No\ Object} \mid \mathcal{X}, \mathcal{A}, \mathcal{X}_t, \mathcal{A}_t)}$   (1)

  $\phantom{R} = \dfrac{p(\mathcal{X}, \mathcal{A} \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})\, p(\mathrm{Object})}{p(\mathcal{X}, \mathcal{A} \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{No\ Object})\, p(\mathrm{No\ Object})}$   (2)

  $\phantom{R} \approx \dfrac{\int p(\mathcal{X}, \mathcal{A} \mid \theta)\, p(\theta \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})\, d\theta}{\int p(\mathcal{X}, \mathcal{A} \mid \theta_{bg})\, p(\theta_{bg} \mid \mathrm{No\ Object})\, d\theta_{bg}}$   (3)

Note that the ratio $p(\mathrm{Object})/p(\mathrm{No\ Object})$ in Eq. 2 is usually set manually, hence it is omitted in Eq. 3. Evaluating the integrals in Eq. 3 analytically is typically impossible. Various approximations can be made to simplify the task. The simplest one, Maximum Likelihood (ML), assumes that $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})$ is a delta-function centered at $\theta^{\mathrm{ML}}$ [6, 10]. This allows the integral to collapse to $p(\mathcal{X}, \mathcal{A} \mid \theta^{\mathrm{ML}})$. The ML approach is clearly a crude approximation to the integral: it assumes $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})$ to be well peaked, so that $\theta^{\mathrm{ML}}$ is a suitable estimation of the entire distribution. Such an assumption relies heavily on sufficient statistics of the data to give rise to a sensible distribution. But in the limit of just a few training examples, ML is most likely to be ill-fated.

At the other extreme, we can use numerical methods such as Markov-Chain Monte-Carlo (MCMC) to give an accurate estimate, but these can be computationally very expensive. In the constellation model the dimensionality of $\theta$ is large for a reasonable number of parts, making MCMC methods impractical for our problem.

An alternative is to make approximations to the integrand until the integral becomes tractable. One method of doing this is to use a variational bound [11, 12, 13, 14]. We show in Sec. 2.2 how these methods may be applied to our model.
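The contrast between the ML plug-in and the full Bayesian average in Eq. 3 can be sketched in one dimension. The following is a toy illustration under assumed names and values, not the paper's implementation: a Gaussian likelihood with known noise, a single training example, and a Normal prior on the mean standing in for knowledge from previously learnt categories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D illustration of Eq. 3: compare the ML plug-in likelihood
# p(x | theta_ML) with the posterior-averaged likelihood
# \int p(x | theta) p(theta | train) dtheta, approximated by sampling.

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

train = np.array([1.2])          # a single training example ("one shot")
sigma = 1.0                      # known noise std, for simplicity

# ML: delta-function posterior centered at the sample mean.
theta_ml = train.mean()
p_ml = lambda x: gauss(x, theta_ml, sigma)

# Bayesian: Normal prior on the mean (assumed prior, playing the role of
# information from other categories); conjugate update gives a Normal
# posterior, and we average the likelihood over posterior samples.
mu0, tau0 = 0.0, 2.0             # assumed prior mean and std
n = len(train)
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mu = post_var * (mu0 / tau0**2 + train.sum() / sigma**2)
thetas = rng.normal(post_mu, np.sqrt(post_var), size=20000)

def p_bayes(x):
    return gauss(x, thetas, sigma).mean()

# The Bayesian predictive is broader than the ML plug-in: it does not
# over-commit to the single training point.
print(p_ml(1.2), p_bayes(1.2))
```

As the section argues, the delta-function (ML) density is over-confident at the lone training point, while the averaged density spreads probability mass according to the remaining parameter uncertainty.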
2.1.2 Object representation: The constellation model

Now we need to examine the details of the model $p(\mathcal{X}, \mathcal{A} \mid \theta)$. We represent object categories with a constellation model [6, 10]:

  $p(\mathcal{X}, \mathcal{A} \mid \theta) = \sum_{\mathbf{h} \in H} p(\mathcal{X}, \mathcal{A}, \mathbf{h} \mid \theta) = \sum_{\mathbf{h} \in H} p(\mathcal{A} \mid \mathbf{h}, \theta^{A})\, p(\mathcal{X} \mid \mathbf{h}, \theta^{X})$   (4)

where $\theta$ is a vector of model parameters. Since our model only has $P$ (typically 3-7) parts but there are $N$ (up to 100) features in the image, we introduce an indexing variable $\mathbf{h}$, which we call a hypothesis. $\mathbf{h}$ is a vector of length $P$, where each entry lies between $0$ and $N$ and allocates a particular feature to a model part. The set of all hypotheses $H$ consists of all valid allocations of features to the parts; consequently the total number of hypotheses in an image grows as $O(N^P)$.

The model encompasses the important properties of an object, shape and appearance, both in a probabilistic way. This allows the model to represent both geometrically constrained objects (where the shape density would have a small covariance, e.g. a face) and objects with distinctive appearance but lacking geometric form (the appearance densities would be tight, but the shape density would now be looser, e.g. an animal principally defined by its texture, such as a zebra). Note that in the model the following assumptions are made: shape is independent of appearance; for shape, the joint covariance of the parts' positions is modeled, whilst for appearance each part is modeled independently. In the experiments reported here we use a slightly simplified version of the model presented in [10], obtained by removing the terms involving occlusion and the statistics of the feature finder, since these are relatively unimportant when we only have a few images to train from.

Appearance. Each feature's appearance is represented as a point in some appearance space, defined below. Each part $p$ has a Gaussian density within this space, with mean and precision parameters $\{\mu^{A}_{p}, \Gamma^{A}_{p}\}$, which is independent of the other parts' densities. Each feature selected by the hypothesis is evaluated under the appropriate part density, so the distribution becomes $p(\mathcal{A} \mid \mathbf{h}, \theta^{A}) = \prod_{p} \mathcal{G}(\mathcal{A}(h_p) \mid \mu^{A}_{p}, \Gamma^{A}_{p})$, where $\mathcal{G}$ is the Gaussian distribution. The background model has the same form, with its own parameters.

Shape. The locations of the features selected by a hypothesis are modeled jointly by a Gaussian density. The locations of the background features take a uniform form, because we assume the features are spread uniformly over the image (which has area $\alpha$) and are independent of the foreground locations.

2.1.3 Model distribution

Let us consider a mixture model of constellation models with $\Omega$ components. Each component $\omega$ has a mixing coefficient $\pi_\omega$; a mean of shape and appearance $\mu^{X}_{\omega}, \mu^{A}_{\omega}$; and a precision matrix of shape and appearance $\Gamma^{X}_{\omega}, \Gamma^{A}_{\omega}$, where the $X$ and $A$ superscripts denote shape and appearance terms respectively. Collecting all mixture components and their corresponding parameters together, we obtain an overall parameter vector $\theta = \{\boldsymbol{\pi}, \mu^{X}, \Gamma^{X}, \mu^{A}, \Gamma^{A}\}$. Assuming we have now learnt the model distribution $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t)$ from a set of training data $\mathcal{X}_t$ and $\mathcal{A}_t$, we define the model distribution in the following way:

  $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t) = p(\boldsymbol{\pi}) \prod_{\omega=1}^{\Omega} p(\mu^{X}_{\omega} \mid \Gamma^{X}_{\omega})\, p(\Gamma^{X}_{\omega})\, p(\mu^{A}_{\omega} \mid \Gamma^{A}_{\omega})\, p(\Gamma^{A}_{\omega})$   (5)

where the mixing component is a symmetric Dirichlet, $p(\boldsymbol{\pi}) = \mathrm{Dir}(\lambda_\omega \mathbf{1}_\Omega)$; the distribution over the shape precisions is a Wishart, $p(\Gamma^{X}_{\omega}) = \mathcal{W}(\Gamma^{X}_{\omega} \mid a^{X}_{\omega}, B^{X}_{\omega})$; and the distribution over the shape mean conditioned on the precision matrix is Normal, $p(\mu^{X}_{\omega} \mid \Gamma^{X}_{\omega}) = \mathcal{G}(\mu^{X}_{\omega} \mid m^{X}_{\omega}, \beta^{X}_{\omega} \Gamma^{X}_{\omega})$. Together the shape distribution is a Normal-Wishart density [14, 15]. Note that $\{\lambda_\omega, m_\omega, \beta_\omega, a_\omega, B_\omega\}$ are hyperparameters defining their corresponding distributions of model parameters. Identical expressions apply to the appearance component in Eq. 5.

2.1.4 Bayesian decision

Recall that for the query image we wish to calculate the ratio of $p(\mathrm{Object} \mid \mathcal{X}, \mathcal{A}, \mathcal{X}_t, \mathcal{A}_t)$ and $p(\mathrm{No\ Object} \mid \mathcal{X}, \mathcal{A}, \mathcal{X}_t, \mathcal{A}_t)$. It is reasonable to assume a fixed value for all model parameters when the object is not present, hence the latter term may be calculated once for all. For the former term, we use Bayes's rule to obtain the likelihood expression $p(\mathcal{X}, \mathcal{A} \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})$, which expands to $\int p(\mathcal{X}, \mathcal{A} \mid \theta)\, p(\theta \mid \mathcal{X}_t, \mathcal{A}_t, \mathrm{Object})\, d\theta$. Since the likelihood $p(\mathcal{X}, \mathcal{A} \mid \theta)$ contains Gaussian densities and the parameter posterior is their conjugate Normal-Wishart, the integral has a closed-form solution. If the ratio of posteriors in Eq. 3, calculated using the likelihood expression above, exceeds a pre-defined threshold, then the image is assumed to contain an occurrence of the learnt object category.
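The closed-form recognition likelihood has a simple one-dimensional analogue: with a Normal-Gamma posterior (the 1-D counterpart of the Normal-Wishart in Eq. 5) over the mean and precision, the posterior predictive of a new observation is a Student's T. The sketch below uses hyperparameter names loosely modeled on Eq. 5 and is an illustration, not the paper's code.

```python
import math

# 1-D analogue of the Normal-Wishart predictive used at recognition time:
# given a Normal-Gamma posterior (m, beta, a, b) over (mean, precision),
# the predictive density of a new observation x is a Student's T with
# 2a degrees of freedom, location m, and scale sqrt(b(beta+1)/(a*beta)).

def student_t_pdf(x, df, loc, scale):
    # Student's T density with df degrees of freedom, given location/scale.
    c = math.gamma((df + 1) / 2) / (
        math.gamma(df / 2) * math.sqrt(df * math.pi) * scale)
    return c * (1 + ((x - loc) / scale) ** 2 / df) ** (-(df + 1) / 2)

def predictive(x, m, beta, a, b):
    df = 2 * a
    scale = math.sqrt(b * (beta + 1) / (a * beta))
    return student_t_pdf(x, df, m, scale)

# Example: evaluate the predictive at its center for unit hyperparameters.
print(predictive(0.0, m=0.0, beta=1.0, a=1.0, b=1.0))
```

The heavy tails of the T density are exactly the effect the Bayesian treatment buys over the ML plug-in Gaussian: with few training examples the predictive stays cautious far from the posterior mean.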
2.2. Learning

The process of learning an object category is unsupervised [6, 10]. The algorithm is presented with a number of training images labeled as "foreground images". It assumes there is an instance of the object category to be learnt in each image. But no other information, e.g. location, size, shape, appearance, etc., is provided. In order to estimate a posterior distribution $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t)$ of the model parameters, given a set of training data as well as some prior information, we formulate this learning problem as variational Bayesian expectation maximization ("VBEM"), applied to a multi-dimensional Gaussian mixture model. We first introduce the basic concept and assumptions of the variational method. Then we illustrate in more detail how such learning is applied to our model.

2.2.1 Variational methods

Writing the marginal likelihood of the training data as an integral over the model parameters and hypotheses, and applying Jensen's inequality to give us a lower bound on the integral, we get

  $\log p(\mathcal{X}_t, \mathcal{A}_t) \geq \mathcal{F}[q]$   (6)

where

  $\mathcal{F}[q] = \int \sum_{\mathbf{h}} q(\theta, \mathbf{h}) \log \dfrac{p(\mathcal{X}_t, \mathcal{A}_t, \mathbf{h}, \theta)}{q(\theta, \mathbf{h})}\, d\theta$   (7)

and the expectation is taken w.r.t. the approximating density $q(\theta, \mathbf{h})$ [15]. The above equation can be further written as

  $\mathcal{F}[q] = \log p(\mathcal{X}_t, \mathcal{A}_t) - \mathrm{KL}\left[\, q(\theta, \mathbf{h}) \,\|\, p(\theta, \mathbf{h} \mid \mathcal{X}_t, \mathcal{A}_t) \,\right]$   (8)

Variational Bayes makes the assumption that $q(\theta, \mathbf{h})$ is a probability density function that can be factored into $q(\theta)\, q(\mathbf{h})$. We then iteratively optimize $q(\theta)$ and $q(\mathbf{h})$ using expectation maximization (EM) to maximize the value of the lower bound to the integral (see [15, 16]). If we consider the "true" p.d.f. $p(\theta, \mathbf{h} \mid \mathcal{X}_t, \mathcal{A}_t)$, by using the above method we are effectively decreasing the Kullback-Leibler distance between $q$ and the true p.d.f.

The likelihood of the training data, given a hypothesis and the parameters, factors into shape and appearance terms,

  $p(\mathcal{X}_t, \mathcal{A}_t, \mathbf{h} \mid \theta) = p(\mathcal{A}_t \mid \mathbf{h}, \theta^{A})\, p(\mathcal{X}_t \mid \mathbf{h}, \theta^{X})\, p(\mathbf{h})$   (9)

and both the terms involving $\theta$ above have a Normal form. The prior on the model parameters has the same form as the model distribution in Eq. 5,

  $p(\theta) = p(\boldsymbol{\pi}) \prod_{\omega=1}^{\Omega} p(\mu^{X}_{\omega} \mid \Gamma^{X}_{\omega})\, p(\Gamma^{X}_{\omega})\, p(\mu^{A}_{\omega} \mid \Gamma^{A}_{\omega})\, p(\Gamma^{A}_{\omega})$   (10)

where the mixing prior is a symmetric Dirichlet and the shape prior is a Normal-Wishart distribution with hyperparameters $\{\lambda_0, m_0, \beta_0, a_0, B_0\}$. Identical expressions apply to the appearance component of Eq. 10.

In the E-step of VBEM, the indicator posterior $q(\mathbf{h})$ is updated according to the mean-field rule

  $q(\mathbf{h}) \propto \exp \left\langle \log p(\mathcal{X}_t, \mathcal{A}_t, \mathbf{h}, \theta) \right\rangle_{q(\theta)}$   (11)

This update assigns, for every training image, a posterior probability to each mixture component and hypothesis.
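The VBEM alternation of Sec. 2.2.1 can be sketched in miniature: factorize $q(\theta, \mathbf{h}) = q(\theta)\, q(\mathbf{h})$ and update each factor in turn. The code below is a simplified stand-in, not the paper's model: $\theta$ is reduced to the means of two 1-D Gaussian components with known precision, so $q(\mu_k)$ stays Normal under a conjugate update; all names and constants are ours.

```python
import numpy as np

# Minimal VBEM sketch: alternate q(h) (responsibilities) and q(theta)
# (Normal posteriors over component means) for a two-component 1-D
# Gaussian mixture with known precision.

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
lam = 1.0                      # known component precision (assumed)
m = np.array([-1.0, 1.0])      # q(mu_k): posterior means
beta = np.array([0.1, 0.1])    # q(mu_k): posterior precisions

for _ in range(50):
    # E-step: q(h). The expected log-likelihood under q(mu_k) penalizes
    # the residual uncertainty of each mean via the 1/beta_k term.
    log_r = -0.5 * lam * ((x[:, None] - m[None, :]) ** 2 + 1.0 / beta[None, :])
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: q(mu_k). Conjugate Normal update with a broad zero-mean
    # Normal prior (precision b0), standing in for the learnt prior.
    b0 = 0.01
    Nk = r.sum(axis=0)
    beta = b0 + lam * Nk
    m = lam * (r * x[:, None]).sum(axis=0) / beta

print(np.sort(m))   # component means recovered near -2 and 2
```

Each iteration provably increases the lower bound of Eq. 6 for this factorized family, which is the same guarantee the full model relies on.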
Writing the normalized indicator posterior of the $n$th training image as

  $\gamma_{n\omega\mathbf{h}} = q(\omega, \mathbf{h} \mid \text{image } n)$   (19)

this is the probability that component $\omega$ is responsible for hypothesis $\mathbf{h}$ of the $n$th training image.

The M-Step. In the M-step, $q(\theta)$ is updated according to

  $q(\theta) \propto p(\theta)\, \exp \left\langle \log p(\mathcal{X}_t, \mathcal{A}_t, \mathbf{h} \mid \theta) \right\rangle_{q(\mathbf{h})}$   (20)

which, given the conjugate priors of Eq. 10, again takes the Normal-Wishart form. The equations are exactly the same for the appearance components. We define the following variables, writing $\mathbf{x}_{n\mathbf{h}}$ for the descriptor selected by hypothesis $\mathbf{h}$ in image $n$:

  $\bar{N}_{\omega} = \sum_{n} \sum_{\mathbf{h}} \gamma_{n\omega\mathbf{h}}$   (23)

  $\bar{\mu}_{\omega} = \dfrac{1}{\bar{N}_{\omega}} \sum_{n} \sum_{\mathbf{h}} \gamma_{n\omega\mathbf{h}}\, \mathbf{x}_{n\mathbf{h}}$   (24)

  $\bar{S}_{\omega} = \dfrac{1}{\bar{N}_{\omega}} \sum_{n} \sum_{\mathbf{h}} \gamma_{n\omega\mathbf{h}}\, (\mathbf{x}_{n\mathbf{h}} - \bar{\mu}_{\omega})(\mathbf{x}_{n\mathbf{h}} - \bar{\mu}_{\omega})^{T}$   (25)

  $\lambda_{\omega} = \lambda_0 + \bar{N}_{\omega}$   (26)

For the means, we have a Normal density with updated hyperparameters

  $m_{\omega} = \dfrac{\beta_0 m_0 + \bar{N}_{\omega} \bar{\mu}_{\omega}}{\beta_0 + \bar{N}_{\omega}}$   (27)

  $\beta_{\omega} = \beta_0 + \bar{N}_{\omega}$   (28)

For the noise precision matrix we have a Wishart density with updated hyperparameters

  $a_{\omega} = a_0 + \bar{N}_{\omega}, \qquad B_{\omega} = B_0 + \bar{N}_{\omega} \bar{S}_{\omega} + \dfrac{\beta_0 \bar{N}_{\omega}}{\beta_0 + \bar{N}_{\omega}}\, (\bar{\mu}_{\omega} - m_0)(\bar{\mu}_{\omega} - m_0)^{T}$   (29)

3. Implementation

3.1. Feature detection and representation

We use the same features as in [10]. They are found using the detector of Kadir and Brady [17]. This method finds regions that are salient over both location and scale. Grayscale images are used as the input. The most salient regions are clustered over location and scale to give a reasonable number of features per image, each with an associated scale. The coordinates of the center of each feature give us $\mathcal{X}$. Figure 2 illustrates this on two images from the motorbike and airplane datasets. Once the regions are identified, they are cropped from the image and rescaled to the size of a small pixel patch. We then reduce the dimensionality of each patch by using PCA. A fixed PCA basis, pre-calculated from the background datasets, is used for this task, and the first few principal components of each patch are retained. The principal components from all patches and images form $\mathcal{A}$.

[Figure 2: Output of the feature detector on two example images.]

3.2. Learning

The task of learning is to estimate the distribution $p(\theta \mid \mathcal{X}_t, \mathcal{A}_t)$ of $\theta$. VBEM estimates the hyperparameters $\{\lambda, m, \beta, a, B\}$ that define the distributions of the parameters $\theta$. The goal is to find the distribution of $\theta$ that best explains the data $\mathcal{X}_t, \mathcal{A}_t$ from all the training images.

[Figure 3: Prior distribution for the shape mean (top) and appearance mean (bottom) for (a) faces, (b) motorbikes, (c) spotted cats and (d) airplanes. Each prior's hyperparameters are estimated from models learnt with maximum likelihood methods, using "the other" datasets [10]. For example, the face prior is obtained from models of motorbikes, spotted cats and airplanes. Only the first three PCA dimensions of the appearance priors are displayed. All four parts of the appearance begin with the same prior distribution for each PCA dimension.]

One critical issue is the choice of priors for the Dirichlet and Normal-Wishart distributions. In this paper, learning is performed using a single mixture component, so $\lambda$ is set to $1$, since $\pi$ will always be $1$. Ideally, the values for the shape and appearance priors should reflect object models in the real world. In other words, if we had already learnt a sufficient number of classes of objects (e.g. hundreds or thousands), we would have a pretty good idea of the average shape (appearance) mean and variances given a new object category. In reality we do not have the luxury of such a number of object classes. We use four classes of object
models learnt in an ML manner from [10] to form our priors. They are: spotted cats, motorbikes, faces and airplanes. Since we wish to learn the same four datasets with our algorithm, we use a "leave one out" strategy. For example, when learning motorbikes we obtain priors by averaging the learnt model parameters from the other three categories (i.e. spotted cats, faces and airplanes), hence avoiding the incorporation of an existing motorbike model. The hyperparameters of the prior are then estimated from the parameters of the existing category models. Figure 3 shows the prior model for each category.

Initial conditions are chosen in the following way. Shape and appearance means are set to the means of the training data itself. Covariances are chosen randomly within a sensible range. Learning is halted when the parameter change per iteration falls below a certain threshold, or when a maximum number of iterations is exceeded; in general, convergence occurs well within this limit. Repeated runs with the same data but different random initializations consistently give virtually indistinguishable classification results. Since the model is a generative one, the background images are not used in learning except for one instance: the appearance model has a distribution in appearance space modeling background features. Estimating this from foreground data proved inaccurate, so these parameters are estimated from a set of background images and not updated within the VBEM iteration. Learning a class takes roughly a few seconds on a modern machine when the number of training images is small. The algorithm is implemented in Matlab. It is also worth mentioning that the current algorithm does not utilize any efficient search method, unlike [10]. It has been shown that increasing the number of parts in a constellation model results in greater recognition power, provided enough training examples are given [10]. Were efficient search techniques used, more parts could be learnt, since the VBEM update equations require the same amount of computation as the traditional ML ones. However, all our experiments currently use 4-part models for both the current algorithm and ML.

3.3. Experimental Setup

Each experiment is carried out in the following way. Each dataset is randomly split into two disjoint sets of equal size. Training images are drawn randomly from the first. A fixed set of images is selected from the second, which forms the test set. We then learn models using both the Bayesian and ML approaches, and test them on the test set to obtain an estimate of performance. With a single training image, ML fails to converge, so we only show results for the Bayesian One-Shot algorithm in this case.

When evaluating the models, the decision is a simple object present/absent one. All performance values are quoted as equal error rates from the receiver-operating characteristic (ROC) (i.e. p(True positive) = 1 - p(False alarm)). The ROC curve is obtained by testing the model on foreground test images and background images. For example, an equal error rate of $x\%$ means that $x\%$ of the foreground images are correctly classified, while $(100 - x)\%$ of the background images are incorrectly classified (i.e. false alarms). A limited amount of preprocessing is performed on some of the datasets. For the motorbikes and airplanes, some images are flipped to ensure all objects are facing the same way. In all the experiments the same settings are used for the number of parts in the model (four), the number of PCA dimensions for each part appearance, and the average number of interest-point detections per image. It is also important to point out that, except for the different priors obtained as described above, all parameters remain the same for learning different categories of objects.

4. Results

Our experiments demonstrate the benefit of using prior information in learning new object categories. Figures 4-7 show models learnt by the Bayesian One-Shot algorithm on the four datasets. It is important to notice that the "priors" alone are not sufficient for object categorization. But by incorporating this general knowledge into the training data, the algorithm is capable of learning a sensible model with even a single training example. For instance, in the face model we see that the 4-part model has captured the essence of a face (e.g. eyes and nose); in this case it achieves a high recognition rate given only 1 training example. Table 1 compares our algorithm with some published object categorization methods. Note that our algorithm has a significantly faster learning speed due to the much smaller number of training examples.

Table 1: Comparison with published object categorization methods.

  Algorithm           Training number   Learning speed   Categories                                    Remarks
  Bayesian One-Shot   1 to 5            min              faces, motorbikes, spotted cats, airplanes    4-part model, unsupervised
  [6] [10]            200 to 400        hours            faces, motorbikes, spotted cats,              6-part model, unsupervised
                                                         airplanes, cars
  [7]                                   weeks            faces                                         aligned manually
  [5]                                   days             faces, cars                                   aligned manually
  [2]                                   days             faces                                         aligned manually
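The equal-error-rate protocol of Sec. 3.3 can be sketched as follows. The scores below are synthetic stand-ins for the Eq. 3 decision ratios; the function and variable names are ours.

```python
import numpy as np

# Sweep a threshold over decision scores on foreground and background
# test images, and read off the equal error rate: the operating point
# where p(True positive) = 1 - p(False alarm).

def equal_error_rate(fg_scores, bg_scores):
    thresholds = np.sort(np.concatenate([fg_scores, bg_scores]))
    best_gap, best_point = 1.0, None
    for t in thresholds:
        tpr = np.mean(fg_scores >= t)       # detection rate
        far = np.mean(bg_scores >= t)       # false-alarm rate
        gap = abs(tpr - (1.0 - far))
        if gap < best_gap:
            best_gap, best_point = gap, (tpr, far)
    return best_point

rng = np.random.default_rng(2)
fg = rng.normal(2.0, 1.0, 200)   # foreground images score higher on average
bg = rng.normal(0.0, 1.0, 200)
tpr, far = equal_error_rate(fg, bg)
print(tpr, far)                   # tpr is close to 1 - far at the EER point
```

Quoting the single EER number, as the paper does, summarizes the whole ROC curve by its crossing with the anti-diagonal.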
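The "leave one out" prior construction described in Sec. 3 can be sketched as below. The category names follow the paper; the 2-D part means are made-up stand-ins for the shape parameters learnt by ML in [10], and `loo_prior` is a hypothetical helper, not the paper's code.

```python
import numpy as np

# Hyperparameters for a new category's prior are estimated from model
# parameters learnt (by ML) on the other categories, so no information
# about the target category leaks into its own prior.

ml_models = {
    "motorbikes":   np.array([[0.0, 1.0], [2.0, 1.5]]),
    "spotted_cats": np.array([[0.5, 0.8], [1.8, 1.2]]),
    "airplanes":    np.array([[0.2, 1.1], [2.2, 1.4]]),
    "faces":        np.array([[0.1, 0.9], [1.9, 1.3]]),
}

def loo_prior(target, models):
    # Average the learnt part means of all categories except the target.
    others = [v for k, v in models.items() if k != target]
    stacked = np.stack(others)            # (n_categories, n_parts, dim)
    m0 = stacked.mean(axis=0)             # prior mean of the part means
    s0 = stacked.var(axis=0)              # spread, feeding the precision priors
    return m0, s0

m0, s0 = loo_prior("motorbikes", ml_models)
print(m0.shape)   # one prior mean per part
```

The spread across the remaining categories plays the role the paper assigns to the Normal-Wishart hyperparameters: it encodes how variable "object-like" part statistics are before any target-category data arrive.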
[Figures 4 and 5: Summaries of two of the category models, in the same format as Figures 6 and 7. (a) Performance comparison (equal error rates) obtained by repeated runs with different randomly drawn training and testing images. Error bars show one standard deviation from the mean performance. This result is compared with the maximum-likelihood (ML) method (green). Note that ML cannot learn the degenerate case of a single training image. (b) Sample ROC curves for the algorithm (red) compared with the ML algorithm (green). The curves shown here use typical models drawn from the repeated runs summarized in (a). Details are shown in (c)-(f). (c) A typical model learnt with one training example. The left panel shows the shape component of the model. The four crosses and ellipses indicate the mean and variance in position of each part. The covariance terms are not shown. The top right panel shows the detected feature patches in the training image closest to the mean of the appearance densities for each of the four parts. The bottom right panel shows the mean appearance distributions for the first PCA dimensions. Each color indicates one of the four parts. Note that the shape and appearance distributions are much more "model specific" compared to the "general" prior model in Fig. 3. (e) Some sample foreground test images for the model learnt in (c), with a mix of correct and incorrect classifications. The pink dots are features found on each image and the colored circles indicate the best hypothesis in the image. The size of the circles indicates the score of the hypothesis (the bigger the better). (d) and (f) are similar to (c) and (e), but the model is learnt from five training images.]

Figure 6: Summary of spotted cat model.
[Figure 7: Summary of airplane model, in the same format as Figure 6: (a) performance comparison, (b) sample ROC curves, (c) and (d) shape and appearance models learnt from one and five training images, and (e) and (f) example test classifications.]

5. Conclusions and future work

We have demonstrated that given a single example (or just a few), we can learn a new object category. As Table 1 shows, this is beyond the capability of existing algorithms. In order to explore this idea we have developed a Bayesian learning framework based on representing object categories with probabilistic models. "General" information coming from previously learnt categories is represented with a suitable prior probability density function on the parameters of such models. Our experiments, conducted on realistic images of four categories, are encouraging in that they show that very few (1 to 5) training examples produce models that are already able to discriminate images containing the desired objects from images not containing them with low error rates.

A number of issues are still unexplored. First and foremost, more comprehensive experiments need to be carried out on a larger number of categories, in order to understand how prior knowledge improves with the number of known categories, and how categorical similarity affects the process. Second, in order to make our experiments practical we have simplified the probabilistic models that are used for representing objects. For example, a probabilistic model for occlusion is not implemented in our experiments [6, 9, 10]. Third, it would be highly valuable for practical applications (e.g. a vehicle roving in an unknown environment) to develop an incremental version of our algorithm, where each training example would incrementally update the probability density function defined on the parameters of each object category [18]. In addition, the minimal training set and learning time that appear to be required by our algorithm make it possible to conceive of visual learning applications where real-time training and user interaction are important.

Acknowledgments

We would like to thank David MacKay, Brian Ripley and Yaser Abu-Mostafa for their most useful discussions.

References

[1] I. Biederman, "Recognition-by-Components: A Theory of Human Image Understanding", Psychological Review, vol. 94, pp. 115-147, 1987.
[2] H.A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection", IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23-38, Jan. 1998.
[3] K. Sung and T. Poggio, "Example-based Learning for View-based Human Face Detection", IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39-51, Jan. 1998.
[4] Y. Amit and D. Geman, "A computational model for visual selection", Neural Computation, vol. 11, no. 7, pp. 1691-1715, 1999.
[5] H. Schneiderman and T. Kanade, "A Statistical Approach to 3D Object Detection Applied to Faces and Cars", Proc. CVPR, pp. 746-751, 2000.
[6] M. Weber, M. Welling and P. Perona, "Unsupervised learning of models for recognition", Proc. 6th ECCV, vol. 2, pp. 101-108, 2000.
[7] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", Proc. CVPR, vol. 1, pp. 511-518, 2001.
[8] D.C. Knill and W. Richards (eds.), Perception as Bayesian Inference, Cambridge University Press, 1996.
[9] M.C. Burl, M. Weber, and P. Perona, "A probabilistic approach to object recognition using local photometry and global geometry", Proc. ECCV, pp. 628-641, 1998.
[10] R. Fergus, P. Perona and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning", Proc. CVPR, vol. 2, pp. 264-271, 2003.
[11] S. Waterhouse, D. MacKay, and T. Robinson, "Bayesian methods for mixtures of experts", Proc. NIPS, pp. 351-357, 1995.
[12] D. MacKay, "Ensemble learning and evidence maximization", Proc. NIPS, 1995.
[13] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola and L.K. Saul, "An introduction to variational methods for graphical models", Machine Learning, vol. 37, pp. 183-233, 1999.
[14] H. Attias, "Inferring parameters and structure of latent variable models by variational Bayes", Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21-30, 1999.
[15] W.D. Penny, "Variational Bayes for d-dimensional Gaussian mixture models", Tech. Rep., University College London, 2001.
[16] T.P. Minka, "Using lower bounds to approximate integrals", Tech. Rep., Carnegie Mellon University, 2001.
[17] T. Kadir and M. Brady, "Scale, saliency and image description", International Journal of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001.
[18] R.M. Neal and G.E. Hinton, "A view of the EM algorithm that justifies incremental, sparse and other variants", in Learning in Graphical Models, M.I. Jordan (ed.), pp. 355-368, Kluwer Academic Press, Norwell, 1998.