Abstract

The process of learning good features for machine learning applications can be very computationally expensive and may prove difficult in cases where little data is available. A prototypical example of this is the one-shot learning setting, in which we must correctly make predictions given only a single example of each new class. In this paper, we explore a method for learning siamese neural networks which employ a unique structure to naturally rank similarity between inputs. Once a network has been tuned, we can then capitalize on powerful discriminative features to generalize the predictive power of the network not just to new data, but to entirely new classes from unknown distributions. Using a convolutional architecture, we are able to achieve strong results which exceed those of other deep learning models, with near state-of-the-art performance on one-shot classification tasks.

Figure 1. Example of a 20-way one-shot classification task using the Omniglot dataset. The lone test image is shown above the grid of 20 images representing the possible unseen classes that we can choose for the test image. These 20 images are our only known examples of each of those classes.
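The 20-way one-shot task of Figure 1 reduces to a nearest-neighbor decision in similarity space: score the test image against the single support example of each of the 20 classes and pick the best. A minimal sketch, where `similarity` is a hypothetical scoring function (here a toy negative-L1 stand-in, not the trained network):

```python
import numpy as np

def one_shot_predict(similarity, test_image, support_images):
    """Pick the support class whose single example is most similar
    to the test image (n-way one-shot classification)."""
    scores = [similarity(test_image, s) for s in support_images]
    return int(np.argmax(scores))

# Toy similarity for illustration: negative L1 distance between images.
toy_similarity = lambda a, b: -np.abs(a - b).sum()

# One example per class: class k is a constant-k image.
support = [np.full((3, 3), k, dtype=float) for k in range(20)]
test = np.full((3, 3), 7.2)
print(one_shot_predict(toy_similarity, test, support))  # class 7 is closest
```

Swapping `toy_similarity` for the output of a trained verification network yields the evaluation protocol used throughout the paper.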
Figure 4. Best convolutional architecture selected for the verification task. The siamese twin is not depicted, but joins immediately after the 4096-unit fully-connected layer, where the L1 component-wise distance between vectors is computed.
only those output units which were the result of complete overlap between each convolutional filter and the input feature maps.

The units in the final convolutional layer are flattened into a single vector. This convolutional layer is followed by a fully-connected layer, and then one more layer computing the induced distance metric between each siamese twin, which is given to a single sigmoidal output unit. More precisely, the prediction vector is given as p = σ(Σ_j α_j |h_{1,L-1}^(j) − h_{2,L-1}^(j)|), where σ is the sigmoidal activation function. This final layer induces a metric on the learned feature space of the (L − 1)th hidden layer and scores the similarity between the two feature vectors. The α_j are additional parameters that are learned by the model during training, weighting the importance of the component-wise distance. This defines a final, Lth, fully-connected layer for the network which joins the two siamese twins.

We depict one example above (Figure 4), which shows the largest version of our model that we considered. This network also gave the best result of any network on the verification task.

3.2. Learning

Loss function. Let M represent the minibatch size, where i indexes the ith minibatch. Now let y(x_1^(i), x_2^(i)) be a length-M vector which contains the labels for the minibatch, where we assume y(x_1^(i), x_2^(i)) = 1 whenever x_1 and x_2 are from the same character class and y(x_1^(i), x_2^(i)) = 0 otherwise. We impose a regularized cross-entropy objective on our binary classifier of the following form:

L(x_1^(i), x_2^(i)) = y(x_1^(i), x_2^(i)) log p(x_1^(i), x_2^(i)) + (1 − y(x_1^(i), x_2^(i))) log (1 − p(x_1^(i), x_2^(i))) + λ^T |w|^2

Optimization. This objective is combined with the standard backpropagation algorithm, where the gradient is additive across the twin networks due to the tied weights. We fix a minibatch size of 128, with learning rate η_j, momentum µ_j, and L2 regularization weights λ_j defined layer-wise, so that our update rule at epoch T is as follows:

w_kj^(T)(x_1^(i), x_2^(i)) = w_kj^(T) + Δw_kj^(T)(x_1^(i), x_2^(i)) + 2λ_j |w_kj|

Δw_kj^(T)(x_1^(i), x_2^(i)) = −η_j ∇w_kj^(T) + µ_j Δw_kj^(T−1)

where ∇w_kj is the partial derivative with respect to the weight between the jth neuron in some layer and the kth neuron in the successive layer.

Weight initialization. We initialized all network weights in the convolutional layers from a normal distribution with zero mean and a standard deviation of 10^−2. Biases were also initialized from a normal distribution, but with mean 0.5 and standard deviation 10^−2. In the fully-connected layers, the biases were initialized in the same way as in the convolutional layers, but the weights were drawn from a much wider normal distribution with zero mean and standard deviation 2 × 10^−1.

Learning schedule. Although we allowed for a different learning rate for each layer, learning rates were decayed uniformly across the network by 1 percent per epoch, so that η_j^(T) = 0.99 η_j^(T−1). We found that by annealing the learning rate, the network was able to converge to local minima more easily without getting stuck in the error surface. We fixed momentum to start at 0.5 in every layer, increasing linearly each epoch until reaching the value µ_j, the individual momentum term for the jth layer.

We trained each network for a maximum of 200 epochs, but monitored one-shot validation error on a set of 320 one-shot learning tasks generated randomly from the alphabets and drawers in the validation set. When the validation error did not decrease for 20 epochs, we stopped and used the parameters of the model at the best epoch according to the one-shot validation error. If the validation error continued to decrease for the entire learning schedule, we saved the final state of the model generated by this procedure.

Hyperparameter optimization. We used the beta version of Whetlab, a Bayesian optimization framework, to
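One conventional reading of the layer-wise update rule in Section 3.2 (momentum SGD with the L2 penalty folded into the gradient as weight decay) can be sketched as below. The toy objective and hyperparameter values are illustrative only, not the paper's trained settings.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, eta, mu, lam):
    """One layer-wise update: momentum SGD with an L2 penalty.
    The 2*lam*w term enters as a weight-decay gradient."""
    velocity = mu * velocity - eta * (grad + 2.0 * lam * w)
    return w + velocity, velocity

w = np.array([1.0, -2.0])   # toy weight vector for one layer
v = np.zeros_like(w)
eta = 0.1
for epoch in range(3):
    grad = 2.0 * w          # gradient of ||w||^2, a stand-in objective
    w, v = sgd_momentum_step(w, grad, v, eta, mu=0.5, lam=0.0)
    eta *= 0.99             # learning rate decayed 1 percent per epoch
print(w)
```

Under this stand-in objective the weights shrink toward zero each epoch, which is the qualitative behavior the annealed schedule above is meant to stabilize.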
Siamese Neural Networks for One-shot Image Recognition
Figure 5. A sample of random affine distortions generated for a single character in the Omniglot data set.

Figure 6. The Omniglot dataset contains a variety of different images from alphabets across the world.

perform hyperparameter selection. For learning schedule and regularization hyperparameters, we set the layer-wise learning rate η_j ∈ [10^−4, 10^−1], layer-wise momentum µ_j ∈ [0, 1], and layer-wise L2 regularization penalty λ_j ∈ [0, 0.1]. For network hyperparameters, we let the size of convolutional filters vary from 3x3 to 20x20, while the number of convolutional filters in each layer varied from 16 to 256 using multiples of 16. Fully-connected layers ranged from 128 to 4096 units, also in multiples of 16. We set the optimizer to maximize one-shot validation set accuracy. The score assigned to a single Whetlab iteration was the highest value of this metric found during any epoch.

Affine distortions. In addition, we augmented the training set with small affine distortions (Figure 5). For each image pair x_1, x_2, we generated a pair of affine transformations T_1, T_2 to yield x′_1 = T_1(x_1), x′_2 = T_2(x_2), where T_1, T_2 are determined stochastically by a multidimensional uniform distribution. So for an arbitrary transform T, we have T = (θ, ρ_x, ρ_y, s_x, s_y, t_x, t_y), with θ ∈ [−10.0, 10.0], ρ_x, ρ_y ∈ [−0.3, 0.3], s_x, s_y ∈ [0.8, 1.2], and t_x, t_y ∈ [−2, 2]. Each of these components of the transformation is included with probability 0.5.

4. Experiments

We trained our model on a subset of the Omniglot data set, which we first describe. We then provide details with respect to verification and one-shot performance.

4.1. The Omniglot Dataset

The Omniglot data set was collected by Brenden Lake and his collaborators at MIT via Amazon's Mechanical Turk to produce a standard benchmark for learning from few examples in the handwritten character recognition domain (Lake et al., 2011).1 Omniglot contains examples from 50 alphabets, ranging from well-established international languages like Latin and Korean to lesser-known local dialects. It also includes some fictitious character sets such as Aurek-Besh and Klingon (Figure 6).

The number of letters in each alphabet varies considerably, from about 15 to upwards of 40 characters. All characters across these alphabets are produced a single time by each of 20 drawers. Lake split the data into a 40-alphabet background set and a 10-alphabet evaluation set. We preserve these two terms in order to distinguish them from the normal training, validation, and test sets that can be generated from the background set in order to tune models for verification. The background set is used for developing a model by learning hyperparameters and feature mappings. Conversely, the evaluation set is used only to measure the one-shot classification performance.

1 The complete data set can be obtained from Brenden Lake by request (brenden@cs.nyu.edu). Each character in Omniglot is a 105x105 binary-valued image which was drawn by hand on an online canvas. The stroke trajectories were collected alongside the composite images, so it is possible to incorporate temporal and structural information into models trained on Omniglot.

4.2. Verification

To train our verification network, we put together three different data set sizes with 30,000, 90,000, and 150,000 training examples by sampling random same and different pairs. We set aside sixty percent of the total data for training: 30 alphabets out of 50 and 12 drawers out of 20.

We fixed a uniform number of training examples per alphabet so that each alphabet receives equal representation during optimization, although this is not guaranteed for the individual character classes within each alphabet. By adding

Table 1. Accuracy on Omniglot verification task (siamese convolutional neural net)

    Method                       Test
    30k training
      no distortions             90.61
      affine distortions x8      91.90
    90k training
      no distortions             91.54
      affine distortions x8      93.15
    150k training
      no distortions             91.63
      affine distortions x8      93.42
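The stochastic affine distortions described above, T = (θ, ρ_x, ρ_y, s_x, s_y, t_x, t_y) with each component included with probability 0.5, can be sampled as in this sketch. Falling back to an identity value for an excluded component is an assumption made here for illustration.

```python
import numpy as np

def sample_affine_params(rng):
    """Sample one stochastic affine transform; each component is
    included with probability 0.5, otherwise it keeps its identity
    value (0 for rotation/shear/translation, 1 for scale)."""
    ranges = {
        "theta": (-10.0, 10.0, 0.0),  # rotation, degrees
        "rho_x": (-0.3, 0.3, 0.0),    # shear
        "rho_y": (-0.3, 0.3, 0.0),
        "s_x":   (0.8, 1.2, 1.0),     # scale
        "s_y":   (0.8, 1.2, 1.0),
        "t_x":   (-2.0, 2.0, 0.0),    # translation, pixels
        "t_y":   (-2.0, 2.0, 0.0),
    }
    return {name: (rng.uniform(lo, hi) if rng.random() < 0.5 else ident)
            for name, (lo, hi, ident) in ranges.items()}

T = sample_affine_params(np.random.default_rng(0))
print(T)
```

Applying eight such independently sampled transforms per training image yields the "affine distortions x8" rows of Table 1.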
Figure 7. Examples of first-layer convolutional filters learned by the siamese network. Notice filters adapt different roles: some look for very small point-wise features whereas others function like larger-scale edge detectors.

    Method                                    Test
    Humans                                    95.5
    Hierarchical Bayesian Program Learning    95.2
    Affine model                              81.8
    Hierarchical Deep                         65.2
    Deep Boltzmann Machine                    62.0
    Simple Stroke                             35.2
    1-Nearest Neighbor                        21.7
    Siamese Neural Net                        58.3
    Convolutional Siamese Net                 92.0

    Method                                    Test
    1-Nearest Neighbor                        26.5
    Convolutional Siamese Net                 70.3
Fe-Fei, Li, Fergus, Robert, and Perona, Pietro. A bayesian approach to unsupervised one-shot learning of object categories. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 1134–1141. IEEE, 2003.

Fei-Fei, Li, Fergus, Robert, and Perona, Pietro. One-shot learning of object categories. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(4):594–611, 2006.

Hinton, Geoffrey, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Lake, Brenden M, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua B. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, volume 172, 2011.

Lake, Brenden M, Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Concept learning as motor program induction: A large-scale empirical study. In Proceedings of the 34th Annual Conference of the Cognitive Science Society, pp. 659–664, 2012.

Lake, Brenden M, Salakhutdinov, Ruslan R, and Tenenbaum, Josh. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pp. 2526–2534, 2013.

Lake, Brenden M, Lee, Chia-ying, Glass, James R, and Tenenbaum, Joshua B. One-shot learning of generative speech concepts. Cognitive Science Society, 2014.

Lim, Joseph Jaewhan. Transfer learning by borrowing examples for multiclass object detection. Master's thesis, Massachusetts Institute of Technology, 2012.

Maas, Andrew and Kemp, Charles. One-shot learning with bayesian networks. Cognitive Science Society, 2009.

Mnih, Volodymyr. Cudamat: a cuda-based matrix class for python. 2009.

Palatucci, Mark, Pomerleau, Dean, Hinton, Geoffrey E, and Mitchell, Tom M. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pp. 1410–1418, 2009.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Srivastava, Nitish. Improving neural networks with dropout. Master's thesis, University of Toronto, 2013.

Taigman, Yaniv, Yang, Ming, Ranzato, Marc'Aurelio, and Wolf, Lior. Deepface: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1701–1708. IEEE, 2014.

Wu, Di, Zhu, Fan, and Shao, Ling. One shot learning gesture recognition from rgbd images. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 7–12. IEEE, 2012.