methods. These methods usually converge after only one or a few passes through the data in an online fashion and are applicable to very large data sets.

However, most real problems are not linearly separable. The main question is whether there exists a similar stochastic gradient approach for nonlinear kernel svms. One way to tackle this problem is to approximate a typically infinite dimensional kernel with a finite dimensional one (Maji & Berg, 2009; Vedaldi & Zisserman, 2010). However, this method can be applied only to the class of additive kernels, such as the intersection kernel. (Balcan et al., 2004) proposed a method based on gradient descent on randomized projections of the kernel. (Bordes et al., 2005) proposed a method called la-svm, which proposes the set of support vectors and performs stochastic gradient descent to learn their optimal weights. They showed the equivalence of their method to smo and proved convergence to the true qp solution. Even though this algorithm runs much faster than all previous methods, it cannot be applied to data sets as large as those handled by stochastic gradient descent for linear svms. This is because the solution of the kernel method can be represented only as a function of the support vectors, and experimentally the number of support vectors grows linearly with the size of the training set (Bordes et al., 2005). Thus the complexity of this algorithm depends quadratically on the size of the training set.

Another issue with these kernel methods is that the most popular kernels, such as the rbf kernel or the intersection kernel, are often applied to a problem without any justification or intuition about whether it is the right kernel to apply. Real data usually lies on a lower dimensional manifold of the input space, either due to the nature of the input data or due to various preprocessing of the data such as normalization of histograms or subsets of histograms (Dalal & Triggs, 2005). In this case the general intuition about the properties of a certain kernel without any knowledge about the underlying manifold may be very misleading.

In this paper we propose a novel locally linear svm classifier with a smooth decision boundary and bounded curvature. We show how the functions defining the classifier can be approximated using any local coding scheme (Roweis & Saul, 2000; Gemert et al., 2008; Zhou et al., 2009; Gao et al., 2010; Yu et al., 2009; Wang et al., 2010) and show how this model can be learned either by solving the corresponding qp program or in an online fashion by performing stochastic gradient descent with the same convergence guarantees as the standard gradient descent method for linear svm. The method can be seen as a finite kernel method that ties together an efficient discriminative classifier with a generative manifold learning method. Experiments show that this method gets close to state-of-the-art results for challenging classification problems while being significantly faster than any competing algorithm. The complexity grows linearly with the size of the data set, allowing the algorithm to be applied to much larger data sets. We generalise the model to a locally finite dimensional kernel svm classifier using any finite dimensional kernel or finite dimensional kernel approximation (Maji & Berg, 2009; Vedaldi & Zisserman, 2010).

An outline of the paper is as follows. In section 2 we explain local codings for manifold learning. In section 3 we describe the properties of locally linear classifiers, approximate them using local codings, formulate the optimisation problem and propose a qp-based and a stochastic gradient descent method to solve it. In section 4 we compare our classifier to other approaches in terms of performance and speed, and in the last section 5 we conclude by listing some possibilities for future work.

2. Local Codings for Manifold Learning

Many manifold learning methods (Roweis & Saul, 2000; Gemert et al., 2008; Zhou et al., 2009; Gao et al., 2010; Yu et al., 2009; Wang et al., 2010), also called local codings, approximate any point x on the manifold as a linear combination of surrounding anchor points as:

    x \approx \sum_{v \in C} \gamma_v(x)\, v,    (1)

where C is the set of anchor points v and \gamma(x) is the vector of coefficients \gamma_v(x), called local coordinates, constrained by \sum_{v \in C} \gamma_v(x) = 1, guaranteeing invariance to Euclidean transformations of the data. Generally, two types of approaches for the evaluation of the coefficients \gamma(x) have been proposed. (Gemert et al., 2008; Zhou et al., 2009) evaluate these local coordinates based on the distance of x from each anchor point using any distance measure; on the other hand, the methods of (Roweis & Saul, 2000; Gao et al., 2010; Yu et al., 2009; Wang et al., 2010) formulate the problem as the minimization of a reprojection error using various regularization terms, inducing properties such as sparsity or locality. The set of anchor points is either obtained using standard vector quantization methods (Gemert et al., 2008; Zhou et al., 2009) or by minimizing the sum of the reprojection errors over the training set (Yu et al., 2009; Wang et al., 2010).

The most important property (Yu et al., 2009) of the transformation into any local coding is that any Lipschitz function f(x) defined on a lower dimensional manifold can be approximated by a linear combination of the function values f(v) of the set of anchor points v \in C as:

    f(x) \approx \sum_{v \in C} \gamma_v(x)\, f(v)    (2)

within the bounds derived in (Yu et al., 2009). The guarantee of the quality of the approximation holds for any normalised linear coding.
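To make the coding concrete, the following minimal sketch implements a simple distance-based local coding in the spirit of (Gemert et al., 2008) together with the function approximation of equation (2). It is only an illustration under our own assumptions: the Gaussian weighting, the nearest-neighbour truncation and all names (local_coordinates, approximate, sigma, knn) are ours, not the paper's.

```python
import numpy as np

def local_coordinates(x, anchors, sigma=1.0, knn=5):
    """Distance-based local coding (eq. 1): soft-assign x to its knn nearest
    anchor points with Gaussian weights, normalised so the coefficients sum to one."""
    d2 = np.sum((anchors - x) ** 2, axis=1)        # squared distances to all anchor points
    gamma = np.zeros(len(anchors))
    nearest = np.argsort(d2)[:knn]                 # keep only a few non-zero coordinates
    gamma[nearest] = np.exp(-d2[nearest] / (2.0 * sigma ** 2))
    return gamma / gamma.sum()                     # enforces sum_v gamma_v(x) = 1

def approximate(f_values, x, anchors, **coding_kw):
    """Approximate a Lipschitz function via eq. (2): f(x) ~ sum_v gamma_v(x) f(v),
    where f_values[v] stores f(v) for each anchor point v."""
    return local_coordinates(x, anchors, **coding_kw) @ f_values
```

Any other coding with normalised coefficients, such as the reconstruction-based schemes of (Yu et al., 2009; Wang et al., 2010), could be plugged in instead.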
Local codings are unsupervised and fully generative procedures that do not take class labels into account. This implies a few advantages and disadvantages of the method. On one hand, local codings can be used to learn a manifold in semi-supervised classification approaches. In some branches of machine learning, for example in computer vision, obtaining a large amount of labelled data is costly, whilst obtaining any amount of unlabelled data is less so. Furthermore, local codings can be applied to learn manifolds from joint training and test data for transductive problems (Gammerman et al., 1998). On the other hand, for unbalanced data sets unsupervised manifold learning methods, which ignore class labels, may be biased towards the manifold of the dominant class.

3. Locally Linear Classifiers

A standard linear svm binary classifier takes the form:

    H(x) = w^T x + b = \sum_{i=1}^{n} w_i x_i + b,    (3)

where n is the dimensionality of the feature vector x.

In practice the data of a certain class often lies on several disjoint lower dimensional manifolds, and thus linear classifiers are inapplicable. However, all classification methods in general, including non-linear ones, try to learn a decision boundary between noisy instances of classes which is smooth and has bounded curvature; intuitively, a decision surface that is too flexible would tend to overfit the data. In other words, all methods assume that in a sufficiently small region the decision boundary is approximately linear and the data is locally linearly separable. To encode this local linearity, we extend the svm classifier by allowing the weight vector w and the bias b to vary depending on the location of the point x in the feature space as:

    H(x) = w(x)^T x + b(x) = \sum_{i=1}^{n} w_i(x) x_i + b(x).    (6)

Data points x typically lie in a lower dimensional manifold of the feature space. Usually they form several disjoint clusters; e.g. in visual animal recognition each cluster of the data may correspond to a different species, and this cannot be captured by linear classifiers. Smoothness and constrained curvature of the decision boundary imply that the functions w(x) and b(x) are Lipschitz in the feature space x. Thus we can approximate the weight functions w_i(x) and the bias function b(x) using any local coding as:

    H(x) = \sum_{i=1}^{n} \sum_{v \in C} \gamma_v(x) w_i(v) x_i + \sum_{v \in C} \gamma_v(x) b(v)
         = \sum_{v \in C} \gamma_v(x) \left( \sum_{i=1}^{n} w_i(v) x_i + b(v) \right).    (7)
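Equation (7) says that the decision value is just the local coordinates applied to per-anchor linear classifiers, i.e. H(x) = \gamma(x)^T W x + \gamma(x)^T b. Below is a minimal sketch of its evaluation, reusing the hypothetical local_coordinates helper above and storing w(v) as the rows of an m x n matrix W (our own convention, not fixed by the paper):

```python
def ll_svm_decision(x, anchors, W, b, **coding_kw):
    """Locally linear SVM decision value of eq. (7):
    H(x) = gamma(x)^T W x + gamma(x)^T b,
    where row v of W holds w(v) and b[v] holds the bias of anchor point v."""
    gamma = local_coordinates(x, anchors, **coding_kw)   # local coordinates gamma(x)
    return gamma @ (W @ x) + gamma @ b                   # sum_v gamma_v(x) (w(v).x + b(v))
```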
Figure 1. Best viewed in colour. Locally linear svm classifier for the banana function data set. Red and green points correspond to positive and negative samples, black stars correspond to the anchor points and blue lines are the obtained decision boundary. Even though the problem is obviously not linearly separable, locally, in sufficiently small regions, the decision boundary is nearly linear and thus the data can be separated reasonably well using a locally linear classifier.
where H_{W,b}(x_k) = \gamma(x_k)^T W x_k + \gamma(x_k)^T b, S is the set of training samples, x_k the feature vector and y_k \in \{-1, 1\} is the correct label of the k-th sample. We will call the classifier obtained by this optimisation procedure locally linear svm (ll-svm). This formulation is very similar to the standard linear svm formulation over nm dimensions, except that there are several biases. This optimisation problem can be converted to a qp problem with a quadratic cost function subject to linear constraints, similarly to the standard svm, as:

    \arg\min_{W,b} \frac{\lambda}{2} \|W\|^2 + \frac{1}{|S|} \sum_{k \in S} \xi_k    (10)
    s.t. \forall k \in S : \xi_k \ge 0
                           \xi_k \ge 1 - y_k (\gamma(x_k)^T W x_k + \gamma(x_k)^T b).
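For small data sets the qp in (10) can be handed to an off-the-shelf convex solver. The following rough sketch uses cvxpy; it is our illustration only, and the function name, the layout of W as an m x n matrix of per-anchor weights, and the choice of cvxpy are assumptions rather than the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def fit_ll_svm_qp(X, y, Gamma, lam):
    """Directly solve the qp of eq. (10), practical only for small data sets.
    X: |S| x n feature matrix, y: labels in {-1,+1},
    Gamma: |S| x m matrix of local coordinates gamma(x_k), lam: regularisation weight."""
    S, n = X.shape
    m = Gamma.shape[1]
    W = cp.Variable((m, n))                      # row v holds w(v)
    b = cp.Variable(m)                           # per-anchor biases b(v)
    # H(x_k) = gamma(x_k)^T W x_k + gamma(x_k)^T b, evaluated for all samples at once
    H = cp.sum(cp.multiply(Gamma @ W, X), axis=1) + Gamma @ b
    xi = cp.pos(1 - cp.multiply(y, H))           # slacks xi_k = max(0, 1 - y_k H(x_k))
    objective = (lam / 2) * cp.sum_squares(W) + cp.sum(xi) / S
    cp.Problem(cp.Minimize(objective)).solve()
    return W.value, b.value
```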
Solving this qp problem can be rather expensive for large data sets. Even though decomposition methods such as smo (Platt, 1998; Chang & Lin, 2001) or svmlight (Joachims, 1999) can be applied to the dual representation of the problem, it has been experimentally shown that for real applications they are outperformed by stochastic gradient descent methods. We adapt the SVMSGD2 method proposed in (Bordes et al., 2009) to tackle the problem. Each iteration of SVMSGD2 consists of picking a random sample x_t with corresponding label y_t and updating the current solution W and b if the hinge loss cost 1 - y_t(\gamma(x_t)^T W x_t + \gamma(x_t)^T b) is positive, as:

    W_{t+1} = W_t + \frac{1}{\lambda(t + t_0)} y_t (x_t \gamma(x_t)^T)    (11)
    b_{t+1} = b_t + \frac{1}{\lambda(t + t_0)} y_t \gamma(x_t),           (12)

where W_t and b_t are the solution after t iterations, x_t \gamma(x_t)^T is the outer product and \frac{1}{\lambda(t + t_0)} is the optimal learning rate (Shalev-Shwartz et al., 2007) with a heuristically chosen positive constant t_0 (Bordes et al., 2009) that ensures the first iterations do not produce too large steps. Because local codings either force (Roweis & Saul, 2000; Wang et al., 2010) or induce (Gao et al., 2010; Yu et al., 2009) sparsity, the update step is done only for the few columns with non-zero coefficients of \gamma(x).

The regularisation update is done every skip iterations to speed up the process, similarly to (Bordes et al., 2009), as:

    W_{t+1} = W_{t+1} \left(1 - \frac{skip}{t + t_0}\right).    (13)
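Putting the pieces together, a compact sketch of this training loop might look as follows. It reuses the hypothetical local_coordinates helper and the m x n layout of W from the snippets above; the hyperparameter defaults and the function name are our own assumptions, not values from the paper.

```python
def train_ll_svm(X, y, anchors, lam=1e-4, t0=100, skip=16, epochs=1, **coding_kw):
    """Stochastic gradient training of ll-svm in the spirit of eqs. (11)-(13)."""
    m, n = len(anchors), X.shape[1]
    W, b = np.zeros((m, n)), np.zeros(m)
    t, count = 0, skip
    for _ in range(epochs):
        for k in np.random.permutation(len(X)):
            t += 1
            x_t, y_t = X[k], y[k]
            gamma = local_coordinates(x_t, anchors, **coding_kw)
            eta = 1.0 / (lam * (t + t0))                      # learning rate 1 / (lambda (t + t0))
            if y_t * (gamma @ (W @ x_t) + gamma @ b) < 1:     # hinge loss cost is positive
                W += eta * y_t * np.outer(gamma, x_t)         # eq. (11): rank-one update of W
                b += eta * y_t * gamma                        # eq. (12)
            count -= 1
            if count <= 0:                                    # eq. (13): delayed regularisation
                W *= 1.0 - skip / (t + t0)
                count = skip
    return W, b
```

For clarity the sketch updates W densely; exploiting the sparsity of \gamma(x), as described above, would touch only the rows with non-zero coefficients.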
Because the proposed model is equivalent to a sparse mapping into a higher dimensional space, it has the same theoretical guarantees as the standard linear svm, and for a given number of anchor points m it is slower than linear svm only by a constant factor independent of the size of the training set. In case the stochastic gradient method does not converge in one pass and the local coordinates are expensive to evaluate, they can be evaluated once and kept in memory.

This binary classifier can be extended to a multi-class one either by following the standard one vs. all strategy or by using the formulation of (Crammer & Singer, 2002).
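As an illustration of the first option, a one vs. all wrapper around the binary sketch above could look like this (again our own illustrative code, reusing the hypothetical train_ll_svm and ll_svm_decision helpers):

```python
def train_one_vs_all(X, y, anchors, classes, **kw):
    """Train one binary ll-svm per class; predict the class with the largest decision value."""
    models = {c: train_ll_svm(X, np.where(y == c, 1, -1), anchors, **kw) for c in classes}

    def predict(x):
        return max(models, key=lambda c: ll_svm_decision(x, anchors, *models[c]))

    return predict
```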
3.1. Relation and comparison to other models

A conceptually similar local linear classifier has already been proposed by (Zhang et al., 2006). Their knn-
Table 1. A comparison of the performance, training and test times of ll-svm with the state-of-the-art algorithms on the mnist data set. All kernel svm methods (Chang & Lin, 2001; Bordes et al., 2005; Crammer & Singer, 2002; Tsochantaridis et al., 2005; Bordes et al., 2007) used the rbf kernel. Our method achieved performance comparable to the state-of-the-art and can be seen as a good trade-off between the very fast linear svm and the qualitatively best kernel methods. ll-svm was approximately 50-3000 times faster than the various kernel based methods. As the complexity of kernel methods grows more than linearly, we expect a larger relative difference for larger data sets. Running times of mcsvm, svmstruct and la-rank are as reported in (Bordes et al., 2007) and thus only illustrative. N/A means the running times are not available.
Table 2. A comparison of error rates and training times for ll-svm and the state-of-the-art algorithms on the usps and letter data sets. ll-svm was significantly faster than the kernel based methods, but it requires more data to achieve results close to the state-of-the-art. The training times of competing methods are as reported in (Bordes et al., 2007); thus they are not directly comparable, but give a broad idea about the time consumption of the different methods.

                                             usps                       letter
Method                                       error    training time     error     training time
Linear svm (Bordes et al., 2009)             9.57%    0.26 s            41.77%    0.18 s
mcsvm (Crammer & Singer, 2002)               4.24%    60 s              2.42%     1200 s
svmstruct (Tsochantaridis et al., 2005)      4.38%    6300 s            2.40%     24000 s
la-rank (Bordes et al., 2007) (1 pass)       4.25%    85 s              2.80%     940 s
ll-svm (10 passes)                           5.78%    6.2 s             5.32%     4.2 s
Table 3. A comparison of the performance, training and test times for the locally linear svm, the locally additive svm and the state-of-the-art algorithms on the caltech (Fei-Fei et al., 2004) data set. Locally linear svm obtained a similar result to the approximation of the intersection kernel svm designed for bag-of-words models. Locally additive svm achieved performance competitive with the state-of-the-art methods. N/A means the running times are not available.
Boiman, O., Shechtman, E., and Irani, M. In defense of nearest-neighbor based image classification. In CVPR, 2008.

Bordes, A., Ertekin, S., Weston, J., and Bottou, L. Fast kernel classifiers with online and active learning. JMLR, 2005.

Bordes, A., Bottou, L., Gallinari, P., and Weston, J. Solving multiclass support vector machines with LaRank. In ICML, 2007.

Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: Careful quasi-newton stochastic gradient descent. JMLR, 2009.

Bosch, A., Zisserman, A., and Munoz, X. Representing shape with a spatial pyramid kernel. In CIVR, 2007.

Breiman, L. Random forests. In Machine Learning, 2001.

Chang, C. and Lin, C. LIBSVM: A library for support vector machines, 2001.

Cortes, C. and Vapnik, V. Support-vector networks. In Machine Learning, 1995.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2002.

Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In CVPR, 2005.

Farhadi, A., Tabrizi, M. K., Endres, I., and Forsyth, D. A. A latent model of discriminative aspect. In ICCV, 2009.

Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In Workshop on GMBS, 2004.

Freund, Y. and Schapire, R. E. A short introduction to boosting, 1999.

Gammerman, A., Vovk, V., and Vapnik, V. Learning by transduction. In UAI, 1998.

Gao, S., Tsang, I. W.-H., Chia, L.-T., and Zhao, P. Local features are not lonely - laplacian sparse coding for image classification. In CVPR, 2010.

Gemert, J. C. van, Geusebroek, J., Veenman, C. J., and Smeulders, A. W. M. Kernel codebooks for scene categorization. In ECCV, 2008.

Guyon, I., Boser, B., and Vapnik, V. Automatic capacity tuning of very large VC-dimension classifiers. In NIPS, 1993.

Joachims, T. Making large-scale support vector machine learning practical, pp. 169–184. MIT Press, 1999.

Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.

Maji, S. and Berg, A. C. Max-margin additive classifiers for detection. In ICCV, 2009.

Platt, J. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision: Theory and practice, 2006.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

Shechtman, E. and Irani, M. Matching local self-similarities across images and videos. In CVPR, 2007.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 2005.

Vapnik, V. and Lerner, A. Pattern recognition using generalized portrait method. Automation and Remote Control, 1963.

Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. In CVPR, 2010.

Vedaldi, A., Gulshan, V., Varma, M., and Zisserman, A. Multiple kernels for object detection. In ICCV, 2009.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. S., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, 2010.

Yu, K., Zhang, T., and Gong, Y. Nonlinear learning using local coordinate coding. In NIPS, 2009.

Zhang, H., Berg, A. C., Maire, M., and Malik, J. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In CVPR, 2006.

Zhou, X., Cui, N., Li, Z., Liang, F., and Huang, T. S. Hierarchical gaussianization for image classification. In ICCV, 2009.