Read without ads and support Scribd by becoming a Scribd Premium Reader.
 
??, ??, 1–6 (??)
c
?? Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Statistical Learning Theory: a Primer
THEODOROS EVGENIOU AND MASSIMILIANO PONTIL
Center for Biological and Computational Learning, MIT 
Received ??. Revised ??.
Abstract.
In this paper we first overview the main concepts of Statistical Learning Theory, a frameworkin which learning from examples can be studied in a principled way. We then briefly discuss well known aswell emerging learning techniques such as Regularization Networks and Support Vector Machines whichcan be justified in term of the same induction principle.
Keywords:
VC-dimension, Structural Risk Minimization, Regularization Networks, Support Vector Ma-chines
1. Introduction
The goal of this paper is to provide a short in-troduction to Statistical Learning Theory (SLT)which studies problems and techniques of 
super-vised learning 
. For a more detailed review of SLTsee [5]. In supervised learning – or
learning-from-examples
– a machine is trained, instead of pro-grammed, to perform a given task on a number of input-output pairs. According to this paradigm,training means choosing a function which best de-scribes the relation between the inputs and theoutputs. The central question of SLT is how wellthe the chosen function generalizes, or how well itestimates the output for previously unseen inputs.We will consider techniques which lead to solu-tion of the form
(
x
) =
i
=1
c
i
(
x
,
x
i
)
.
(1)where the
x
i
,i
= 1
,...,l
are the input examples,
a certain symmetric positive definite function
This report describes research done within the Centerfor Biological and Computational Learning in the Depart-ment of Brain and Cognitive Sciences and at the Artifi-cial Intelligence Laboratory at the Massachusetts Instituteof Technology. This research is sponsored by grants fromthe National Science Foundation, ONR and Darpa. Addi-tional support is provided by Eastman Kodak Company,Daimler-Chrysler, Siemens, ATR, AT&T, Compaq, HondaR&D Co., Ltd., Merrill-Lynch, NTT and Central ResearchInstitute of Electric Power Industry.
named kernel, and
c
i
a set of parameters to bedetermined form the examples. This function isfound by minimizing functionals of the type
[
] =1
i
=1
(
y
i
,
(
x
i
)) +
λ
2
K
,
where
is a
loss function 
which measures thegoodness of the predicted output
(
x
i
) with re-spect to the given output
y
i
,
2
K
a smoothnessterm which can be thought of as a norm in theReproducing Kernel Hilbert Space defined by thekernel
and
λ
a positive parameter which con-trols the relative weight between the data and thesmoothness term. The choice of the loss func-tion determines different learning techniques, eachleading to a different learning algorithm for com-puting the coefficients
c
i
.The rest of the paper is organized as follows.Section 2 presents the main idea and conceptsin the theory. Section 3 discusses RegularizationNetworks and Support Vector Machines, two im-portant techniques which produce outputs of theform of equation (1).
2. Statistical Learning Theory
We consider two sets of random variables
x
R
d
and
y
R
related by a probabilis-tic relationship. The relationship is probabilis-tic because generally an element of 
does not
 
2
?? 
determine uniquely an element of 
, but rathera probability distribution on
. This can beformalized assuming that an unknown probabil-ity distribution
(
x
,y
) is defined over the set
×
. We are provided with
examples
of thisprobabilistic relationship, that is with a data set
D
{
(
x
i
,y
i
)
×
}
i
=1
called
training set 
, ob-tained by sampling
times the set
×
accord-ing to
(
x
,y
). The “problem of learning” consistsin, given the data set
D
, providing an
estimator 
,that is a function
:
, that can be used,given any value of 
x
, to predict a value
y
. Forexample
could be the set of all possible images,
the set
{−
1
,
1
}
, and
(
x
) an
indicator function 
which specifies whether image
x
contains a certainobject (
y
= 1), or not (
y
=
1) (see for example[12]). Another example is the case where
x
is aset of parameters, such as pose or facial expres-sions,
y
is a motion field relative to a particularreference image of a face, and
(
x
) is a regressionfunction which maps parameters to motion (seefor example [6]).In SLT, the standard way to solve the learn-ing problem consists in defining a
risk functional 
,which measures the average amount of error orrisk associated with an estimator, and then look-ing for the estimator with the lowest risk. If 
(
y,
(
x
)) is the loss function measuring the er-ror we make when we predict
y
by
(
x
), then theaverage error, the so called
expected risk 
, is:
[
]
 
X,Y 
(
y,
(
x
))
(
x
,y
)
d
x
dy
We assume that the expected risk is defined on a“large” class of functions
and we will denote by
0
the function which minimizes the expected riskin
. The function
0
is our ideal estimator, andit is often called the
target 
function. This functioncannot be found in practice, because the probabil-ity distribution
(
x
,y
) that defines the expectedrisk is unknown, and only a sample of it, the dataset
D
, is available. To overcome this shortcom-ing we need an
induction principle
that we canuse to “learn” from the limited number of train-ing data we have. SLT, as developed by Vapnik[15], builds on the so-called
empirical risk mini-mization (ERM)
induction principle. The ERMmethod consists in using the data set
D
to builda stochastic approximation of the expected risk,which is usually called the
empirical risk 
, definedas
emp
[
;
] =1
i
=1
(
y
i
,
(
x
i
))
.
Straight minimization of the empirical risk in
can be problematic. First, it is usually an
ill-posed 
problem [14], in the sense that there mightbe many, possibly infinitely many, functions min-imizing the empirical risk. Second, it can lead to
overfitting 
, meaning that although the minimumof the empirical risk can be very close zero, theexpected risk – which is what we are really inter-ested in – can be very large.SLT provides probabilistic bounds on the dis-tance between the empirical and expected risk of any function (therefore including the minimizerof the empirical risk in a function space that canbe used to control overfitting). The bounds in-volve the number of examples
and the
capacity 
h
of the function space, a quantity measuring the“complexity” of the space. Appropriate capacityquantities are defined in the theory, the most pop-ular one being the VC-dimension [16] or scale sen-sitive versions of it [9], [1]. The bounds have thefollowing general form: with probability at least
η
[
]
<
emp
[
] + Φ(
 
h,η
)
.
(2)where
h
is the capacity, and Φ an increasing func-tion of 
h
and
η
. For more information and forexact forms of function Φ we refer the reader to[16], [15], [1]. Intuitively, if the capacity of thefunction space in which we perform empirical riskminimization is very large and the number of ex-amples is small, then the distance between the em-pirical and expected risk can be large and overfit-ting is very likely to occur.Since the space
is usually very large (i.e.
could be the space of square integrable functions),one typically considers smaller hypothesis spaces
H
. Moreover, inequality (2) suggests an alterna-tive method for achieving good generalization: in-stead of minimizing the empirical risk, find thebest trade off between the empirical risk and the
complexity of the hypothesis space
measured by thesecond term in the r.h.s. of inequality (2). Thisobservation leads to the method of 
Structural Risk Minimization (SRM)
.
 
?? 3The idea of SRM is to define a nested sequenceof hypothesis spaces
1
2
...
M
, whereeach hypothesis space
m
has finite capacity
h
m
and larger than that of all previous sets, that is:
h
1
h
2
,...,
h
M
. For example
m
could bethe set of polynomials of degree
m
, or a set of splines with
m
nodes, or some more complicatednonlinear parameterization. Using such a nestedsequence of more and more complex hypothesisspaces, SRM consists of choosing the minimizerof the empirical risk in the space
m
for whichthe bound on the
structural risk 
, as measured bythe right hand side of inequality (2), is minimized.Further information about the statistical proper-ties of SRM can be found in [3], [15].To summarize, in SLT the problem of learn-ing form examples is solved in three steps: (a) wedefine a loss function
(
y,
(
x
)) measuring the er-ror of predicting the output of input
x
with
(
x
)when the actual output is
y
; (b) we define a nestedsequence of hypothesis spaces
m
,m
= 1
,...,M 
whose capacity is an increasing function of 
m
; (c)we minimize the empirical risk in each of 
m
andchoose, among the solutions found, the one withthe best trade off between the empirical risk andthe capacity as given by the right hand side of inequality (2).
3. Learning machines
3.1. Learning as functional minimization 
We now consider hypothesis spaces which aresubsets of a Reproducing Kernel Hilbert Space(RKHS) [17]. A RKHS is a Hilbert space of func-tions
of the form
(
x
) =
N
n
=1
a
n
φ
n
(
x
), where
{
φ
n
(
x
)
}
N
n
=1
is a set of given, linearly independentbasis functions and
can be possibly infinite. ARKHS is equipped with a norm which is definedas:
2
K
=
N
n
=1
a
2
n
λ
n
,
where
{
λ
n
}
N
n
=1
is a decreasing, positive sequenceof real values whose sum is finite. The constants
λ
n
and the basis functions
{
φ
n
}
N
n
=1
define thesymmetric positive definite kernel function:
(
x
,
y
) =
N
n
=1
λ
n
φ
n
(
x
)
φ
n
(
y
)
,
A nested sequence of spaces of functions in theRKHS can be constructed by bounding the RKHSnorm of functions in the space. This can be doneby defining a set of constants
A
1
< A
2
< ... < A
M
and considering spaces of the form:
m
=
{
RKHS 
:
K
A
m
}
It can be shown that the capacity of the hypothe-sis spaces
m
is an increasing function of 
A
m
(seefor example [5]). According to the scheme givenat the end of section 2, the solution of the learn-ing problem is found by solving, for each
A
m
, thefollowing optimization problem:min
i
=1
(
y
i
,
(
x
i
))subject to
K
A
m
and choosing, among the solutions found for each
A
m
, the one with the best trade off between em-pirical risk and capacity, i.e. the one which min-imizes the bound on the structural risk as givenby inequality (2).The implementation of the SRM method de-scribed above is not practical because it requiresto look for the solution of a large number con-strained optimization problems. This difficulty isovercome by searching for the minimum of:
[
] =1
i
=1
(
y
i
,
(
x
i
)) +
λ
2
K
.
(3)The functional
[
] contains both the empiricalrisk and the norm (complexity or smoothness) of 
in the RKHS, similarly to functionals consideredin regularization theory [14]. The
regularization parameter 
λ
penalizes functions with high capac-ity: the larger
λ
, the smaller the RKHS norm of the solution will be.When implementing SRM, the key issue is thechoice of the hypothesis space, i.e. the param-eter
m
where the structural risk is minimized.In the case of the functional of equation (3), thekey issue becomes the choice of the regularizationparameter
λ
. These two problems, as discussedin [5], are related, and the SRM method can inprinciple be used to choose
λ
[15]. In practice, in-stead of using SRM other methods are used such
Search History:
Searching...
Result 00 of 00
00 results for result for
  • p.
  • More From This User

    Notes
    Load more