Machine Learning 1

Christian Wolf
INSA-Lyon
LIRIS
Méthodes avancées en image et vidéo (Advanced methods in image and video)
C. Wolf (Doua): Machine Learning 1 (introduction, classification)
C. Wolf (Doua): Machine Learning 2 (features, deep learning, graphical models)
M. Ardabilian (ECL): Content-based indexing: evaluation of image-based techniques and systems
M. Ardabilian (ECL): Multi-image acquisition and applications
M. Ardabilian (ECL): Super-resolution approaches and applications
M. Ardabilian (ECL): Fusion methods in imaging
M. Ardabilian (ECL): Object detection, analysis and recognition
J. Mille (Doua): Variational methods
M. Ardabilian (ECL): Facial biometrics
E. Dellandréa (ECL): Sound in multimedia data: coding, compression and analysis
S. Duffner (Doua): Object tracking
E. Dellandréa (ECL): An example of audio analysis: emotion recognition from speech and music signals
S. Bres (Doua): Image and video indexing
V. Eglin (Doua): Digital document analysis 1
F. LeBourgeois (Doua): Digital document analysis 2
Outline

1 Introduction
0 – General principles
1 – Fitting and generalization, model complexity
2 – Empirical risk minimization
Pattern recognition
Supervised classification

[Figures: example recognition tasks; object classes "Luc", "Boite", "Lego"; handwritten digits 0–9.]
Fitting and generalization

• The data are generated by a function.
• Goal: assuming the function is unknown, predict t from x.

[Figure 1.2 (Bishop): a training data set of N points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.]
Underfitting vs. overfitting

[Figure 1.4 (Bishop): polynomials of various orders M (M = 0, 1, 3, 9), shown as red curves, fitted to the data set shown in Figure 1.2. M = 0 and M = 1 underfit; M = 9 overfits.]
Figure 1.5 (Bishop): graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of M.

The root-mean-square (RMS) error is defined by

    ERMS = √(2E(w⋆)/N)    (1.3)

in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t. Graphs of the training and test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new observations of x. We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations of the function sin(2πx). For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients.
[C. Bishop, Pattern recognition and Machine learning, 2006]!
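The behaviour of ERMS across model orders can be sketched numerically. The following is a minimal NumPy illustration in the spirit of the figure; the data, seed and noise level are invented for the example and are not taken from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of Bishop's example: t = sin(2*pi*x) + noise
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def fit_poly(x, t, M):
    """Least-squares fit of a degree-M polynomial; returns the coefficients w."""
    return np.polyfit(x, t, M)

def e_rms(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error."""
    residuals = np.polyval(w, x) - t
    E = 0.5 * np.sum(residuals ** 2)
    return np.sqrt(2 * E / len(x))

# The training error is non-increasing with model complexity M;
# at M = 9 the polynomial interpolates the N = 10 points exactly.
train_errors = [e_rms(fit_poly(x, t, M), x, t) for M in (0, 1, 3, 9)]
```

The test-set error, by contrast, rises again for large M, which is precisely the overfitting effect shown in Figure 1.5.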
Cross-validation

The split of the data into two parts changes from one iteration to the next. The performance measure is the average over all iterations.
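The iteration scheme above can be sketched as a small k-fold procedure; this is an illustrative NumPy sketch (function names and the toy model are invented for the example):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Split the sample indices into k roughly equal folds."""
    return np.array_split(np.arange(n_samples), k)

def cross_val_score(fit, predict, X, t, k=5):
    """Average error over k iterations: each fold is held out once for
    validation while the model is fitted on the remaining data."""
    folds = kfold_indices(len(X), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], t[train_idx])
        pred = predict(model, X[val_idx])
        scores.append(np.mean((pred - t[val_idx]) ** 2))
    return np.mean(scores)

# Usage: a degree-1 polynomial model on noisy linear toy data
rng = np.random.default_rng(1)
X = np.linspace(0, 1, 20)
t = 2 * X + rng.normal(scale=0.05, size=20)
mse = cross_val_score(lambda X, t: np.polyfit(X, t, 1),
                      lambda w, X: np.polyval(w, X), X, t, k=5)
```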
M = 9

Figure 1.6 (Bishop): plots of the solutions obtained by minimizing the sum-of-squares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

[C. Bishop, Pattern recognition and Machine learning, 2006]

The resulting polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.
Regularization

Add an extra term: a restriction on the model parameters.

One technique that is often used to control the over-fitting phenomenon is that of regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

    E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ∥w∥²    (1.4)

where ∥w∥² ≡ wᵀw = w0² + w1² + ... + wM², and the coefficient λ (the regularization parameter) governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form. Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.

M = 9

Figure 1.7 (Bishop): plots of M = 9 polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4), for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. Figure 1.7 shows the results of fitting the polynomial of order M = 9 to the same data set as before, but now using the regularized error function given by (1.4). The case of no regularizer, i.e. λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Supervised classification

[Diagram: a new input is fed to a classifier, which outputs a class; the classifier itself is produced by a learning algorithm from training data.]
Supervised classification

Families of classifiers:

• k-NN (k nearest neighbors): a new data point is assigned the majority class among its k nearest neighbors in the training set (illustrated with handwritten digits in a 2D feature space, Dim 1 × Dim 2). Optimal in the limit of an infinite amount of training data.
• Generative (Bayesian) classifiers: model the class-conditional densities p(x|Ck) and the posterior probabilities p(Ck|x) (Bishop, Fig. 1.27).
• Neural networks.
• Random forests: "A forest is an ensemble of randomized decision trees" (figure caption).
• Kernel machines: e.g. the ν-SVM for regression applied to sinusoidal synthetic data (figure caption).
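A minimal k-NN classifier can be sketched as follows; this is an illustrative NumPy example on invented 2D data, not the digit data from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Assign x the majority class among its k nearest training samples
    (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]

# Two well-separated 2D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# A new data point near the second cluster
label = knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3)
```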
Supervised classification

Generative (Bayesian) classifiers model the class-conditional densities; linear discriminative classifiers, neural networks and random forests model the decision directly.

Figure 1.27 (Bishop): example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

The class-conditional model also allows new inputs that have low probability under the model, and for which the predictions may be of low accuracy, to be detected, which is known as outlier detection or novelty detection (Bishop, 1994; Tarassenko, 1995).

However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution p(x, Ck) when in fact we only really need the posterior probabilities p(Ck|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in exploring the relative merits of generative and discriminative approaches to machine learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).

An even simpler approach is (c), in which we use the training data to find a discriminant function f(x) that maps each x directly onto a class label, thereby combining the inference and decision stages into a single learning problem. In the example of Figure 1.27, this would correspond to finding the value of x shown by the vertical green line, because this is the decision boundary giving the minimum probability of misclassification.

With option (c), however, we no longer have access to the posterior probabilities p(Ck|x). There are many powerful reasons for wanting to compute the posterior probabilities, even if we subsequently use them to make decisions. These include:

Minimizing risk. Consider a problem in which the elements of the loss matrix are subjected to revision from time to time (such as might occur in a financial application).
Bayesian modeling

At our disposal: the likelihood of the data p(x|Ck). We want the posterior probability p(Ck|x).

We might expect probabilities to play a role in making decisions. When we obtain the X-ray image x for a new patient, our goal is to decide which of the two classes to assign to the image. We are interested in the probabilities of the two classes given the image, which are given by p(Ck|x). To obtain the posterior probability, the model must be inverted using Bayes' rule:

    p(Ck|x) = p(x|Ck) p(Ck) / p(x)

where p(x|Ck) is the likelihood and p(Ck) the prior probability. The denominator p(x) can be ignored for the maximization over k.

[C. Bishop, Pattern recognition and Machine learning, 2006]
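Bayes' rule above is just a normalized product of likelihood and prior; a minimal numerical sketch (the likelihood and prior values are invented for the example):

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule: p(C_k|x) is proportional to p(x|C_k) p(C_k);
    the evidence p(x) is only the normalizing constant."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

# Hypothetical two-class example: class densities evaluated at one input x
p_x_given_C = [0.5, 0.2]   # p(x|C1), p(x|C2)
p_C = [0.3, 0.7]           # prior probabilities p(C1), p(C2)

post = posterior(p_x_given_C, p_C)
# argmax over k is unchanged if we skip dividing by p(x)
```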
The prior probability models the outcome, independently of the observations. In simple cases, it can be estimated directly (e.g. from class frequencies).

Note that any of the quantities appearing in Bayes' theorem can be obtained from the joint distribution p(x, Ck) by either marginalizing or conditioning with respect to the appropriate variables. We can now interpret p(Ck) as the prior probability of class Ck: for example, the probability that a person has cancer, before we take the X-ray measurement and independent of the information contained in the X-ray. If our aim is to minimize the chance of misclassification, then intuitively we would choose the class having the higher posterior probability. We now show that this intuition is correct.
Linear model for classification

A decision function is modeled by a parametric function.

4.1.1 Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that

    y(x) = wᵀx + w0    (4.4)

where w is called a weight vector, and w0 is a bias (not to be confused with bias in the statistical sense). The negative of the bias is sometimes called a threshold. An input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise. The corresponding decision boundary is therefore defined by the relation y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space. Consider two points xA and xB, both of which lie on the decision surface. Because y(xA) = y(xB) = 0, we have wᵀ(xA − xB) = 0, and hence the vector w is orthogonal to every vector lying within the decision surface, so w determines the orientation of the decision surface. Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by

    wᵀx / ∥w∥ = −w0 / ∥w∥    (4.5)

More generally, the signed distance r of a point x from the decision surface is

    r = y(x) / ∥w∥    (4.7)

This result is illustrated in Figure 4.1.

As with the linear regression models in Chapter 3, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy 'input' value x0 = 1 and then define w̃ = (w0, w) and x̃ = (x0, x), so that

    y(x) = w̃ᵀx̃    (4.8)

Interpretation: the bias w0 can be absorbed into the other parameters by appending the constant "1" to the inputs; the constant "1" adds a "bias" to the model. In this case, the decision surfaces are D-dimensional hyperplanes passing through the origin of the (D + 1)-dimensional expanded input space.

4.1.2 Multiple classes

Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties (Duda and Hart, 1973), as we now show. One option is K − 1 classifiers, each of which separates one class from the rest (one-versus-the-rest); an alternative is one classifier for every possible pair of classes, which is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.

One justification for using least squares in such a context is that it approximates the conditional expectation E[t|x] of the target values given the input vector. For the binary coding scheme, this conditional expectation is given by the vector of posterior class probabilities. Unfortunately, however, these probabilities are typically approximated rather poorly; indeed, the approximations can have values outside the range (0, 1), due to the limited flexibility of a linear model, as we shall see shortly.

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions.
vector Modèles linéaires (K classes)!
However,
posteriorthis tooprobabilities.
class runs into theUnfortunately,
problem of ambiguous
however, theseregions, as illustrated
probabilities
in the
are typically right-hand diagramrather
approximated of Figure 4.2.
poorly, indeed the approximations can have values
outside Wethecan
Plusieurs
avoid
range (0, these
1), due
fonctions
difficulties by considering
to the limited
paramétriques flexibility
:! of a linear
singlemodel
K-class discriminant
as we shall
comprising
see shortly. K linear functions of the form
!
Each class Ck is described by its own linear model so that
! yk (x) = wk x + wk0
T
(4.9)
yk (x) = wkT x + wk0 (4.13)
!
and then assigning a point x to class Ck if yk (x) > yj (x) for all j ̸= k. The decision
boundary
where = between
1, . . . ,vectorielle,
Enk notation K.class Ck and
We can class
conveniently
et en is therefore
Cj group
intégrant lethese given
biaistogether
:! by yk (x)
using = ynota-
vector j (x) and
tion
hence corresponds to a (D − 1)-dimensional hyperplane defined by
so that
y(x) = W " Tx # (4.14)
(wk − wj )T x + (wk0 − wj 0 ) = 0. (4.10)
This has the same form as the decision boundary for the two-class case discussed in
Interprétation
Section 4.1.1, and so analogous geometrical properties apply. : !
The decision regions of such a discriminant! are always singly connected and
convex. To see this, consider two points xA and x! B both of which lie inside decision
region Rk , as illustrated in Figure 4.3. Any point x
! that lies on the line connecting
« The winner takes it all »!
xA and xB can be expressed in the form
!
! = λxA + (1 − λ)xB
x (4.11)
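The "winner takes it all" rule over K linear functions, with biases absorbed via the constant "1" input, can be sketched as follows; the weights below are invented to make the three classes win on different parts of the 1D input axis:

```python
import numpy as np

def predict_class(W, x):
    """K-class linear discriminant: y_k(x) = w_k^T x + w_k0 for each class,
    with the bias absorbed via a constant '1' input; the class with the
    largest score wins ('the winner takes it all')."""
    x_tilde = np.concatenate(([1.0], x))   # x~ = (1, x)
    scores = W @ x_tilde                   # y(x) = W~^T x~, one score per class
    return int(np.argmax(scores))

# Three classes over 1D inputs; each row of W is (w_k0, w_k)
W = np.array([[0.0, -1.0],    # class 0 wins for small x
              [-0.5, 0.0],    # class 1 wins in the middle
              [-2.0, 2.0]])   # class 2 wins for large x
```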
A simple problem (1D, 3 classes)

Linear classifier: 1D inputs, 3 classes.

[Figure: input and parameters; training data, the "true" boundaries between the classes, and the estimated boundaries.]
The nonlinear case

Preprocessing: a nonlinear transformation of the data (to be chosen in advance according to the application):
Gaussian basis functions

Figure 4.12 (Bishop): illustration of the role of nonlinear basis functions in linear classification models. The left plot shows the original input space (x1, x2) together with data points from two classes labelled red and blue. Two 'Gaussian' basis functions φ1(x) and φ2(x) are defined in this space with centres shown by the green crosses and with contours shown by the green circles. The right-hand plot shows the corresponding feature space (φ1, φ2) together with the linear decision boundary obtained by a logistic regression model of the form discussed in Section 4.3.2. This corresponds to a nonlinear decision boundary in the original input space, shown by the black curve in the left-hand plot.
[C. Bishop, Pattern recognition and Machine learning, 2006]!
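A Gaussian basis-function transform like the one in the figure can be sketched as follows; the centres and width are invented for the example:

```python
import numpy as np

def gaussian_basis(X, centres, s=0.3):
    """Map each input x to features phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)),
    one feature per centre mu_j."""
    X = np.atleast_2d(X)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * s ** 2))

# Two hypothetical centres; a point located at a centre gives phi = 1 there
centres = np.array([[-0.5, 0.0], [0.5, 0.0]])
phi = gaussian_basis(np.array([[-0.5, 0.0], [0.0, 0.0]]), centres)
```

A linear classifier applied to these features yields a nonlinear decision boundary in the original input space.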
4.3.2 Logistic regression

Logistic regression (2 classes): a linear model (on transformed inputs) plus a nonlinearity at the output; the posterior probability is modeled directly.

We begin our treatment of generalized linear models by considering the problem of two-class classification. In our discussion of generative approaches in Section 4.2, we saw that under rather general assumptions, the posterior probability of class C1 can be written as a logistic sigmoid acting on a linear function of the feature vector φ, so that

    p(C1|φ) = y(φ) = σ(wᵀφ)    (4.87)

with p(C2|φ) = 1 − p(C1|φ). Here σ(·) is the logistic function ("sigmoid"), ensuring that the outputs lie between 0 and 1, defined by

    σ(η) = 1 / (1 + exp(−η))    (2.199)

In the terminology of statistics, this model is known as logistic regression, although it should be emphasized that it is a model for classification rather than regression.

For an M-dimensional feature space φ, this model has M adjustable parameters. By contrast, if we had fitted Gaussian class-conditional densities using maximum likelihood, we would have used 2M parameters for the means and M(M + 1)/2 parameters for the (shared) covariance matrix. Together with the class prior p(C1), this gives a total of M(M + 5)/2 + 1 parameters, which grows quadratically with M, in contrast to the linear dependence on M of the number of parameters in logistic regression. For large values of M, there is a clear advantage in working with the logistic regression model directly.

We now use maximum likelihood to determine the parameters of the logistic regression model. To do this, we shall make use of the derivative of the logistic sigmoid function, which can conveniently be expressed in terms of the sigmoid function itself.
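The model of (4.87), a sigmoid applied to a linear function of the features, can be sketched in a few lines; the weights and feature vector below are invented for the example:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (2.199), squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, phi):
    """p(C1|phi) = sigma(w^T phi) (4.87)."""
    return sigmoid(phi @ w)

w = np.array([2.0, -1.0])
phi = np.array([0.5, 1.0])   # w^T phi = 0 here, so the point is on the boundary
p1 = predict_proba(w, phi)   # p(C1|phi)
p2 = 1.0 - p1                # p(C2|phi)
```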
Learning the parameters

The derivative of the softmax function is

    ∂yk/∂aj = yk(Ikj − yj)    (4.106)

where Ikj are the elements of the identity matrix.

Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme, in which the target vector tn for a feature vector φn belonging to class Ck is a binary vector with all elements zero except for element k, which equals one. The likelihood of the data is the probability of obtaining the observations given the parameters and the inputs:

    p(T|w1, ..., wK) = Π_{n=1}^{N} Π_{k=1}^{K} p(Ck|φn)^{tnk} = Π_{n=1}^{N} Π_{k=1}^{K} ynk^{tnk}    (4.107)

where ynk = yk(φn), and T is an N × K matrix of target variables with elements tnk. To estimate the parameters, we minimize an error function, the negative logarithm of the likelihood:

    E(w1, ..., wK) = − ln p(T|w1, ..., wK) = − Σ_{n=1}^{N} Σ_{k=1}^{K} tnk ln ynk    (4.108)

which is known as the cross-entropy error function ("cross-entropy loss") for the multiclass classification problem.

We now take the gradient of the error function with respect to one of the parameter vectors wj, and minimize it. Making use of the result (4.106) for the derivatives of the softmax function, we obtain

    ∇wj E(w1, ..., wK) = Σ_{n=1}^{N} (ynj − tnj) φn    (4.109)

[C. Bishop, Pattern recognition and Machine learning, 2006]
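The gradient formula (4.109) can be checked numerically against a finite-difference approximation of the cross-entropy (4.108); the data below are random and invented for the check:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax y_k = exp(a_k) / sum_j exp(a_j)."""
    A = A - A.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T):
    """E = -sum_n sum_k t_nk ln y_nk (4.108), with Y = softmax(Phi W)."""
    Y = softmax(Phi @ W)
    return -np.sum(T * np.log(Y))

def gradient(W, Phi, T):
    """Gradient w.r.t. each w_j: sum_n (y_nj - t_nj) phi_n (4.109)."""
    Y = softmax(Phi @ W)
    return Phi.T @ (Y - T)

rng = np.random.default_rng(3)
Phi = rng.normal(size=(5, 3))                 # 5 samples, 3 features
T = np.eye(4)[rng.integers(0, 4, size=5)]     # 1-of-K targets, K = 4
W = rng.normal(size=(3, 4))

G = gradient(W, Phi, T)
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
fd = (cross_entropy(W2, Phi, T) - cross_entropy(W, Phi, T)) / eps
```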
Application: detection of simple objects

A 20×20 window is slid over the image. The pixels of a window are given as inputs.
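The sliding-window extraction can be sketched as follows; the stride and image size are invented for the example (the slides only specify a 20×20 window):

```python
import numpy as np

def sliding_windows(image, size=20, stride=4):
    """Yield (row, col, pixels) for every size x size window slid over the
    image; the flattened pixels of each window form the classifier input."""
    H, W = image.shape
    for r in range(0, H - size + 1, stride):
        for c in range(0, W - size + 1, stride):
            yield r, c, image[r:r + size, c:c + size].reshape(-1)

image = np.zeros((40, 60))
windows = list(sliding_windows(image))
```

Each 400-dimensional pixel vector would then be fed to a classifier such as the logistic regression model above.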
(The biases have been absorbed.)

As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations the network can generate are not the most general possible.

[Diagram: input layer → hidden layer → output layer.]
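The two-stage structure described above, with biases absorbed into the weights, can be sketched as a forward pass; the layer sizes and random weights are invented for the example:

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """Two-stage network: a linear map, a tanh hidden nonlinearity, a second
    linear map and a logistic sigmoid output. Biases are absorbed by
    appending a constant '1' to the inputs of each stage."""
    x_tilde = np.concatenate(([1.0], x))
    z = np.tanh(W1 @ x_tilde)                       # hidden-unit activations
    z_tilde = np.concatenate(([1.0], z))
    return 1.0 / (1.0 + np.exp(-(W2 @ z_tilde)))    # output in (0, 1)

rng = np.random.default_rng(4)
W1 = rng.normal(size=(2, 3))   # 2 hidden units, 2 inputs (+ bias column)
W2 = rng.normal(size=(1, 3))   # 1 output, 2 hidden units (+ bias column)
y = mlp_forward(np.array([0.5, -0.5]), W1, W2)
```

Because tanh and the sigmoid are smooth, this whole function is differentiable in W1 and W2, which is what makes gradient-based training possible.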
Some remarks

• The number of units in the hidden layer is adjustable.
An example

Figure 5.4 (Bishop): example of the solution of a simple two-class classification problem involving synthetic data, using a neural network having two inputs, two hidden units with 'tanh' activation functions, and a single output having a logistic sigmoid activation function. The dashed blue lines show the z = 0.5 contours for each of the hidden units, and the red line shows the y = 0.5 decision surface for the network. For comparison, the green line denotes the optimal decision boundary computed from the distributions used to generate the data.

[C. Bishop, Pattern recognition and Machine learning, 2006]

For a network with M hidden units there are M such sign-flip symmetries, and thus any given weight vector will be one of a set of 2^M equivalent weight vectors.
Similarly, imagine that we interchange the values of all of the weights (and the
bias) leading both into and out of a particular hidden unit with the corresponding
values of the weights (and bias) associated with a different hidden unit. Again, this
clearly leaves the network input–output mapping function unchanged, but it corre-
sponds to a different choice of weight vector. For M hidden units, any given weight vector will belong to a set of M! equivalent weight vectors associated with this interchange symmetry, giving an overall weight-space symmetry factor of M! 2^M.

The hidden units can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of K mutually exclusive classes. The binary target variables tk ∈ {0, 1} have a 1-of-K coding scheme indicating the class, and the network outputs are interpreted as yk(x, w) = p(tk = 1|x), leading to the following error function

    E(w) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_{kn} ln yk(xn, w).    (5.24)

Training: gradient descent

The error function is that of logistic regression. Minimizing it by repeated function evaluations alone is expensive: each evaluation of E would require O(W) steps, so the computational effort needed to find the minimum using such an approach would be O(W³). Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of ∇E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps and so the minimum can now be found in O(W²) steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.
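The multiclass cross-entropy error (5.24) is a one-liner in code. A minimal sketch in Python/NumPy; the function and variable names are illustrative, not from the course:

```python
import numpy as np

def cross_entropy(Y, T):
    """Multiclass cross-entropy E(w) = -sum_n sum_k t_nk ln y_nk (Bishop 5.24).

    Y: (N, K) network outputs y_k(x_n, w), assumed to be valid probabilities.
    T: (N, K) targets in 1-of-K coding.
    """
    eps = 1e-12  # guard against log(0)
    return -np.sum(T * np.log(Y + eps))

# Example: two samples, three classes.
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
T = np.array([[1, 0, 0],
              [0, 1, 0]])
E = cross_entropy(Y, T)  # -(ln 0.7 + ln 0.8) ≈ 0.58
```

Only the output assigned to the true class of each sample contributes, which is why the 1-of-K coding makes the double sum collapse to one term per sample.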
5.2.4 Gradient descent optimization

Iterative minimization by gradient descent (a step in the direction of greatest change):

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that
    w^(τ+1) = w^(τ) − η ∇E(w^(τ))    (5.41)
where the parameter η > 0 is known as the learning rate. After each such update, the
gradient is re-evaluated for the new weight vector and the process repeated. Note that
Possibility of getting stuck in a local minimum:

the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate ∇E. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function.

Figure 5.5 Geometrical view of the error function E(w) as a surface sitting over weight space. Point wA is a local minimum and wC is the global minimum. At any point wB, the local gradient of the error surface is given by the vector ∇E.
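The batch update rule (5.41) can be sketched in a few lines of Python. The quadratic error below is only a stand-in so the loop has something to minimize; all names are illustrative:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=100):
    """Batch gradient descent: w <- w - eta * grad_E(w)  (Bishop 5.41)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)  # one step along the negative gradient
    return w

# Toy example: E(w) = ||w - c||^2 has gradient 2(w - c) and its minimum at c.
c = np.array([1.0, -2.0])
w_min = gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0])
# w_min converges toward c = [1, -2]
```

Note that this toy error is convex, so the sketch does not exhibit the local-minimum trapping discussed above; with a non-convex E the same loop can converge to a point like wA in Figure 5.5, depending on w0.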
Backpropagation of the gradient (2)

The derivatives with respect to the parameters of the hidden layers are still missing. Recall the computation performed by a hidden unit: in a general feed-forward network, each unit computes a weighted sum of its inputs of the form

    aj = Σi wji zi    (5.48)

where zi is the activation of a unit, or input, that sends a connection to unit j, and wji is the weight associated with that connection. The sum in (5.48) is transformed by a nonlinear activation function h(·) to give the activation zj of unit j in the form

    zj = h(aj).    (5.49)

Note that one or more of the variables zi in the sum in (5.48) could be an input, and similarly, the unit j in (5.49) could be an output. Biases can be included in this sum by introducing an extra unit, or input, with activation fixed at +1, so we do not need to deal with biases explicitly. For each pattern in the training set, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units by successive application of (5.48) and (5.49). This process is often called forward propagation because it can be regarded as a forward flow of information through the network.

Now consider the evaluation of the derivative of En with respect to a weight wji. The outputs of the various units will depend on the particular input pattern n; however, in order to keep the notation uncluttered, we shall omit the subscript n from the network variables. To simplify, we first compute the derivatives with respect to the summed inputs. First we note that En depends on the weight wji only via the summed input aj to unit j. We can therefore apply the chain rule for partial derivatives to give

    ∂En/∂wji = (∂En/∂aj)(∂aj/∂wji).    (5.50)

We now introduce a useful notation

    δj ≡ ∂En/∂aj    (5.51)

where the δ's are often referred to as errors, for reasons we shall see shortly. Using (5.48) we can write

    ∂aj/∂wji = zi.    (5.52)

Substituting (5.51) and (5.52) into (5.50), we then obtain

    ∂En/∂wji = δj zi.    (5.53)

Equation (5.53) tells us that the required derivative is obtained simply by multiplying the value of δ for the unit at the output end of the weight by the value of z for the unit at the input end of the weight (where z = 1 in the case of a bias). Thus, in order to evaluate the derivatives, we need only to calculate the value of δj for each hidden and output unit in the network, and then apply (5.53).

For the output layer:

As we have seen already, for the output units we have

    δk = yk − tk    (5.54)

provided we are using the canonical link as the output-unit activation function: the logistic sigmoid activation function together with the cross-entropy error function, or the softmax activation function together with the multiclass cross-entropy error function.

Each derivative can be computed from the derivatives of the following layer:

To evaluate the δ's for the hidden units, we again make use of the chain rule for partial derivatives,

    δj ≡ ∂En/∂aj = Σk (∂En/∂ak)(∂ak/∂aj)    (5.55)

where the sum runs over all units k to which unit j sends connections. If we now substitute the definition of δ given by (5.51) into (5.55), and make use of (5.48) and (5.49), we obtain the following backpropagation formula

    δj = h′(aj) Σk wkj δk    (5.56)

which tells us that the value of δ for a particular hidden unit can be obtained by propagating the δ's backwards from units higher up in the network.

Figure 5.7 Illustration of the calculation of δj for hidden unit j by backpropagation of the δ's from those units k to which unit j sends connections. The blue arrow denotes the direction of information flow during forward propagation, and the red arrows indicate the backward propagation of error information.

[C. Bishop, Pattern Recognition and Machine Learning, 2006]
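The two key equations, δk = yk − tk at the output (5.54) and δj = h′(aj) Σk wkj δk for hidden units (5.56), can be sketched for a single hidden layer with tanh activations. This is a minimal illustration (names are not from the course), using the sum-of-squares error with linear outputs, another canonical-link pairing for which δk = yk − tk holds:

```python
import numpy as np

def backprop_single_example(x, t, W1, W2):
    """One forward/backward pass for a two-layer network with tanh hidden
    units and linear outputs, under En = 0.5 * ||y - t||^2.
    Returns the gradients dEn/dW1 and dEn/dW2 via (5.50)-(5.56)."""
    # Forward propagation (5.48)-(5.49)
    a1 = W1 @ x        # hidden pre-activations a_j
    z1 = np.tanh(a1)   # hidden activations z_j = h(a_j)
    y = W2 @ z1        # linear outputs

    # Backward propagation
    delta_out = y - t                               # delta_k = y_k - t_k   (5.54)
    delta_hid = (1.0 - z1**2) * (W2.T @ delta_out)  # h'(a) = 1 - tanh^2    (5.56)

    # dEn/dw_ji = delta_j * z_i  (5.53); z_i = x_i for the first layer
    grad_W2 = np.outer(delta_out, z1)
    grad_W1 = np.outer(delta_hid, x)
    return grad_W1, grad_W2
```

A standard sanity check is to compare each returned gradient entry against a finite-difference estimate of ∂En/∂wji, which should agree to several decimal places.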
Back to generalization and model selection

Complexity of the prediction model:
• Number of hidden layers
• Number of hidden units per layer
Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.

One hidden layer with M units
Data generated from the sinusoidal function. Regularized error of the form

    Ẽ(w) = E(w) + (λ/2) wᵀw.    (5.112)

[C. Bishop, Pattern Recognition and Machine Learning, 2006]

This regularizer is also known as weight decay and has been discussed at length
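The regularizer (5.112) simply adds λ/2 wᵀw to the error, so its gradient adds λw; under gradient descent each step then shrinks the weights by a factor (1 − ηλ), which is where the name "weight decay" comes from. A minimal sketch with illustrative names:

```python
import numpy as np

def regularized_grad(grad_E, w, lam):
    """Gradient of E~(w) = E(w) + (lam/2) * w.T @ w   (Bishop 5.112)."""
    return grad_E(w) + lam * w

# With gradient descent the update becomes
#   w <- w - eta * (grad_E(w) + lam * w) = (1 - eta*lam) * w - eta * grad_E(w),
# i.e. the weights decay toward zero in addition to following the data gradient.
w = np.array([2.0, -1.0])
g = regularized_grad(lambda w: np.zeros_like(w), w, lam=0.1)
# with a zero data gradient, g = 0.1 * w = [0.2, -0.1]
```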
Early stopping

Training is iterative (one iteration over the data is called an "epoch"). Overfitting can be reduced by stopping training when the error on the validation set begins to increase.
Figure 5.12 An illustration of the behaviour of training set error (left) and validation set error (right) during a typical training session, as a function of the iteration step, for the sinusoidal data set. The goal of achieving the best generalization performance suggests that training should be stopped at the point shown by the vertical dashed lines, corresponding to the minimum of the validation set error.
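The early-stopping rule above can be sketched as a training loop that tracks the validation error and stops once it has stopped improving. The `train_one_epoch` and `validation_error` callables are placeholders for whatever model is being trained:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    """Stop training when the validation error has not improved for
    `patience` consecutive epochs; return the best epoch and its error."""
    best_err = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch  # new validation minimum
        elif epoch - best_epoch >= patience:
            break  # validation error has been rising: stop
    return best_epoch, best_err
```

The `patience` parameter guards against stopping at the first small bump in a noisy validation curve; in practice one keeps a copy of the weights from the best epoch.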
(Diagram: gesture → neural network. Plots: class-conditional densities p(x|C1), p(x|C2) on the left and posterior probabilities p(C1|x), p(C2|x) on the right; see the Figure 1.27 caption.)
Figure 1.27 Example of the class-conditional densities for two classes having a single input variable x (left
plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the
class-conditional density p(x|C1 ), shown in blue on the left plot, has no effect on the posterior probabilities. The
vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification
rate.
!
p(Ck |x), which can be obtained directly through approach (b). Indeed, the class-
conditional densities may contain a lot of structure that has little effect on the pos-
terior probabilities, as illustrated in Figure 1.27. There has been much interest in
exploring the relative merits of generative and discriminative approaches to machine
learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).
An even simpler approach is (c) in which we use the training data to find a discriminant function f (x) that maps each x directly onto a class label, thereby
combining the inference and decision stages into a single learning problem. In the
example of Figure 1.27, this would correspond to finding the value of x shown by
the vertical green line, because this is the decision boundary giving the minimum
probability of misclassification.
With option (c), however, we no longer have access to the posterior probabilities p(Ck |x). There are many powerful reasons for wanting to compute the posterior
probabilities, even if we subsequently use them to make decisions. These include:
Minimizing risk. Consider a problem in which the elements of the loss matrix are
subjected to revision from time to time (such as might occur in a financial application).
Random forests

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees.
Basics: information theory

Entropy h(x) of a random variable x: a measure of information content.

If we knew that an event was certain to happen, we would receive no information from observing it. Our measure of information content therefore depends on the probability distribution p(x), and we look for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content.

It can be derived from two independent variables: if we have two events x and y that are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, so that h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent and so p(x, y) = p(x)p(y). From these two relationships, it follows that h(x) must be given by the logarithm of p(x), and so we have

    h(x) = − log2 p(x)    (1.92)

where the negative sign ensures that information is positive or zero. Note that low probability events x correspond to high information content. The choice of basis for the logarithm is arbitrary, and for the moment we shall adopt the convention prevalent in information theory of using logarithms to the base of 2. In this case, as we shall see shortly, the units of h(x) are bits ('binary digits').

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given by

    H[x] = − Σx p(x) log2 p(x).    (1.93)

This important quantity is called the entropy of the random variable x. Note that lim p→0 p ln p = 0, and so we shall take p(x) ln p(x) = 0 whenever we encounter a value for x such that p(x) = 0.

(Figure: histograms of two probability distributions, a sharply peaked one with entropy H = 1.77 and a broader one with H = 3.09.)

So far we have given a rather heuristic motivation for the definition of information.
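Equations (1.92) and (1.93) in code; the two example distributions mirror the peaked vs. broad histograms mentioned above (a peaked distribution has lower entropy):

```python
import math

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x), with p log p = 0 when p = 0 (Bishop 1.93)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform8 = [1 / 8] * 8
peaked = [0.5, 0.25, 0.125, 0.125]
H_u = entropy_bits(uniform8)  # 3.0 bits: 8 equally likely states need 3 bits
H_p = entropy_bits(peaked)    # 1.75 bits: a peaked distribution has lower entropy
```

The uniform distribution maximizes the entropy for a given number of states, which is why the broader histogram in the figure has the higher value.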
Extremely fast classification!

Several trees are trained on different subsets of the data; each leaf contains a distribution of labels. The corresponding distributions are added:

    P(c|I, x) = (1/T) Σ_{t=1}^{T} Pt(c|I, x).    (2)

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input.

3.3. Randomized decision forests

Randomized decision trees and forests [35, 30, 2, 8] have proven fast and effective multi-class classifiers for many tasks [20, 23, 36], and can be implemented efficiently on the GPU [34]. As illustrated in Fig. 4, a forest is an ensemble of T decision trees, each consisting of split and leaf nodes. Each split node consists of a feature fθ and a threshold τ. To classify pixel x in image I, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison to threshold τ. At the leaf node reached in tree t, a learned distribution Pt(c|I, x) over body part labels c is stored. The distributions are averaged together for all trees in the forest to give the final classification (Eq. 2 above).

At a given pixel x, the features compute

    fθ(I, x) = dI(x + u/dI(x)) − dI(x + v/dI(x))    (1)

where dI(x) is the depth at pixel x in image I, and parameters θ = (u, v) describe offsets u and v. The normalization of the offsets by 1/dI(x) ensures the features are depth invariant: at a given point on the body, a fixed world-space offset will result whether the pixel is close or far from the camera. The features are thus 3D translation invariant (modulo perspective effects). If an offset pixel lies on the background or outside the bounds of the image, the depth probe dI(x′) is given a large positive constant value. A feature looking upwards, for example, will give a large positive response for pixels x near the top of the body but a smaller response for pixels x lower down the body; other features may help find thin vertical structures such as the arm. Individually these features provide only a weak signal about which part of the body the pixel belongs to, but combined in a decision forest they are sufficient to accurately disambiguate all trained parts. The design of these features was strongly motivated by their computational efficiency: no preprocessing is needed, each feature need only read a few image pixels and perform at most 5 arithmetic operations, and the features can be straightforwardly implemented on the GPU.

Learning the parameters

Each tree is learned independently. Learning proceeds layer by layer, since the gradient of the error is not available!

Training. Each tree is trained on a different set of randomly synthesized images. A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts. Each tree is trained using the following algorithm [20]:

1. Randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ). (For a given layer, propose a set of candidate parameters and thresholds.)

2. Partition the set of examples Q = {(I, x)} into left and right subsets by each φ (split of the training data into 2 parts):

    Ql(φ) = { (I, x) | fθ(I, x) < τ }    (3)
    Qr(φ) = Q \ Ql(φ)    (4)

3. Compute the φ giving the largest gain in information (choice of the candidate parameters maximizing the information gain, an entropy measure):

    φ* = argmax_φ G(φ)    (5)

    G(φ) = H(Q) − Σ_{s∈{l,r}} (|Qs(φ)| / |Q|) H(Qs(φ))    (6)

where the Shannon entropy H(Q) is computed on the normalized histogram of body part labels lI(x) for all (I, x) ∈ Q.

4. Recurse for the left and right subsets.
Application: pose estimation (… Andrew Blake, Microsoft Research Cambridge & Xbox Incubation)

- “R”: Right hand is used to perform the gestures.
- “L”: Left hand is used to perform the gestures.
- “B”: Both hands are used to perform the gestures.

After that, we can collect data of a desired class every time we want by pressing the corresponding button on the computer keyboard. The buttons for collecting each class of data are:
- “1”: Stop
- “2”: Scroll: left-to-right
- “3”: Scroll: right-to-left

While the button is pressed, the data is stored for each frame, and each training vector is automatically labeled as we indicate the label by pressing the button. This process is ground-truth collection, because we gather the proper objective data for the training.

Once we have new data stored, we can update the neural network with supervised training, since we have the labels corresponding to the respective feature vectors.

Taking into account that the new data is a ground-truth set, we cannot update the Filter neural network, because all the recorded data is labelled as a gesture. As we need both gesture samples and no-gesture samples to train the Filter neural network, the only classifier that is trained is the gesture neural network.
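The split-selection step of the tree-training algorithm described above (Eqs. 3-6: partition the examples by each candidate, keep the candidate maximizing the information gain) can be sketched generically. This is an illustrative toy on scalar features, not the paper's depth features; all names are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the normalized label histogram."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(examples, candidates):
    """examples: list of (feature_value, label); candidates: thresholds tau.
    Returns the tau maximizing G = H(Q) - sum_s |Qs|/|Q| H(Qs)  (Eqs. 3-6)."""
    labels = [l for _, l in examples]
    H_Q = entropy(labels)
    best_tau, best_gain = None, -1.0
    for tau in candidates:
        Ql = [l for f, l in examples if f < tau]   # Eq. 3
        Qr = [l for f, l in examples if f >= tau]  # Eq. 4
        if not Ql or not Qr:
            continue  # degenerate split, no information gained
        gain = H_Q - sum(len(Qs) / len(labels) * entropy(Qs)
                         for Qs in (Ql, Qr))       # Eq. 6
        if gain > best_gain:
            best_tau, best_gain = tau, gain
    return best_tau, best_gain

# Perfectly separable toy data: the split at tau = 0.5 recovers all 1 bit.
data = [(0.1, "a"), (0.2, "a"), (0.8, "b"), (0.9, "b")]
tau, gain = best_split(data, candidates=[0.15, 0.5, 0.85])
# tau == 0.5, gain == 1.0 bit
```

Recursing on the two subsets, and randomizing both the data and the candidate set per tree, yields the ensemble whose leaf distributions are averaged by Eq. 2.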
Sources: Wikipedia; Google Fight (24.7.2014)