
M2R IGI

Course "Advanced Methods in Image and Video"

Machine Learning 1

Christian Wolf
INSA-Lyon
LIRIS
Advanced Methods in Image and Video: course overview

C. Wolf (Doua): Machine Learning 1 (introduction, classification)
C. Wolf (Doua): Machine Learning 2 (features, deep learning, graphical models)
M. Ardabilian (ECL): Content-based indexing: evaluation of image-based techniques and systems
M. Ardabilian (ECL): Multi-image acquisition and applications
M. Ardabilian (ECL): Super-resolution approaches and applications
M. Ardabilian (ECL): Fusion methods in imaging
M. Ardabilian (ECL): Object detection, analysis and recognition
J. Mille (Doua): Variational methods
M. Ardabilian (ECL): Facial biometrics
E. Dellandréa (ECL): Sound in multimedia data: coding, compression and analysis
S. Duffner (Doua): Object tracking
E. Dellandréa (ECL): An example of audio analysis: emotion recognition from speech and music signals
S. Bres (Doua): Image and video indexing
V. Eglin (Doua): Digital document analysis 1
F. LeBourgeois (Doua): Digital document analysis 2
Outline

1  Introduction
–  General principles
–  Fitting and generalization, model complexity
–  Empirical risk minimization

2  Supervised classification
–  k-NN (k nearest neighbors)
–  Linear models for classification
–  Neural networks
–  Decision trees and random forests
–  SVM (Support Vector Machines)

3  Feature extraction
–  Hand-crafted features
–  PCA (Principal Component Analysis)
–  Convolutional neural networks ("deep learning")
Some sources

Christopher M. Bishop
"Pattern Recognition and Machine Learning"
Springer Verlag, 2006
Microsoft Research, Cambridge, UK

Kevin P. Murphy
"Machine Learning: A Probabilistic Perspective"
MIT Press, 2012
Google (formerly: University of British Columbia)
Pattern recognition
Supervised classification

(Example image with labeled objects: Luc, box, Lego)

Object recognition from images:
•  Digits, letters, logos
•  Biometrics: faces, fingerprints, iris, ...
•  Object classes (cars, trees, boats, ...)
•  Specific objects (a particular car, a mug, ...)
Prediction

In general, we would like to predict a value t from an observation x.

If t is continuous:   regression
If t is discrete:     classification

The parameters are learned from data.
Learning and generalization

Learning to classify data amounts to learning a decision function: the boundary between the classes.

Learning and generalization

The complexity of a decision function depends on how the class labels are grouped in feature space.

(Figure: handwritten digit samples scattered in a 2-D feature space.)
Fitting and generalization

•  The data are generated by a function.
•  Goal: assuming the function is unknown, predict t from x.

(Bishop Figure 1.2: a training set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. The goal is to predict the value of t for some new value of x, without knowledge of the green curve.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
Fitting and generalization

"Fitting" a polynomial of order M:

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j     (1.1)

where M is the order of the polynomial and x^j denotes x raised to the power of j. The coefficients w_0, ..., w_M are collectively denoted by the vector \mathbf{w}. Although y(x, \mathbf{w}) is a nonlinear function of x, it is a linear function of the coefficients \mathbf{w} (a linear model).

Least-squares criterion (of the errors):

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2     (1.2)

The error function measures the misfit between the prediction y(x_n, \mathbf{w}) and the target t_n for each training point: it is one half of the sum of the squared vertical displacements of the data points from the fitted curve (Bishop Figure 1.3). It is a nonnegative quantity that is zero if, and only if, the curve passes exactly through every training point.

Linear derivative -> direct solution: we solve the curve fitting problem by choosing the value of \mathbf{w} for which E(\mathbf{w}) is as small as possible; since the derivative of E(\mathbf{w}) is linear in \mathbf{w}, the minimizing coefficients can be found in closed form.

[C. Bishop, Pattern recognition and Machine learning, 2006]
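To make the closed-form least-squares fit concrete, here is a minimal NumPy sketch (not from the slides); the synthetic data follow Bishop's sin(2πx)-plus-noise setup, and the function names and noise level are assumptions chosen for illustration.

```python
import numpy as np

def make_data(n=10, noise=0.3, seed=0):
    """Synthetic data in the style of Bishop Fig. 1.2: t = sin(2*pi*x) + noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, size=n)
    return x, t

def fit_polynomial(x, t, M):
    """Minimize E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 in closed form:
    build the design matrix Phi[n, j] = x_n**j and solve the least-squares problem."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict(w, x):
    return np.vander(x, len(w), increasing=True) @ w

if __name__ == "__main__":
    x, t = make_data()
    w = fit_polynomial(x, t, M=3)
    print("coefficients:", w)
```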
Model selection

Which order M should we choose for the polynomial?

(Bishop Figure 1.4: polynomials of orders M = 0, 1, 3 and 9, shown as red curves, fitted to the data set of Figure 1.2. M = 0 and M = 1 underfit the data, M = 3 gives a good fit, and M = 9 overfits.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
Model selection

Split the data into two parts:
•  Training set
•  Test set

Root-mean-square (RMS) error:

E_{RMS} = \sqrt{2 E(\mathbf{w}^\star) / N}     (1.3)

Dividing by N allows data sets of different sizes to be compared on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.

(Bishop Figure 1.5: RMS error evaluated on the training set and on an independent test set for various values of M. Small values of M give a relatively large test error because the corresponding polynomials are too inflexible; for M = 9 the training error goes to zero, as expected since this polynomial has 10 degrees of freedom corresponding to the 10 coefficients, but the test error becomes very large.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
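A self-contained sketch of this model-selection procedure (assumptions: the same synthetic sin(2πx) data as above, training size 10, test size 100):

```python
import numpy as np

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N), with E the sum-of-squares error (1.2)."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(2.0 * 0.5 * np.sum((y - t) ** 2) / len(x))

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 100)

for M in range(10):   # compare training and test error as in Bishop Fig. 1.5
    w = fit(x_train, t_train, M)
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```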
Cross-validation

The split of the data into two parts is changed from one iteration to the next. The performance measure is the average over all iterations.

(Bishop: illustration of S-fold cross-validation with S = 4. The data are partitioned into S groups, here of equal size; in each run, S − 1 groups are used for training and the held-out group is used for evaluation. The procedure is repeated for each possible held-out group and the performance scores of the S runs are averaged. When data is particularly scarce, it may be appropriate to take S = N, where N is the total number of data points, which gives leave-one-out cross-validation.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
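A minimal S-fold cross-validation sketch under the same assumptions (synthetic sin(2πx) data, polynomial fit); the fold count S = 4 matches the figure, the rest is illustrative:

```python
import numpy as np

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def rms(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))   # equivalent to E_RMS in (1.3)

def s_fold_cv(x, t, M, S=4, seed=0):
    """Train on S-1 folds, evaluate on the held-out fold, repeat for every fold,
    and average the held-out errors."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, S)
    scores = []
    for k in range(S):
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        w = fit(x[train], t[train], M)
        scores.append(rms(w, x[folds[k]], t[folds[k]]))
    return np.mean(scores)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)
for M in range(10):
    print(M, round(s_fold_cv(x, t, M), 3))
```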
Big Data!

The overfitting problem diminishes as the size of the training set increases.

(Bishop Figure 1.6: solutions obtained by minimizing the sum-of-squares error with the M = 9 polynomial for N = 15 data points (left) and N = 100 data points (right). Increasing the size of the data set reduces the over-fitting problem; with little data, the fitted curve passes through each point exactly but oscillates wildly between them.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
Regularization

To use relatively complex and flexible models while controlling over-fitting, add an extra term to the error function (1.2) that restricts the parameters of the model, i.e. discourages the coefficients from reaching large values. The simplest such penalty is the sum of squares of all coefficients, leading to the modified error function

\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2     (1.4)

where \|\mathbf{w}\|^2 \equiv \mathbf{w}^T\mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2, and the regularization parameter λ governs the relative importance of the regularization term compared with the sum-of-squares error term. The coefficient w_0 is often excluded from the regularizer, because its inclusion makes the results depend on the choice of origin for the target variable (Hastie et al., 2001); it may also be included with its own regularization coefficient. The error function (1.4) can again be minimized exactly in closed form. In the statistics literature such techniques are known as shrinkage methods because they reduce the values of the coefficients; the particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970), and in the context of neural networks this approach is known as weight decay.

(Bishop Figure 1.7: M = 9 polynomials fitted to the data set of Figure 1.2 using the regularized error function (1.4), for ln λ = −18 and ln λ = 0. The case of no regularizer, λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
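A sketch of the closed-form ridge (regularized) solution under the same assumptions as the earlier fitting sketches; for simplicity the bias w_0 is penalized too, although the slide notes it is often excluded:

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Closed-form minimizer of (1.4): (Phi^T Phi + lam * I) w = Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
for ln_lam in (-18.0, 0.0):          # the two values shown in Bishop Fig. 1.7
    print(ln_lam, np.round(fit_ridge(x, t, 9, np.exp(ln_lam)), 2))
```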
Supervised classification

New input  ->  Classifier  ->  Class

Labeled training set  ->  Learning algorithm  ->  Model
Supervised classification

•  k-NN (k nearest neighbors)
•  Generative (Bayesian) linear classifiers
•  Discriminative linear classifiers
•  Neural networks
•  Random forests
•  SVM (Support Vector Machines)

(Illustrations: Bishop Figure 1.27, class-conditional densities and posterior probabilities for two classes with a single input variable x; a randomized decision forest, an ensemble of trees with split nodes and leaf nodes; and ν-SVM regression applied to synthetic sinusoidal data.)
"k nearest neighbors" (k-NN)

Probably the simplest classifier.
The training set (feature vectors and their labels) is stored.
For a new vector, the closest stored data point is found and its label serves as the prediction.

(Figure: digit samples in a 2-D feature space (Dim 1, Dim 2) with a new data point to classify.)

Optimal in the limit of an infinite amount of training data.
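A minimal k-NN sketch (not from the slides); the toy data, the Euclidean distance and the majority vote for k > 1 are assumptions for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """Return the majority label among the k training points closest to x_new."""
    d = np.linalg.norm(X_train - x_new, axis=1)        # Euclidean distances
    nearest = np.argsort(d)[:k]                        # indices of k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy usage: two 2-D classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))
```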
Supervised classification (roadmap)

•  k-NN (k nearest neighbors)
•  Generative (Bayesian) linear classifiers
•  Discriminative linear classifiers
•  Neural networks
•  Random forests
•  SVM (Support Vector Machines)
Probabilistic modelling

Consider a two-class classification problem (classes C1 and C2) with 1-D data modelled by a random variable x (the observation).

Probabilistic modelling:
–  What we want: the posterior probability p(Ck | x).
–  What we have at our disposal: the likelihood of the data, p(x | Ck).

(Bishop figure: class-conditional densities p(x|C1) and p(x|C2) for a single input variable x.)

[C. Bishop, Pattern recognition and Machine learning, 2006]

Bayesian modelling

To obtain the posterior probability, the model must be inverted using Bayes' rule:

p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{p(x)}

–  p(x | Ck) is the likelihood.
–  p(Ck) is the prior probability: it models our knowledge about the outcome, independently of the observations x. In simple cases it is simply tabulated.
–  The denominator p(x) can be ignored when we only need the maximizing class.

p(Ck | x) is the corresponding posterior probability, revised in the light of the observation x. If the aim is to minimize the probability of assigning x to the wrong class, we intuitively choose the class with the higher posterior probability; this intuition can be shown to be correct.

(Bishop Figure 1.27: class-conditional densities for two classes with a single input variable x (left) and the corresponding posterior probabilities (right); the vertical green line shows the decision boundary giving the minimum misclassification rate.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
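A small sketch of Bayes' rule in code; the two Gaussian class-conditional densities, their parameters and the uniform prior are made-up assumptions (they are not the densities of Bishop Figure 1.27):

```python
import numpy as np

priors = np.array([0.5, 0.5])                        # tabulated p(C1), p(C2)
means, stds = np.array([0.3, 0.7]), np.array([0.10, 0.15])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x):
    """Bayes' rule: p(Ck|x) = p(x|Ck) p(Ck) / p(x)."""
    joint = gauss(x, means, stds) * priors           # p(x|Ck) * p(Ck)
    return joint / joint.sum()                       # normalize by p(x)

post = posterior(0.5)
print(post, "-> decide class", np.argmax(post) + 1)
```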
Generating new data

This family of models is called "generative models" because it allows new data respecting the model to be generated:

1.  Sample a class according to the prior distribution p(Ck).
2.  For that class, sample the observation according to the likelihood p(x | Ck).

(Bishop Figure 1.27: class-conditional densities p(x|C1), p(x|C2) (left) and posterior probabilities (right).)

[C. Bishop, Pattern recognition and Machine learning, 2006]
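A sketch of this two-step (ancestral) sampling procedure, reusing the same hypothetical two-class Gaussian model as in the previous sketch; all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.5])
means, stds = np.array([0.3, 0.7]), np.array([0.10, 0.15])

def generate(n):
    """1) draw the class from the prior, 2) draw x from p(x|Ck)."""
    k = rng.choice(len(priors), size=n, p=priors)    # step 1: k ~ p(C_k)
    x = rng.normal(means[k], stds[k])                # step 2: x ~ p(x|C_k)
    return k, x

print(generate(5))
```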
Probability of error

The probability of committing an error of a given type can be computed directly.

(Bishop Figure 1.24: schematic illustration of the joint probabilities p(x, C1) and p(x, C2) plotted against x, together with the decision boundary x = x̂. Values x ≥ x̂ are classified as class C2 and belong to decision region R2, whereas points x < x̂ are classified as C1 and belong to R1. For x < x̂ the errors are due to points from class C2 being misclassified as C1, and conversely for x ≥ x̂ the errors are due to points from class C1 being misclassified as C2; as the location x̂ of the decision boundary is varied, the combined area of these error regions changes.)

By shifting the decision threshold, the red error region can be shrunk.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Supervised classification (roadmap)

•  k-NN (k nearest neighbors)
•  Generative (Bayesian) linear classifiers
•  Discriminative linear classifiers
•  Neural networks
•  Random forests
•  SVM (Support Vector Machines)
Linear models (2 classes)

A linear model for classification: the decision function is modelled by a parametric function

y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0     (4.4)

where \mathbf{w} is the weight vector and w_0 is a bias (not to be confused with bias in the statistical sense), which can be absorbed into the other parameters by appending the constant "1" to the inputs; the constant "1" adds a "bias" to the model.

Interpretation: an input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise. The corresponding decision boundary is defined by y(x) = 0, a (D − 1)-dimensional hyperplane in the D-dimensional input space. The weight vector \mathbf{w} is orthogonal to every vector lying within the decision surface and therefore determines its orientation, while the bias determines its location: the normal distance from the origin to the decision surface is

-\frac{w_0}{\|\mathbf{w}\|}.     (4.5)

Linear models (K classes)

Several parametric functions, one per class:

y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \quad k = 1, \ldots, K     (4.9)

In vector notation, with the biases absorbed into the weights:

\mathbf{y}(\mathbf{x}) = \mathbf{W}^T \tilde{\mathbf{x}}     (4.14)

Interpretation: "The winner takes it all". A point x is assigned to class Ck if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between classes Ck and Cj is therefore given by y_k(x) = y_j(x), the (D − 1)-dimensional hyperplane

(\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k0} - w_{j0}) = 0.     (4.10)

This has the same form as the decision boundary in the two-class case, so analogous geometrical properties apply; the decision regions of such a discriminant are always singly connected and convex.
Simple problem (1D, 3 classes)

A linear classifier with 1-D inputs and 3 classes:
–  Input: x (augmented with the constant 1)
–  Parameters: one weight vector per class
–  Output for one class: y_k(x)
–  Output for all classes: the vector of the three y_k(x)

Simple problem (1D, 3 classes)

(Figure: training data, the "true" boundaries between the classes, and the estimated boundaries.)
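A minimal sketch of this 1-D, 3-class winner-takes-all classifier; the weight values below are made up purely to produce three ordered decision regions:

```python
import numpy as np

W = np.array([[ 2.0, -6.0],     # class 1: bias w_10, weight w_11
              [ 1.0,  0.0],     # class 2
              [-4.0,  6.0]])    # class 3

def predict(x):
    """Winner-takes-all: augment x with a constant 1, compute y_k = w_k^T x_tilde
    for every class, and return the class with the largest output."""
    x_tilde = np.array([1.0, x])
    y = W @ x_tilde
    return int(np.argmax(y)) + 1   # classes numbered 1..3

for x in (0.0, 0.5, 1.0):
    print(x, "->", predict(x))
```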
The nonlinear case

Preprocessing: a nonlinear transformation of the data (to be chosen beforehand, depending on the application). Example: Gaussian basis functions.

(Bishop Figure 4.12: the role of nonlinear basis functions in linear classification models. The left plot shows the original input space (x1, x2) with data points from two classes, red and blue, and two 'Gaussian' basis functions φ1(x), φ2(x) whose centres are shown by green crosses and contours by green circles. The right plot shows the corresponding feature space (φ1, φ2) together with the linear decision boundary obtained by a logistic regression model; it corresponds to a nonlinear decision boundary in the original input space, shown by the black curve in the left plot.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
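A sketch of such a fixed Gaussian feature map; the centres, the width s and the toy inputs are assumptions chosen for illustration (a linear classifier would then be applied to the resulting features):

```python
import numpy as np

def gaussian_basis(X, centres, s=0.2):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for every input x and centre c_j."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * s ** 2))

X = np.array([[0.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
centres = np.array([[-0.5, 0.5], [0.5, -0.5]])   # two basis functions -> 2-D feature space
print(gaussian_basis(X, centres))
```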
Logistic regression (2 classes)

A linear model (on transformed inputs) with a nonlinearity at the output: the posterior probability of class C1 is written as a logistic sigmoid acting on a linear function of the feature vector φ,

p(C_1 | \phi) = y(\phi) = \sigma(\mathbf{w}^T \phi)     (4.87)

with p(C_2 | \phi) = 1 - p(C_1 | \phi). This is a direct model of the posterior probability. In the terminology of statistics this model is known as logistic regression, although it should be emphasized that it is a model for classification rather than regression.

σ(·) is the logistic ("sigmoid") function, which ensures that the outputs lie between 0 and 1:

\sigma(\eta) = \frac{1}{1 + \exp(-\eta)}     (2.199)

For an M-dimensional feature space φ, this model has M adjustable parameters. By contrast, fitting Gaussian class-conditional densities by maximum likelihood would require 2M parameters for the means and M(M + 1)/2 parameters for the (shared) covariance matrix, which grows quadratically with M; for large M there is a clear advantage in working with the logistic regression model directly. The parameters are determined by maximum likelihood, making use of the derivative of the logistic sigmoid, which can conveniently be expressed in terms of the sigmoid function itself.
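A minimal sketch of the two-class model (4.87); the weights and feature vectors below are made-up values, with the first feature playing the role of the absorbed bias:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def predict_proba(w, Phi):
    """p(C1|phi) = sigma(w^T phi) for each row of Phi; p(C2|phi) = 1 - p(C1|phi)."""
    return sigmoid(Phi @ w)

w = np.array([0.5, 2.0, -1.0])
Phi = np.array([[1.0,  0.2, 0.1],
                [1.0, -1.0, 0.5]])
print(predict_proba(w, Phi))
```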

Logistic regression (K classes)

A similar extension as in the linear case: the posterior probabilities are given by a softmax transformation of linear functions of the feature variables,

p(C_k | \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}     (4.104)

The "softmax" ensures that the outputs are probabilities. The 'activations' a_k are the linear parts,

a_k = \mathbf{w}_k^T \phi.     (4.105)

In the generative treatment, maximum likelihood was used to determine separately the class-conditional densities and the class priors, and the posterior probabilities were then found using Bayes' theorem, thereby implicitly determining the parameters {w_k}. Here we use maximum likelihood to determine the parameters {w_k} of this model directly. This requires the derivatives of y_k with respect to all of the activations a_j,

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)     (4.106)

where I_{kj} are the elements of the identity matrix.
Logistic regression: learning

A training set is assumed to be known:
–  the inputs, transformed by the basis functions, giving feature vectors φ_n;
–  the known outputs, coded in a "1-of-K" ("one-hot") scheme: the target vector t_n is a binary vector with all elements zero except the element corresponding to the "real" class of sample n, which equals one.

Objective: learn the parameters according to an optimality criterion.

Learning the parameters

The likelihood of the data is the probability of obtaining the observations given the parameters and the inputs:

p(\mathbf{T} | \mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k | \phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}     (4.107)

where y_{nk} = y_k(\phi_n) and T is an N × K matrix of target variables with elements t_{nk}.

To estimate the parameters, we minimize an error function, the negative logarithm of the likelihood:

E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\ln p(\mathbf{T} | \mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}     (4.108)

This cost function is known as the "cross-entropy loss" (the cross-entropy error function for the multiclass classification problem).

By computing its gradient with respect to one of the parameter vectors w_j, using the result (4.106) for the derivatives of the softmax function, it can be minimized:

\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \, \phi_n     (4.109)

[C. Bishop, Pattern recognition and Machine learning, 2006]
Application: detection of simple objects

A 20x20 window is slid over the image. The pixels of each window are given as inputs to the classifier.

[A. Mihoub, C. Wolf, G. Bailly, 2014]
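Putting the preceding pieces together (the softmax model (4.104), the cross-entropy error (4.108) and its gradient (4.109)), a minimal batch gradient-descent training sketch could look like the following; the toy data, learning rate and iteration count are assumptions, and in the detection application above Phi would simply hold the flattened window pixels:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)           # numerically stable softmax
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def fit_softmax_regression(Phi, T, lr=0.5, n_iter=2000):
    """Minimize the cross-entropy (4.108) by batch gradient descent, using the
    gradient (4.109): grad_{w_j} E = sum_n (y_nj - t_nj) phi_n.
    Phi: N x M features, T: N x K one-hot targets."""
    K, M = T.shape[1], Phi.shape[1]
    W = np.zeros((K, M))
    for _ in range(n_iter):
        Y = softmax(Phi @ W.T)          # N x K predicted posteriors y_nk
        W -= lr * ((Y - T).T @ Phi)     # one gradient step for all w_j at once
    return W

# toy usage: 1-D inputs with a bias feature, 3 classes, one-hot targets
Phi = np.array([[1.0, 0.1], [1.0, 0.2], [1.0, 0.5], [1.0, 0.6], [1.0, 0.9]])
T = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
W = fit_softmax_regression(Phi, T)
print(np.argmax(softmax(Phi @ W.T), axis=1))   # predicted classes for the training inputs
```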


Supervised classification (roadmap)

•  k-NN (k nearest neighbors)
•  Generative (Bayesian) linear classifiers
•  Discriminative linear classifiers
•  Neural networks
•  Random forests
•  SVM (Support Vector Machines)
Neural networks

Logistic regression: the basis functions are limited and must be chosen by hand for each application. What if we learned them?

The linear models for regression and classification are based on linear combinations of fixed nonlinear basis functions φ_j(x) and take the form

y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)     (5.1)

where f(·) is a nonlinear activation function in the case of classification and the identity in the case of regression. The goal is to extend this model by making the basis functions φ_j(x) depend on parameters that are adjusted, along with the coefficients {w_j}, during training.

If the basis functions have the same form as the output functions: neural networks use basis functions that follow the same form as (5.1), so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters. First we construct M linear combinations of the input variables x_1, ..., x_D:

a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}     (5.2)

f and h are nonlinear functions ("activation functions"); typically sigmoid, tanh or softmax.

[C. Bishop, Pattern recognition and Machine learning, 2006]

Neural networks

The biases can be absorbed into the weights by defining an additional input x_0 clamped at 1. This gives a two-layer neural network (one hidden layer, one output layer), the "multi-layer perceptron" (MLP):

y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)     (5.9)

(The biases have been absorbed into the weights.)

The network comprises two stages of processing, each resembling the perceptron model, which is why it is also known as the multilayer perceptron (MLP). A key difference compared to the perceptron is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities; the network function is therefore differentiable with respect to the network parameters, a property that plays a central role in training. If the activation functions of all the hidden units are linear, an equivalent network without hidden units can always be found, since the composition of successive linear transformations is itself a linear transformation.

(Diagram: input layer, hidden layer, output layer.)
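A minimal forward pass for the two-layer network (5.9); the layer sizes, tanh hidden activation and random weights are assumptions for illustration:

```python
import numpy as np

def mlp_forward(x, W1, W2):
    """Two-layer MLP: tanh hidden units, sigmoid output.
    x is augmented with a constant 1 so the biases are absorbed into W1 and W2."""
    x_tilde = np.concatenate(([1.0], x))
    z = np.tanh(W1 @ x_tilde)                           # hidden activations h(a_j)
    z_tilde = np.concatenate(([1.0], z))
    return 1.0 / (1.0 + np.exp(-(W2 @ z_tilde)))        # outputs sigma(a_k)

# toy usage: 2 inputs, 3 hidden units, 1 output, made-up weights
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))    # 3 hidden units x (1 bias + 2 inputs)
W2 = rng.normal(size=(1, 4))    # 1 output x (1 bias + 3 hidden units)
print(mlp_forward(np.array([0.5, -1.0]), W1, W2))
```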
Some remarks

•  The number of units in the hidden layer is adaptable.
•  Several hidden layers are possible. We will consider the feed-forward type: no cycles in the connections.
•  The output functions are differentiable with respect to the network parameters.
•  If the activation functions are linear, an equivalent network without a hidden layer can be found.
•  Neural networks can approximate any function to arbitrary precision if the number of units is sufficient.
An example

(Bishop Figure 5.4: solution of a simple two-class classification problem on synthetic data using a neural network with two inputs, two hidden units with 'tanh' activation functions, and a single output with a logistic sigmoid activation function. The dashed blue lines show the z = 0.5 contours of the two hidden units, the red line shows the y = 0.5 decision surface of the network, and the green line denotes the optimal decision boundary computed from the distributions used to generate the data.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
Learning: gradient descent

For the standard multiclass classification problem, in which each input is assigned to one of K mutually exclusive classes with 1-of-K coded targets, the error function is the same as for logistic regression, the cross-entropy

E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w})     (5.24)

where the network outputs y_k(\mathbf{x}, \mathbf{w}) are interpreted as p(t_k = 1 | \mathbf{x}).

Iterative minimization by gradient descent (a small step in the direction of the greatest change, i.e. the negative gradient):

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)})     (5.41)

where the parameter η > 0 is the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process is repeated.

The algorithm may get stuck in a local minimum:

(Bishop Figure 5.5: geometrical view of the error function E(w) as a surface sitting over weight space. Point w_A is a local minimum and w_B is the global minimum; at any point w_C, the local gradient of the error surface is given by the vector ∇E.)

[C. Bishop, Pattern recognition and Machine learning, 2006]
e minimum can now be found in O(W ) steps. For this reason, the use of grad
1999). Unlike gradient descent, these algorithms have the property that the error
dient-based
ormationfunction
formsalgorithm
the basis
always multiple eachtimes,
of atpractical
decreases eachthe
algorithms
iteration unless time for
weightusing ahasdifferent
training
vector neural
arrived randomly
at a networks
starting local
point, andminimum.
comparing the resulting performance on an independen
Two strategies: batch vs. on-line!

Batch: all of the data are used for each step!

Techniques that use the whole data set at once are called batch methods. For batch optimization, there are more
efficient methods than simple gradient descent, such as conjugate gradients and quasi-Newton methods (Gill et al.,
1981; Fletcher, 1987; Nocedal and Wright, 1999).

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural
networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of
independent observations comprise a sum of terms, one for each data point:

    E(w) = Σ_{n=1}^{N} E_n(w)        (5.42)

On-line (stochastic gradient descent): the gradient is computed for a single point (a single data item) at each
iteration:!

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an
update to the weight vector based on one data point at a time, so that

    w^(τ+1) = w^(τ) − η ∇E_n(w^(τ))        (5.43)

[C. Bishop, Pattern recognition and Machine learning, 2006]!
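As an illustration, here is a minimal NumPy sketch of the two update rules (5.41) and (5.43); the callables
grad_E and grad_E_n, the learning rate and the step counts are illustrative assumptions, not code from the course:

    import numpy as np

    def batch_gradient_descent(w, grad_E, eta=0.1, n_steps=100):
        # Batch update (5.41): the gradient over the full training set is
        # re-evaluated after every weight update.
        for _ in range(n_steps):
            w = w - eta * grad_E(w)
        return w

    def stochastic_gradient_descent(w, grad_E_n, N, eta=0.1, n_epochs=10, seed=0):
        # On-line / stochastic update (5.43): one data point n per update,
        # so one pass over the data (an "epoch") makes N small steps.
        rng = np.random.default_rng(seed)
        for _ in range(n_epochs):
            for n in rng.permutation(N):
                w = w - eta * grad_E_n(w, n)
        return w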
Computing the gradient!

Minimizing the error requires computing its gradient!

Each partial derivative could (in theory) be computed by a finite difference:!

    ∂E_n/∂w_ji = [ E_n(w_ji + ϵ) − E_n(w_ji) ] / ϵ + O(ϵ)        (5.68)

where ϵ ≪ 1. In a software simulation, the accuracy of the approximation to the derivatives can be improved by
making ϵ smaller, until numerical roundoff problems arise. The accuracy of the finite differences method can be
improved significantly by using symmetrical central differences of the form

    ∂E_n/∂w_ji = [ E_n(w_ji + ϵ) − E_n(w_ji − ϵ) ] / (2ϵ) + O(ϵ²)        (5.69)

In this case, the O(ϵ) corrections cancel, as can be verified by Taylor expansion of the right-hand side of
(5.69), and so the residual corrections are O(ϵ²). The number of computational steps is, however, roughly doubled
compared with (5.68).

Problem: very poor efficiency. For each data point, with W parameters (weights), we need 2·W "stimulations", each
requiring W computations -> O(W²)!

The main problem with numerical differentiation is that the highly desirable O(W) scaling has been lost: each
forward propagation requires O(W) steps, and each of the W weights must be perturbed individually, so that the
overall cost per data point is O(W²).

[C. Bishop, Pattern recognition and Machine learning, 2006]!
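In practice the central-difference formula (5.69) is mainly used to check an analytic gradient on a few weights; a
minimal sketch, assuming E is a callable returning the error for a complete 1-D float weight vector:

    import numpy as np

    def numerical_gradient(E, w, eps=1e-6):
        # Central differences (5.69): perturb each of the W weights in turn.
        # Cost: 2*W evaluations of E, each itself O(W)  ->  O(W^2) overall.
        grad = np.zeros_like(w)
        for i in range(w.size):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[i] += eps
            w_minus[i] -= eps
            grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
        return grad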
Backpropagation of the gradient (1)!

It is possible to compute the gradient directly.!

Very simple example: linear activation function (at the output), sum-of-squares error:!

Here we shall consider the problem of evaluating ∇E_n(w) for one such term in the error function. This may be
used directly for sequential optimization, or the results can be accumulated over the training set in the case of
batch methods:

    E(w) = Σ_{n=1}^{N} E_n(w)        (5.44)

Consider first a simple linear model in which the outputs y_k are linear combinations of the input variables x_i,
so that

    y_k = Σ_i w_ki x_i        (5.45)

together with an error function that, for a particular input pattern n, takes the form

    E_n = (1/2) Σ_k (y_nk − t_nk)²        (5.46)

where y_nk = y_k(x_n, w).

For the weights of the output layer, the derivatives can be given directly:!

    ∂E_n/∂w_ji = (y_nj − t_nj) x_ni        (5.47)

which can be interpreted as a 'local' computation involving the product of an 'error signal' y_nj − t_nj
associated with the output end of the link w_ji and the variable x_ni associated with the input end of the link.
In Section 4.3.2, we saw how a similar formula arises with the logistic sigmoid activation function together with
the cross-entropy error function, and similarly for the softmax activation function together with its matching
cross-entropy error function. We shall now see how this simple result extends to the more complex setting of
multilayer feed-forward networks.

[C. Bishop, Pattern recognition and Machine learning, 2006]!
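A small sketch of (5.45)-(5.47) for a single pattern (x_n, t_n), assuming a weight matrix W of shape (K, D); the
names are illustrative, not from the course:

    import numpy as np

    def output_layer_gradient(W, x_n, t_n):
        # Linear outputs (5.45): y_nk = sum_i W[k, i] * x_n[i]
        y_n = W @ x_n
        # Sum-of-squares error (5.46) and its gradient (5.47):
        #   dE_n / dW[k, i] = (y_nk - t_nk) * x_ni   (an outer product)
        E_n = 0.5 * np.sum((y_n - t_n) ** 2)
        grad_W = np.outer(y_n - t_n, x_n)
        return E_n, grad_W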
Backpropagation of the gradient (2)!

We still need the derivatives with respect to the parameters of the hidden layers. Recall the computation
performed by a hidden unit:!

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

    a_j = Σ_i w_ji z_i        (5.48)

where z_i is the activation of a unit, or input, that sends a connection to unit j, and w_ji is the weight
associated with that connection. The sum in (5.48) is transformed by a nonlinear activation function h(·) to give
the activation z_j of unit j in the form

    z_j = h(a_j)        (5.49)

To simplify, we first compute the derivatives with respect to the summed inputs a_j, and then apply the chain
rule:!

Note that E_n depends on the weight w_ji only via the summed input a_j to unit j. We can therefore apply the
chain rule for partial derivatives to give

    ∂E_n/∂w_ji = (∂E_n/∂a_j) (∂a_j/∂w_ji)        (5.50)

We now introduce the notation

    δ_j ≡ ∂E_n/∂a_j        (5.51)

where the δ's are often referred to as errors. Using (5.48) we can write

    ∂a_j/∂w_ji = z_i        (5.52)

Substituting (5.51) and (5.52) into (5.50), we then obtain

    ∂E_n/∂w_ji = δ_j z_i        (5.53)

For the output layer:!

    δ_k = y_k − t_k        (5.54)

provided we are using the canonical link as the output-unit activation function.

Each derivative can be computed from the derivatives of the next layer:!

    δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)        (5.55)

    δ_j = h'(a_j) Σ_k w_kj δ_k        (5.56)

where the sum in (5.55) runs over all units k to which unit j sends connections.

Figure 5.7 Illustration of the calculation of δ_j for hidden unit j by backpropagation of the δ's from those
units k to which unit j sends connections. The blue arrow denotes the direction of information flow during
forward propagation, and the red arrows indicate the backward propagation of error information.

[C. Bishop, Pattern recognition and Machine learning, 2006]!
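Putting (5.48)-(5.56) together for a network with one hidden layer of tanh units and linear outputs, a minimal
per-pattern sketch (the shapes and the choice of tanh are illustrative assumptions):

    import numpy as np

    def backprop_single_pattern(W1, W2, x, t):
        # Forward propagation (5.48)-(5.49)
        a1 = W1 @ x             # summed inputs a_j of the hidden units
        z1 = np.tanh(a1)        # hidden activations z_j = h(a_j)
        y = W2 @ z1             # linear output units

        # Output errors (5.54): delta_k = y_k - t_k
        delta2 = y - t
        # Backpropagation (5.56): delta_j = h'(a_j) * sum_k w_kj * delta_k
        delta1 = (1.0 - z1 ** 2) * (W2.T @ delta2)   # tanh'(a) = 1 - tanh(a)^2

        # Gradients (5.53): dE_n/dw_ji = delta_j * z_i
        grad_W2 = np.outer(delta2, z1)
        grad_W1 = np.outer(delta1, x)
        return grad_W1, grad_W2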
Back to: generalization and model selection!

Complexity of the prediction model:!
•  Number of hidden layers!
•  Number of hidden units per layer!

Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The
graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a
sum-of-squares error function using a scaled conjugate-gradient algorithm.

1 hidden layer with M units!
Data generated from the sinusoidal data set!

Complexity can also be controlled with a regularizer of the form

    Ẽ(w) = E(w) + (λ/2) wᵀw        (5.112)

This regularizer is also known as weight decay.

[C. Bishop, Pattern recognition and Machine learning, 2006]!
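With the quadratic regularizer (5.112) only the error and its gradient change, by a term proportional to w; a
one-line sketch, assuming grad_E is the gradient of the unregularized error:

    def regularized_error_gradient(w, grad_E, lam):
        # E~(w) = E(w) + (lam/2) * w.T @ w   =>   grad E~(w) = grad E(w) + lam * w
        # The extra term shrinks ("decays") the weights at every gradient step.
        return grad_E(w) + lam * w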
Early stopping!

Training is iterative (one iteration over the data is called an "epoch").!
Overfitting can be reduced by stopping training when the error on the validation set starts to increase.!

Error on the training set!                    Error on the validation set!

Figure 5.12 An illustration of the behaviour of training set error (left) and validation set error (right) during
a typical training session, as a function of the iteration step, for the sinusoidal data set. The goal of
achieving the best generalization performance suggests that training should be stopped at the point shown by the
vertical dashed lines, corresponding to the minimum of the validation set error.

[C. Bishop, Pattern recognition and Machine learning, 2006]!
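A sketch of the early-stopping loop, assuming two hypothetical helpers: train_one_epoch updates the model in
place and validation_error returns the error on the held-out set:

    import copy

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  max_epochs=200, patience=10):
        best_err, best_model, bad_epochs = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            err = validation_error(model)
            if err < best_err:                       # validation error still decreasing
                best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
            else:                                    # it has started to increase
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        return best_model, best_err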
Application: gesture recognition!

Classification of gestures from the articulated pose of a human (joint positions estimated with a Kinect
sensor).!

Gesture classes include scrolling left-to-right, right-to-left, up-to-down and down-to-up, phone pick up and
phone hang up.!

Pipeline: consecutive poses (joint positions) -> viewpoint-invariant features -> neural network -> gesture!

12 subjects!
Supervised classification!

•  k nearest neighbours (kppV)!
•  Linear generative classifiers (Bayesian)!
•  Linear discriminative classifiers!
•  Neural networks!
•  Random forests!
•  SVM (Support Vector Machines)!

Basics: information theory!

Entropy h(x) of a random variable x: a measure of information content.!

    h(x) = − log2 p(x)        (1.92)

where the negative sign ensures that information is positive or zero. Note that low probability events x
correspond to high information content. The choice of basis for the logarithm is arbitrary, and for the moment we
shall adopt the convention prevalent in information theory of using logarithms to the base of 2; in this case the
units of h(x) are bits ('binary digits').

Can be derived from two independent variables:!

If we have two events x and y that are unrelated, then the information gained from observing both of them should
be the sum of the information gained from each of them separately, so that h(x, y) = h(x) + h(y). Two unrelated
events will be statistically independent and so p(x, y) = p(x) p(y). From these two relationships it follows that
h(x) must be given by the logarithm of p(x).

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of
information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the
distribution p(x) and is given by

    H[x] = − Σ_x p(x) log2 p(x)        (1.93)

This important quantity is called the entropy of the random variable x. Note that lim_{p→0} p ln p = 0, and so we
shall take p(x) ln p(x) = 0 whenever we encounter a value for x such that p(x) = 0.

Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy
H for the broader distribution (H = 1.77 for the sharply peaked histogram, H = 3.09 for the broader one). The
largest entropy would arise from a uniform distribution, which would give H = − log2(1/30) = 4.91.

[C. Bishop, Pattern recognition and Machine learning, 2006]!
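A small sketch computing (1.93) from a histogram of counts (base-2 logarithm, i.e. bits):

    import numpy as np

    def entropy_bits(counts):
        # H[x] = - sum_x p(x) log2 p(x), with the convention p log p = 0 when p = 0.
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    # A uniform histogram over 30 bins gives the largest entropy, log2(30) = 4.91 bits.
    print(entropy_bits(np.ones(30)))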
Decision trees!

Extremely fast classification!!

•  At each node of a tree, one element of the feature vector is examined!
•  The tree is traversed to the left or to the right depending on a comparison with a threshold (different for
   each node)!
•  Each leaf contains a decision (a class)!
•  Parameters:!
   •  choice of the feature element for each node!
   •  threshold for each node!
   •  decision (class) for each leaf!

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue)
and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a
particular input.
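A sketch of how a single tree classifies a feature vector; the array-based layout and the field names are
illustrative assumptions:

    def predict_tree(tree, x):
        # tree["feature"][n], tree["threshold"][n]: element and threshold tested at node n
        # tree["left"][n], tree["right"][n]: child indices (-1 marks a leaf)
        # tree["leaf_value"][n]: decision (class or label distribution) stored at leaf n
        n = 0
        while tree["left"][n] != -1:
            if x[tree["feature"][n]] < tree["threshold"][n]:
                n = tree["left"][n]
            else:
                n = tree["right"][n]
        return tree["leaf_value"][n]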
Random forests!

Several trees are trained on different subsets of the data!
Each leaf contains a distribution over labels!

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue)
and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a
particular input.

The corresponding distributions are added (averaged over all trees in the forest) to give the final
classification:!

    P(c|I, x) = (1/T) Σ_{t=1}^{T} P_t(c|I, x)        (2)

Learning the parameters!

Each tree is learned independently!
Training proceeds level by level -> the gradient of the error is not available!!

Training. Each tree is trained on a different set of randomly synthesized images. A random subset of 2000 example
pixels from each image is chosen to ensure a roughly even distribution across body parts. Each tree is trained
using the following algorithm [20]:

1.  For a given level of the tree, randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters
    θ and thresholds τ).!
2.  Partition the set of training examples Q = {(I, x)} into left and right subsets by each φ:!

        Q_l(φ) = { (I, x) | f_θ(I, x) < τ }        (3)
        Q_r(φ) = Q \ Q_l(φ)                        (4)

3.  Choose the candidate maximizing the gain in information (an entropy measure):!

        φ* = argmax_φ G(φ)        (5)

        G(φ) = H(Q) − Σ_{s ∈ {l,r}} (|Q_s(φ)| / |Q|) H(Q_s(φ))        (6)

    where the Shannon entropy H(Q) is computed on the normalized histogram of body part labels l_I(x) for all
    (I, x) ∈ Q.

4.  Recurse.!

Application: human pose estimation (MS Kinect)!

Microsoft Research Cambridge & Xbox Incubation!

Figure 1. Overview. From a single input depth image, a per-pixel classification into body parts is inferred and
then used to produce 3D joint proposals (depth image -> body parts -> 3D joint proposals).

[J. Shotton et al., 2011]!
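A greedy sketch of steps 1-3 and of the averaging in Eq. (2), using random axis-aligned splits on generic feature
vectors instead of the depth features f_θ of the paper; every name below is an illustrative assumption:

    import numpy as np

    def entropy(labels, n_classes):
        # Shannon entropy of the normalized label histogram.
        p = np.bincount(labels, minlength=n_classes) / max(len(labels), 1)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def best_split(X, y, n_classes, n_candidates=50, seed=0):
        # Steps 1-3: propose random (feature, threshold) candidates and keep
        # the one maximizing the information gain G(phi) of Eq. (6).
        rng = np.random.default_rng(seed)
        H = entropy(y, n_classes)
        best = (None, None, -np.inf)
        for _ in range(n_candidates):
            f = int(rng.integers(X.shape[1]))
            tau = rng.uniform(X[:, f].min(), X[:, f].max())
            left = X[:, f] < tau
            if left.all() or not left.any():
                continue
            gain = H - (left.mean() * entropy(y[left], n_classes)
                        + (1 - left.mean()) * entropy(y[~left], n_classes))
            if gain > best[2]:
                best = (f, tau, gain)
        return best            # recurse on the two subsets until a stopping criterion

    def forest_predict(trees, x):
        # Eq. (2): average the leaf distributions P_t(c|x) over the T trees;
        # each tree is assumed to map a feature vector to a label distribution.
        return np.mean([tree(x) for tree in trees], axis=0)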
Training data!

Figure 2. Synthetic and real data: pairs of depth images and ground truth body parts (synthetic data are used for
both training and testing).

•  31 human body parts!
•  1 million synthetic images produced by a rendering pipeline (half of them generated from human motion
   capture)!
•  2000 pixels from each image = 2,000,000,000 training vectors!
•  3 trees, each 20 levels deep!
•  Each leaf: one histogram of 31 values, i.e. 2^20 × 31 = 32,505,856 values per tree!
•  2^20 − 1 = 1,048,575 thresholds per tree!

It is straightforward to synthesize realistic depth images of people and thus build a large training dataset
cheaply.
The features!

We employ simple depth comparison features, inspired by those in [20]. At a given pixel x, the features compute

    f_θ(I, x) = d_I(x + u / d_I(x)) − d_I(x + v / d_I(x))        (1)

where d_I(x) is the depth at pixel x in image I, and the parameters θ = (u, v) describe the offsets u and v. The
normalization of the offsets by 1/d_I(x) ensures that the features are depth invariant: at a given point on the
body, a fixed world-space offset will result whether the pixel is close to or far from the camera. The features
are thus 3D translation invariant (modulo perspective effects). The parts for left and right allow the classifier
to disambiguate the left and right sides of the body.

x ….  position of the pixel!
u, v ….  offsets!
d_I ….  depth!

Figure 3. Depth image features. The yellow crosses indicate the pixel x being classified. The red circles
indicate the offset pixels as defined in Eq. 1. In (a), the two example features give a large depth difference
response. In (b), the same two features at new image locations give a much smaller response.
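A sketch of Eq. (1) on a depth map stored as a 2D array; clamping the probes at the image border (instead of
returning a large constant for background pixels) is a simplifying assumption of this sketch:

    import numpy as np

    def depth_feature(depth, x, u, v):
        # f_theta(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x))
        # depth: 2D array of depths d_I; x: (row, col) pixel as a tuple; u, v: (row, col) offsets.
        def probe(offset):
            p = np.asarray(x, dtype=float) + np.asarray(offset, dtype=float) / depth[x]
            r = int(np.clip(round(p[0]), 0, depth.shape[0] - 1))
            c = int(np.clip(round(p[1]), 0, depth.shape[1] - 1))
            return depth[r, c]
        return probe(u) - probe(v)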
Supervised classification!

•  k nearest neighbours (kppV)!
•  Linear generative classifiers (Bayesian)!
•  Linear discriminative classifiers!
•  Neural networks!
•  Random forests!
•  SVM (Support Vector Machines)!
SVM (Support Vector Machines)!

The starting assumption is that the data are linearly separable (this assumption will later be partially
relaxed)!

For a set of linearly separable points, there are infinitely many separating hyperplanes!

(Figure: Wikipedia)!
SVM: the margin!
We look for the hyperplane with the maximal margin (the distance between the samples of different classes).!
Goal: reduce the error on unseen data.!
SVM: the "kernel trick"!
Some data are not linearly separable.!
Trick: project the data into a space where they are.!
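For example, a maximum-margin classifier with a non-linear kernel can be trained in a few lines with
scikit-learn; the toy data, the RBF kernel and the parameter values below are arbitrary illustrative choices:

    import numpy as np
    from sklearn.svm import SVC

    # Toy data that are not linearly separable: class = inside / outside a circle.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

    # The RBF kernel implicitly maps the data into a space where a separating
    # (maximum-margin) hyperplane exists; C controls the softness of the margin.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, y)
    print(clf.predict([[0.0, 0.0], [0.9, 0.9]]))   # expected: [1 0]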
SVM: applications!

SVMs have been used for nearly every application in image analysis / computer vision.!
Some applications developed at LIRIS:!

Recognition of individual actions [Wolf et Taylor, 2010]!
Recognition of collective actions [Baccouche/Mamalet/Wolf/Garcia/Baskurt, 2012]!
Text recognition [Wolf et Jolion, 2003]!
Popularity of classifiers!

Google N-gram viewer, query on 24.7.2014:!

Google Fight!
(24.7.2014)!