
3. Bayes and Normal Models
Aleix M. Martinez
aleix@ece.osu.edu
Handouts for ECE 874 Sp 2007
Why Bayesian?
• If all our research (in PR) were to disappear and you could only save one theory, which one would you save?
• Bayesian theory is probably the most important one you should keep.
• It's simple, intuitive, and optimal.
• Reverend Bayes (1763) and Laplace (1812) set the foundations of what we now know as Bayes theory.
State of nature: class
• A sample (usually) corresponds to a state of nature; e.g., salmon and sea bass.
• The state of nature usually corresponds to a set of discrete categories (classes). Note that the continuous case also exists.
• Priors P(w_i): some classes might occur more often or might be more "important."
Decision rule
• We need a decision rule to help us determine to which class a testing vector belongs.
• Simplest (useless): C = argmax_i P(w_i).
• Posterior probability: p(w_i | x), where x is the (observed) data.
• Obviously, we do not have p(w_i | x).
• But we can estimate p(x | w_i) & P(w_i).
Bayes Theorem (yes, the famous one)
• Bayes decision rule:
    p(w_i | x) = p(x | w_i) P(w_i) / p(x),
  where
    p(x) = sum_{j=1}^{c} p(x | w_j) P(w_j).
• Decide argmax_i p(w_i | x).
• This guarantees 0 <= p(w_i | x) <= 1.
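The rule on this slide is direct to compute. A minimal sketch, assuming made-up toy likelihood and prior values (the function name `posteriors` is ours, not from the handout):

```python
import numpy as np

def posteriors(likelihoods, priors):
    """P(w_i | x) from class-conditional likelihoods p(x | w_i)
    and priors P(w_i), via Bayes theorem."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors   # p(x | w_i) P(w_i)
    evidence = joint.sum()         # p(x) = sum_j p(x | w_j) P(w_j)
    return joint / evidence

# Two classes: likelihoods 0.3 and 0.1 at the observed x, equal priors.
post = posteriors([0.3, 0.1], [0.5, 0.5])
```

The posteriors sum to 1, and argmax over them is the Bayes decision.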
Rev. Thomas Bayes (1702-1761)
• During his lifetime, Bayes was a defender of Isaac Newton's calculus, and developed several important results, of which the Bayes Theorem is his best known and, arguably, most elegant. This theorem and the subsequent development of Bayesian theory are among the most relevant topics in pattern recognition and have found applications in almost every corner of the scientific world. Bayes himself did not, however, provide the derivation of the Bayes Theorem as it is now known to us. Bayes developed the method for uniform priors. This result was later extended by Laplace and contemporaries. Nonetheless, Bayes is generally acknowledged as the first to have established a mathematical basis for probability inference.
Multiple random variables
• To be mathematically precise, one should write p_X(x | w_i) instead of p(x | w_i), because this probability density function depends on a single random variable X.
• In general there is no need for the distinction (e.g., p_X & p_Y). Should this arise, we will use the above notation.
(see Appendix A.4)
Loss function & decision risk
• States exactly how costly each action is, and is used to convert a probability determination into a decision.
• Classes: {w_1, ..., w_c}; actions: {alpha_1, ..., alpha_a}.
• Loss function lambda(alpha_i | w_j): the cost (risk) of taking action alpha_i when the state of nature is w_j.
• Conditional risk:
    R(alpha_i | x) = sum_{j=1}^{c} lambda(alpha_i | w_j) p(w_j | x).
Bayes decision rule
• Conditional risk:
    R(alpha_i | x) = sum_{j=1}^{c} lambda(alpha_i | w_j) p(w_j | x).
• Bayes decision rule (Bayes risk): argmin_i R(alpha_i | x).
• The resulting minimum overall risk is called the Bayes risk.
A simple example
• Two-class classifier. Notation: lambda_ij = lambda(alpha_i | w_j).
    R(alpha_1 | x) = lambda_11 p(w_1 | x) + lambda_12 p(w_2 | x)
    R(alpha_2 | x) = lambda_21 p(w_1 | x) + lambda_22 p(w_2 | x)
• Decision rule: alpha_1 (or w_1) if R(alpha_1 | x) < R(alpha_2 | x),
  i.e., (lambda_21 - lambda_11) p(w_1 | x) > (lambda_12 - lambda_22) p(w_2 | x).
• Applying Bayes:
    p(x | w_1) / p(x | w_2) > [(lambda_12 - lambda_22) / (lambda_21 - lambda_11)] * [P(w_2) / P(w_1)]   (threshold)
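The two-class threshold test can be sketched in a few lines. The loss-matrix layout `lam[i][j] = lambda(alpha_i | w_j)` and the toy numbers below are our own illustration, not from the handout:

```python
def decide_w1(p_x_w1, p_x_w2, P1, P2, lam):
    """Decide w1 iff the likelihood ratio p(x|w1)/p(x|w2) exceeds the
    threshold ((lam12 - lam22) / (lam21 - lam11)) * P(w2)/P(w1)."""
    threshold = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * (P2 / P1)
    return (p_x_w1 / p_x_w2) > threshold

# Zero-one loss: lam11 = lam22 = 0, lam12 = lam21 = 1 -> threshold P(w2)/P(w1).
zero_one = [[0.0, 1.0], [1.0, 0.0]]
d = decide_w1(0.3, 0.1, 0.5, 0.5, zero_one)  # likelihood ratio 3 > threshold 1
```

With zero-one loss and equal priors the rule reduces to picking the larger likelihood.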
Feature Space: Geometry
• When x in R^d, we have our d-dimensional feature space.
• Sometimes, this feature space is considered to be a Euclidean space; but, as we'll see, many other alternatives exist.
• This allows for the study of PR problems from a geometric point of view. This is key to many algorithms.
Discriminant functions
• We can construct a set of discriminant functions g_i(x), i = 1, ..., c.
• We classify a feature vector as w_i if: g_i(x) > g_j(x), for all j != i.
• The Bayes classifier is: g_i(x) = -R(alpha_i | x).
• If errors are to be minimized, one needs to minimize the probability of error:
    lambda(alpha_i | w_j) = 0 if i = j; 1 if i != j
  (minimum-error-rate, zero-one loss)
• If we use Bayes & minimum-error-rate classification, we get:
    g_i(x) = p(w_i | x) = p(x | w_i) P(w_i) / sum_{j=1}^{c} p(x | w_j) P(w_j)
  (the denominator is constant across classes).
• Sometimes we find it more convenient to write this equation as:
    g_i(x) = ln p(x | w_i) + ln P(w_i).
• Geometry (key point): the goal is to divide the feature space into c decision regions, R_1, ..., R_c.
• Classification is also known as hypothesis testing.
Key: the effect of any decision rule is to divide the feature space into decision regions.
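The log form g_i(x) = ln p(x | w_i) + ln P(w_i) can be sketched for univariate normal class conditionals; the class means, sigmas, and priors below are invented toy values:

```python
import numpy as np

def log_discriminants(x, means, sigmas, priors):
    """g_i(x) = ln p(x | w_i) + ln P(w_i) for univariate normal classes."""
    means, sigmas, priors = map(np.asarray, (means, sigmas, priors))
    # ln of the normal density N(mu_i, sigma_i^2) evaluated at x.
    log_lik = (-0.5 * np.log(2 * np.pi * sigmas ** 2)
               - (x - means) ** 2 / (2 * sigmas ** 2))
    return log_lik + np.log(priors)

g = log_discriminants(0.2, means=[0.0, 1.0], sigmas=[1.0, 1.0],
                      priors=[0.5, 0.5])
label = int(np.argmax(g))  # x = 0.2 is closer to the class-0 mean
```

Because ln is monotonic, argmax over these g_i gives the same decision as argmax over the posteriors.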
“Symbolic” representation
Other criteria
• In some applications the priors are not
known.
• In this case, we usually attempt to minimize
the worst overall risk.
• Two approaches for that are: the minimax
and the Neyman-Pearson criteria.
Normal Distributions & Bayes
• So far we have used p(x | w_i) and P(w_i) to specify the decision boundaries of a Bayes classifier.
• The Normal distribution is the most typical PDF for p().
• Recall the central limit theorem.
Central Limit theorem (simplified)
Assume that the random variables X_1, ..., X_n are iid, each with finite mean and variance. When n -> infinity, the standardized random variable converges to a normal distribution.
(see Stark & Woods pp. 225-230)
Univariate case
• The Gaussian distribution is:
    p(x) = (1 / (sqrt(2 pi) sigma)) exp( -(1/2) ((x - mu)/sigma)^2 ),   x ~ N(mu, sigma^2)
• Mean: E(x) = mu = integral x p(x) dx
• Variance: E((x - mu)^2) = sigma^2 = integral (x - mu)^2 p(x) dx
Multivariate case (d>1)
    p(x) = (1 / ((2 pi)^{d/2} |Sigma|^{1/2})) exp( -(1/2) (x - mu)^T Sigma^{-1} (x - mu) ),   x ~ N(mu, Sigma)
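A minimal sketch of evaluating this density (an explicit matrix inverse is used for clarity; a careful implementation would prefer a Cholesky solve):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate the d-dimensional normal density N(mu, Sigma) at x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = mu.size
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff   # (x - mu)^T Sigma^-1 (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * maha) / norm

p = mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))  # peak of a standard 2D normal
```

At the mean of a standard 2D normal the density is 1/(2 pi).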
Distances
• The general distance in a space is given by:
    d^2 = (x - mu)^T Sigma^{-1} (x - mu)
  where Sigma is the covariance matrix of the distribution (or data).
• If Sigma = I, then the above equation becomes the Euclidean (norm-2) distance.
• If the distribution is Normal, this distance is called the Mahalanobis distance.
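A short sketch of this distance; the example vectors and covariance matrices are toy values of ours:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """d^2 = (x - mu)^T Sigma^-1 (x - mu)."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(diff @ np.linalg.inv(Sigma) @ diff)

x, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
d2_euclid = mahalanobis_sq(x, mu, np.eye(2))            # Sigma = I: squared norm-2
d2_scaled = mahalanobis_sq(x, mu, np.diag([1.0, 4.0]))  # high-variance axis counts less
```

With Sigma = I this is the squared Euclidean distance; a larger variance along an axis shrinks that axis's contribution.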
Example (2D Normals)
[Figure: heteroscedastic vs. homoscedastic 2D Normal densities]
Moments of the estimates
• In statistics the estimates are generally known as the moments of the data.
• The first moment is the sample mean.
• The second, the sample autocorrelation matrix:
    S = (1/n) sum_{i=1}^{n} x_i x_i^T.
Central moments
• The variance and the covariance matrix are special cases, because they depend on the mean of the data, which is unknown.
• Usually we solve that by using the sample mean:
    Sigma_hat = (1/n) sum_{i=1}^{n} (x_i - mu_hat)(x_i - mu_hat)^T.
• This is the sample covariance matrix.
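These sample moments take one line each with data stored as rows; the synthetic data below is our own example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))   # n samples as rows, d = 2

mu_hat = X.mean(axis=0)                              # sample mean (1st moment)
S = (X.T @ X) / len(X)                               # (1/n) sum x_i x_i^T
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # sample covariance
```

The two second moments are linked exactly by S = Sigma_hat + mu_hat mu_hat^T, which is a useful sanity check.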
Whitening Transformation
• Recall, it is sometimes convenient to represent the data in a space where its sample covariance matrix equals the identity matrix, I:
    A^T Sigma A = I.
Linear transformations
• An n-dimensional vector X can be transformed linearly to another, Y, as: Y = A^T X.
• The mean is then: M_Y = E(Y) = A^T M_X.
• The covariance: Sigma_Y = A^T Sigma_X A.
• The order of the distances in the transformed space is identical to the one in the original space.
Orthonormal transformation
• Eigenanalysis: Sigma Phi = Phi Lambda
• Eigenvectors: Phi = [Phi_1, ..., Phi_p]
• Eigenvalues: Lambda = diag(lambda_1, ..., lambda_p)
• The transformation is then: Y = Phi^T X, and
    Sigma_Y = Phi^T Sigma_X Phi = Lambda
  (recall, Phi^T Phi = I & Phi^T = Phi^{-1}).
Whitening
• To obtain a covariance matrix equal to the identity matrix, we can apply the orthonormal transformation first and then normalize the result with Lambda^{-1/2}:
    Y = Lambda^{-1/2} Phi^T X = A^T X
    Sigma_Y = Lambda^{-1/2} Phi^T Sigma_X Phi Lambda^{-1/2} = I.
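The whitening recipe maps directly onto an eigendecomposition. A sketch on synthetic correlated data (the mixing matrix is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2D data: rows are samples x^T, mixed by an arbitrary matrix.
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 0.0], [1.0, 1.0]])

Sigma = np.cov(X.T, bias=True)
lam, Phi = np.linalg.eigh(Sigma)      # Sigma Phi = Phi Lambda
A = Phi @ np.diag(lam ** -0.5)        # A = Phi Lambda^{-1/2}
Y = X @ A                             # y = A^T x, applied row-wise

Sigma_Y = np.cov(Y.T, bias=True)      # should equal the identity
```

Since the empirical covariance transforms as A^T Sigma A, the whitened covariance is the identity up to floating-point error.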
Properties
• Whitening transformations are not orthogonal transformations.
• Therefore, Euclidean distances are not preserved.
• After whitening, the covariance matrix is invariant to any orthogonal transformation Psi:
    Psi^T I Psi = Psi^T Psi = I.
Simultaneous diagonalization
• It is usually the case that two or more covariance matrices need to be diagonal.
• Assume Sigma_1 and Sigma_2 are two covariance matrices.
• Our goal is to have: A^T Sigma_1 A = I and A^T Sigma_2 A = Lambda_2.
• Homework: find the algorithm.
Some advantages
• Algorithms usually become much simpler after diagonalization or whitening.
• The general distance becomes a simple Euclidean distance.
• Whitened data is invariant to other orthogonal transformations.
• Some algorithms require whitening to have certain properties (we'll see this later in the course).
Discriminant Functions for Normal PDFs
• The discriminant function, g_i(x) = ln p(x | w_i) + ln P(w_i), for the Normal density N(mu_i, Sigma_i), is:
    g_i(x) = -(1/2) (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - (d/2) ln 2 pi - (1/2) ln |Sigma_i| + ln P(w_i).
• Possible scenarios (or assumptions):
  - Sometimes, we might be able to assume Sigma_i = sigma^2 I.
  - A more general case is when all covariance matrices are identical, Sigma_i = Sigma; i.e., homoscedastic.
  - The most complex case is when Sigma_i is arbitrary; that is, heteroscedastic.
• Case Sigma_i = sigma^2 I:
    g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i).
  With the same priors, the Bayes bound is a (d-1)-dimensional hyperplane perpendicular to the line that passes through both means.
• Homoscedastic (Sigma_i = Sigma):
    g_i(x) = -(1/2) (x - mu_i)^T Sigma^{-1} (x - mu_i) + ln P(w_i)   (Mahalanobis).
Heteroscedastic (Sigma_i arbitrary)
• In the 2-class case, the decision surface is a hyperquadric (e.g., hyperplanes, hyperspheres, hyperhyperboloids, etc.).
• These decision boundaries may not be connected.
• Any hyperquadric can be given (represented) by two Gaussian distributions.
Project #1
1. Implement these three cases using Matlab (see pp. 36-41 for details). 2D and/or 3D plots.
2. Generalize the algorithm to more than two classes (Gaussians).
3. Simulate different Gaussians and distinct priors.
Bayes Is Optimal
• If our goal is to minimize the classification error, then Bayes is optimal (you cannot do better than Bayes – ever).
• In general, if p(x | w_1) P(w_1) > p(x | w_2) P(w_2), it is preferable to classify x in w_1 so that the smallest integral contributes to the error (see next slide) => This is what Bayes does.
• There is no possible smaller error:
    P(error) = P(x in R_2, w_1) + P(x in R_1, w_2)
             = integral_{R_2} p(x | w_1) P(w_1) dx + integral_{R_1} p(x | w_2) P(w_2) dx.
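This error integral can be checked numerically for two univariate normal classes; the means, sigma, grid, and equal priors below are toy choices of ours:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
j1 = normal_pdf(x, -1.0, 1.0) * 0.5   # p(x | w1) P(w1)
j2 = normal_pdf(x, +1.0, 1.0) * 0.5   # p(x | w2) P(w2)

# The Bayes error integrates the smaller joint density at each x:
# whatever mass the winning class does not claim becomes error.
bayes_error = float(np.sum(np.minimum(j1, j2)) * dx)
```

For these parameters the decision boundary is x = 0 and the exact error is Phi(-1), roughly 0.159, which the numeric estimate reproduces.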
The multiclass case:
    P(correct) = sum_{i=1}^{C} P(x in R_i, w_i) = sum_{i=1}^{C} integral_{R_i} p(x | w_i) P(w_i) dx.
• Bayes yields the smallest error. But which is the actual error? The above equation cannot be readily computed, because the regions R_i may be very complex.
Error Bounds: How to calculate the error?
• Several approximations are easier to compute (usually upper bounds):
  - Chernoff bound.
  - Bhattacharyya bound (assumes the pdfs are homoscedastic).
• These bounds can only be applied to the 2-class case.
Chernoff Bound
• For this, we need an integral eq. that we can solve. For example,
    P(error) = integral P(error | x) p(x) dx,
  where
    P(error | x) = P(w_1 | x), if we decide w_2;
                   P(w_2 | x), if we decide w_1.
• Or, we can also write:
    P(error) = integral min[ p(x | w_1) P(w_1), p(x | w_2) P(w_2) ] dx.
• Since it is known that
    min(a, b) <= a^s b^{1-s},   0 <= s <= 1,
  we can now write:
    P(error) <= P^s(w_1) P^{1-s}(w_2) integral p^s(x | w_1) p^{1-s}(x | w_2) dx.
• If the conditional probabilities are normal, we can solve this analytically:
    integral p^s(x | w_1) p^{1-s}(x | w_2) dx = e^{-k(s)},
  where
    k(s) = [s(1-s)/2] (mu_2 - mu_1)^T [s Sigma_1 + (1-s) Sigma_2]^{-1} (mu_2 - mu_1)
           + (1/2) ln( |s Sigma_1 + (1-s) Sigma_2| / (|Sigma_1|^s |Sigma_2|^{1-s}) ).
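The k(s) expression is easy to evaluate; a sketch with toy 1D parameters of our own (the function name `chernoff_k` is ours):

```python
import numpy as np

def chernoff_k(s, mu1, mu2, S1, S2):
    """k(s) for normal class conditionals, per the formula above."""
    mu1 = np.atleast_1d(np.asarray(mu1, float))
    mu2 = np.atleast_1d(np.asarray(mu2, float))
    S1, S2 = np.atleast_2d(S1), np.atleast_2d(S2)
    Sm = s * S1 + (1 - s) * S2
    diff = mu2 - mu1
    quad = s * (1 - s) / 2 * diff @ np.linalg.inv(Sm) @ diff
    logdet = 0.5 * np.log(np.linalg.det(Sm)
                          / (np.linalg.det(S1) ** s * np.linalg.det(S2) ** (1 - s)))
    return float(quad + logdet)

# Equal covariances, means 0 and 2, equal priors, s = 1/2 (Bhattacharyya case).
k = chernoff_k(0.5, [0.0], [2.0], [[1.0]], [[1.0]])
bound = np.sqrt(0.5 * 0.5) * np.exp(-k)   # P(error) <= P(w1)^s P(w2)^{1-s} e^{-k(s)}
```

Here the log-determinant term vanishes, k(1/2) = 0.5, and the bound is about 0.30, comfortably above the true error of roughly 0.16 for these parameters.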
Bhattacharyya Bound
• When the data is homoscedastic, Sigma_1 = Sigma_2, the optimal solution is s = 1/2.
• This is the Bhattacharyya bound.
• A tighter bound is the asymptotic nearest neighbor error, which is derived from:
    p(error) <= integral [ 2 p(x | w_1) P(w_1) p(x | w_2) P(w_2) / p(x) ] dx
             <= integral sqrt( p(x | w_1) P(w_1) p(x | w_2) P(w_2) ) dx.
Closing Notes
• Bayes is important because it minimizes the probability of error. In that sense we say it's optimal.
• Unfortunately, Bayes assumes that the conditional densities and priors are known (or can be estimated), which is not necessarily true.
• In general, not even the form of these probabilities is known.
• Most PR approaches attempt to solve these shortcomings. This is, in fact, what most of PR is all about.
On the + side: a simple example
• We want to predict whether a student will pass or not a test.
• Y = 1 denotes pass; Y = 0, failure.
• The observation is a single random variable X which specifies the hours of study.
• Let P(Y = 1 | X = x) = x / (c + x). Then:
    g(x) = 1 if P(Y = 1 | X = x) >= 1/2 (i.e., x >= c); 0 otherwise.
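A direct sketch of this toy classifier (the value c = 3 is an arbitrary choice for illustration):

```python
def p_pass(x, c):
    """P(Y=1 | X=x) = x / (c + x), for study hours x >= 0 and c > 0."""
    return x / (c + x)

def g(x, c):
    """Bayes decision: predict pass iff P(Y=1|X=x) >= 1/2, i.e. x >= c."""
    return 1 if p_pass(x, c) >= 0.5 else 0

c = 3.0
decisions = [g(x, c) for x in (1.0, 2.9, 3.0, 10.0)]
```

The decision flips exactly at x = c, matching the x >= c form of the rule.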
Optional homework
• Using Matlab generate n observations of P(Y=1|X=x) and P(Y=0|X=x).
• Approximate each using a Gaussian distribution.
• Calculate the Bayes decision boundary and classification error.
• Plot the original distribution to help you.
• Select several arbitrary values for c and see how well you can approximate them.
Hints
• Error = min [P(Y=1|X=x), P(Y=0|X=x)].
