Lecture 3

Review the fundamentals of probability
• Definition of probability
• Rules of probability
• Bayes’ rule
Pdfs of real-valued quantities
Pdf characterisations
• Expectations, Covariance matrices
Gaussian distributions
When are probabilities used
Probabilities are used to describe two quantities
• Frequencies of outcomes in random experiments.
– The probability of a coin toss landing as tails is 1/2. Then if the coin is tossed n times and k_n “tails” are observed, it is expected that k_n / n → 1/2 as n → ∞.
• Degree of belief in propositions not involving random variables.
– the probability that Mr S. was the murderer of Mrs S. given the
evidence
– the probability that this image contains a car given a calculated
feature vector.
Defining probability
Define a probabilistic ensemble with a triple (x, A_X, P_X), where
x is the outcome of a random variable, X, and takes on one of a set
of possible values, A_X = (a_1, a_2, . . . , a_I), having probabilities
P_X = (p_1, p_2, . . . , p_I) with P(X = a_i) = p_i.
The following must be satisfied:
• p_i ≥ 0 for i = 1, . . . , I
• Σ_{x ∈ A_X} P(X = x) = 1.
A simple example
Let x be the outcome of throwing an unbiased die, then
A_X = {‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘, ‘6‘}
P_X = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
Question:
P(x = ‘3‘) = ?
P(x = ‘5‘) = ?
Definitions of probability
Probability of a subset: If V ⊂ A_X, then
P(V) = P(x ∈ V) = Σ_{x ∈ V} P(x)
Example:
Going back to our die example, let V = {‘2‘, ‘3‘, ‘4‘}, then
P(V) = P(x = ‘2‘) + P(x = ‘3‘) + P(x = ‘4‘) = 1/6 + 1/6 + 1/6 = 1/2
The simple example
Throwing an unbiased die
A_X = {‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘, ‘6‘}
P_X = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
Question:
If V = {‘2‘, ‘3‘}, what is P(V )?
Definitions of probability
Joint probability: X × Y is an ensemble in which an outcome is
an ordered pair (x, y) with x ∈ A_X = {a_1, . . . , a_I} and
y ∈ B_Y = {b_1, . . . , b_J}. Then P(x, y) is the joint probability of x and y.
Example:
Remember the outcome of throwing an unbiased die is described with
A_X = {‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘, ‘6‘}   (possible outcomes)
P_X = {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}   (probability of each outcome)
Definitions of probability
Example ctd:
The outcome of two consecutive independent throws, T_1 and T_2, of an unbiased die:
Throw 1: A_{T_1} = {‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘, ‘6‘},  P_{T_1} = {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}
Throw 2: A_{T_2} = {‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘, ‘6‘},  P_{T_2} = {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}
Possible outcomes:
A_{T_1 × T_2} = {(‘1‘, ‘1‘), (‘1‘, ‘2‘), (‘1‘, ‘3‘), (‘1‘, ‘4‘), (‘1‘, ‘5‘), (‘1‘, ‘6‘),
                 (‘2‘, ‘1‘), (‘2‘, ‘2‘), (‘2‘, ‘3‘), (‘2‘, ‘4‘), (‘2‘, ‘5‘), (‘2‘, ‘6‘),
                 . . .
                 (‘6‘, ‘1‘), (‘6‘, ‘2‘), (‘6‘, ‘3‘), (‘6‘, ‘4‘), (‘6‘, ‘5‘), (‘6‘, ‘6‘)}
Probabilities: P_{T_1 × T_2} = {1/36, 1/36, · · · , 1/36}
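A small Python sketch (an illustration, not part of the original slides) that enumerates this joint ensemble, checks that the 36 probabilities sum to one, and looks up one joint probability:

from itertools import product
from fractions import Fraction

faces = ['1', '2', '3', '4', '5', '6']
P_T1 = {f: Fraction(1, 6) for f in faces}
P_T2 = {f: Fraction(1, 6) for f in faces}

# Independence: the joint probability is the product of the marginals.
P_joint = {(t1, t2): P_T1[t1] * P_T2[t2] for t1, t2 in product(faces, faces)}

print(len(P_joint))            # 36 outcomes
print(sum(P_joint.values()))   # 1
print(P_joint[('1', '3')])     # 1/36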
Another example
Scenario:
A person throws an unbiased die. If the outcome is even, throw this
die again, otherwise throw a die biased towards ‘3‘ with
P_X = (1/10, 1/10, 1/2, 1/10, 1/10, 1/10)
Questions:
What is the set, A_{T_1 × T_2}, of possible outcomes?
What are the values in P_{T_1 × T_2}?
Definitions of probability
Marginal probability:
P(x = a_i) ≡ Σ_{y ∈ B_Y} P(x = a_i, y)
Similarly:
P(y = b_j) ≡ Σ_{x ∈ A_X} P(x, y = b_j)
Example:
Returning to the example modelling the output of two consecutive independent throws
of an unbiased die, then
P(t_1 = ‘1‘) = Σ_{i=1}^{6} P(t_1 = ‘1‘, t_2 = ‘i‘) = Σ_{i=1}^{6} 1/36 = 1/6
Example
Scenario:
A person throws an unbiased die. If the outcome is even, throw this
die again, otherwise throw a die biased towards ‘3‘ with
P_X = (1/10, 1/10, 1/2, 1/10, 1/10, 1/10)
Question:
Given P(t_1, t_2) (i.e. P_{T_1 × T_2}) and the definition of marginal probability,
calculate P(t_2), the probability of the output of the second die in this
scenario.
Definitions of probability
Conditional probability:
P(X = a_i | Y = b_j) = P(X = a_i, Y = b_j) / P(Y = b_j),  if P(Y = b_j) ≠ 0
Example:
Returning to the example modelling the output of two consecutive
independent throws of an unbiased die, then
P(t_2 = ‘3‘ | t_1 = ‘1‘) = P(t_1 = ‘1‘, t_2 = ‘3‘) / P(t_1 = ‘1‘) = (1/36) / (1/6) = 1/6
Example
Scenario:
A person throws an unbiased die. If the outcome is even, throw this
die again, otherwise throw a die biased towards ‘3‘ with
P_X = (1/10, 1/10, 1/2, 1/10, 1/10, 1/10)
Question:
Calculate P(t_2 = ‘3‘ | t_1 = ‘1‘) and P(t_2 = ‘3‘ | t_1 = ‘2‘).
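One way to check your answers numerically is sketched below (illustrative only; the dictionary and function names are my own). It builds the joint via the product rule P(t_1, t_2) = P(t_2 | t_1) P(t_1), then computes the marginal of the second throw and the two requested conditionals:

from fractions import Fraction

faces = ['1', '2', '3', '4', '5', '6']
fair   = {f: Fraction(1, 6) for f in faces}
biased = dict(zip(faces, [Fraction(1, 10), Fraction(1, 10), Fraction(1, 2),
                          Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]))

# The second die depends on the first: fair after an even throw, biased otherwise.
def second_die(t1):
    return fair if int(t1) % 2 == 0 else biased

# Product rule: P(t1, t2) = P(t2 | t1) P(t1)
P_joint = {(t1, t2): fair[t1] * second_die(t1)[t2]
           for t1 in faces for t2 in faces}

# Marginals by summing the joint
P_t1 = {t1: sum(P_joint[(t1, t2)] for t2 in faces) for t1 in faces}
P_t2 = {t2: sum(P_joint[(t1, t2)] for t1 in faces) for t2 in faces}

print(P_t2['3'])                         # marginal probability of a '3' on the second throw
print(P_joint[('1', '3')] / P_t1['1'])   # P(t2='3' | t1='1'): odd first throw -> biased die
print(P_joint[('2', '3')] / P_t1['2'])   # P(t2='3' | t1='2'): even first throw -> fair die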
Rules of probability
Product Rule: from the definition of the conditional probability
P(x, y) = P(x | y) P(y) = P(y | x) P(x)
Sum/Chain Rule: rewriting the marginal probability definition
P(x) = Σ_y P(x, y) = Σ_y P(x | y) P(y)
Bayes’ Rule: from the product rule
P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_{y′} P(x | y′) P(y′)
Independence
Independence: Two random variables X, Y are independent if
P(x, y) = P(x) P(y)  ∀x ∈ A_X, ∀y ∈ B_Y.
This implies that
P(x | y) = P(x)  ∀x ∈ A_X, ∀y ∈ B_Y
and
P(y | x) = P(y)  ∀x ∈ A_X, ∀y ∈ B_Y.
“X and Y are independent” is often denoted by X ⊥⊥ Y.
An Example
Problem: Jo has the test for a nasty disease. Let a denote the state
of Jo’s health and b the test results.
a = 1 if Jo has the disease, 0 if Jo does not have the disease
b = 1 if the test is positive, 0 if the test is negative.
The test is 95% reliable, that is
p(b = 1 | a = 1) = .95 p(b = 1 | a = 0) = .05
p(b = 0 | a = 1) = .05 p(b = 0 | a = 0) = .95
The final piece of background information is that 1% of people Jo’s
age and background have the disease.
Jo has the test and the result is positive.
What is the probability Jo has the disease?
Solution: The background information tells us
P(a = 1) = .01, P(a = 0) = .99
Jo would like to know how plausible it is that she has the disease.
This involves calculating P(a = 1 | b = 1) which is the probability of
Jo having the disease given a positive test result.
Applying Bayes’ Rule:
P(a = 1 | b = 1) = P(b = 1 | a = 1) P(a = 1) / P(b = 1)
                 = P(b = 1 | a = 1) P(a = 1) / [P(b = 1 | a = 1) P(a = 1) + P(b = 1 | a = 0) P(a = 0)]
                 = (.95 × .01) / (.95 × .01 + .05 × .99)
                 ≈ .16
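The same calculation as a short Python sketch (purely illustrative; the variable names are mine):

p_a1 = 0.01                 # prior: P(a = 1)
p_b1_given_a1 = 0.95        # test sensitivity: P(b = 1 | a = 1)
p_b1_given_a0 = 0.05        # false-positive rate: P(b = 1 | a = 0)

# Evidence P(b = 1) via the sum rule
p_b1 = p_b1_given_a1 * p_a1 + p_b1_given_a0 * (1 - p_a1)

# Bayes' rule: P(a = 1 | b = 1)
posterior = p_b1_given_a1 * p_a1 / p_b1
print(round(posterior, 3))   # about 0.161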
Your turn
Scenario: Your friend has two envelopes. One, which he calls the Win
envelope, contains 100 dollars and four beads (2 red and 2 blue);
the other, the Lose envelope, contains three beads (1 red and
2 blue) and no money. You choose one of the envelopes at random
and then your friend offers to sell it to you.
Question:
• How much should you pay for the envelope?
• Suppose before deciding you are allowed to draw one bead from
the envelope.
If this bead is blue how much should you pay?
Inference is important
Inference is the term given to the conclusions reached on the basis
of evidence and reasoning.
Most of this course will be devoted to inference of some form.
Some examples:
I’ve got this evidence. What’s the chance that this conclusion is
true?
• I’ve got a sore neck: how likely am I to have meningitis?
• My car detector has fired in this image: how likely is it there is a
car in the image?
Inference using Bayes’ rule
In general:
If θ denotes the unknown parameters/decision, D the data and H
denotes the overall hypothesis space, then
p(θ | D, H) = P(D | θ, H) P(θ | H) / P(D | H)
is written as
posterior = (likelihood × prior) / evidence
Bayesian classification
Bayes’ Rule can be expressed as
P(ω_j | x) = P(x | ω_j) P(ω_j) / Σ_{k=1}^{N} P(x | ω_k) P(ω_k) = P(x | ω_j) P(ω_j) / P(x)
where ω_j is the jth class and x is the feature vector.
A typical decision rule (class assignment)
Choose the class ω_i with the highest P(ω_i | x). Intuitively, we will
choose the class that is most likely given the feature vector x.
Terminology: Each term in Bayes’ Rule has a special name:
P(ω_i) — Prior probability of class ω_i
P(ω_i | x) — Posterior probability of class ω_i given the observation x
P(x | ω_i) — Likelihood of observation x given class ω_i
P(x) — Evidence, the normalization constant
Bayes Classifier in a nutshell
1. Learn the class conditional distributions for each class ω_j.
2. This gives P(x | ω_j).
3. Estimate the prior P(ω_j) of each class.
4. For a new data point x make a prediction with:
   ω* = arg max_{ω_j} P(x | ω_j) P(ω_j)
Step one is known as density estimation. This will be the topic of
several future lectures. We will also be examining the strengths and
weaknesses of Bayes classifiers.
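A minimal sketch of these four steps, assuming (for illustration only) that each class-conditional density is modelled by a 1-D Gaussian fitted to some made-up training data; the data, variable names and the use of scipy.stats.norm are my own choices, not part of the course material:

import numpy as np
from scipy.stats import norm

# Hypothetical 1-D training data for two classes.
rng = np.random.default_rng(0)
x_class0 = rng.normal(0.0, 1.0, size=200)
x_class1 = rng.normal(3.0, 1.5, size=300)

# Steps 1-2: density estimation -> class conditionals P(x | w_j), here 1-D Gaussians.
params = {0: (x_class0.mean(), x_class0.std()),
          1: (x_class1.mean(), x_class1.std())}

# Step 3: priors estimated from class frequencies.
n = len(x_class0) + len(x_class1)
priors = {0: len(x_class0) / n, 1: len(x_class1) / n}

# Step 4: predict the class maximising P(x | w_j) P(w_j).
def predict(x):
    scores = {j: norm.pdf(x, mu, sigma) * priors[j]
              for j, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)

print(predict(0.5), predict(2.5))   # likely class 0 and class 1 respectively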
We don’t live in a purely
discrete world
Continuous random variables
So far we have only encountered discrete random variables. But the
outcome x of a random variable can also be continuous.
In this case A_X is an interval or union of intervals such as
A_X = (−∞, ∞). The notion of probability must also be updated. Now
p(·) denotes the probability density function (pdf). It has the two
properties:
1) p(x) ≥ 0  ∀x ∈ A_X,
2) ∫_{x ∈ A_X} p(x) dx = 1.
(Figure: a pdf p(x) with the interval from a to b marked on the x-axis.)
The probability that a continuous random variable x lies between
values a and b (with b > a) is defined to be
P(a < X ≤ b) = ∫_{x=a}^{b} p(x) dx
Continuous random variables
An example of a continuous probability density function p(·):
(Figure: a density equal to 1 on the interval [0, 1] and 0 elsewhere.)
p(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.
The above is known as the uniform density function.
Continuous random variables
All the previous definitions and rules are the same except that the
summation signs are replaced by integral signs where appropriate.
For example:
Marginal probability
p(x) ≡ ∫_{y ∈ A_Y} p(x, y) dy
m dimensional random variables
We have seen the joint probability density p(x, y) of an ordered pair (X, Y).
Consider also the ordered vector X = (X_1, X_2, . . . , X_m). The
joint probability density for this vector is defined as p(x) or
p(x_1, x_2, . . . , x_m).
Probability of a volume
If R is a volume defined in the space of possible outcomes then the
probability of this volume is
P((X_1, X_2, . . . , X_m) ∈ R) = ∫∫ · · · ∫_{(x_1, x_2, . . . , x_m) ∈ R} p(x_1, x_2, . . . , x_m) dx_1 dx_2 . . . dx_m
Marginal pdf
The marginal pdf is used to represent the pdf of a subset of the x_i’s.
p(x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_m) ≡ ∫_z p(x_1, . . . , x_{i−1}, x_i = z, x_{i+1}, . . . , x_m) dz
PDF description
The probability density function, p_X(·), fully characterizes our
random variable X. The following partially characterize p_X(·).
Expectation: E[X] = ∫_{x ∈ A_X} x p(x) dx = µ
represents the center of mass of the density.
Variance: Var[X] = E[(X − E[X])²] = ∫_{x ∈ A_X} (x − µ)² p(x) dx
represents the spread about the mean.
Std deviation: std[X] = √Var[X], the square root of the variance.
Note: E[f(X)] = ∫_{x ∈ A_X} f(x) p(x) dx
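These integrals can also be approximated from samples. A short sketch (assuming the uniform density on [0, 1] from earlier, whose expectation is 1/2 and whose variance is 1/12; illustrative only):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # samples from the uniform density on [0, 1]

print(x.mean())           # close to the expectation 1/2
print(x.var())            # close to the variance 1/12 ≈ 0.0833
print(np.sqrt(x.var()))   # close to the standard deviation ≈ 0.2887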
Partial description
Mean vector
E[X] = (E[X_1], E[X_2], . . . , E[X_N])^T = (µ_1, . . . , µ_N)^T = µ   (the mean)
Covariance Matrix
Cov[X] = E[(X − µ)(X − µ)^T]
       = [ E[(X_1 − µ_1)(X_1 − µ_1)]   . . .   E[(X_1 − µ_1)(X_N − µ_N)]
           . . .                       . . .   . . .
           E[(X_N − µ_N)(X_1 − µ_1)]   . . .   E[(X_N − µ_N)(X_N − µ_N)] ]
       = [ σ_1²   · · ·   c_1N
           . . .  · · ·   . . .
           c_1N   · · ·   σ_N² ]
       = Σ
Covariance matrix I
The covariance matrix C = {c_jk} indicates the tendency of each
pair of features (dimensions in a random vector) to vary together,
i.e. to co-vary.
The covariance has several important properties:
• If X_i and X_k tend to increase together, then c_ik > 0
• If X_i tends to decrease when X_k increases, then c_ik < 0
• If X_i and X_k are uncorrelated, then c_ik = 0
• |c_ik| ≤ σ_i σ_k, where σ_i is the standard deviation of X_i
• c_ii = σ_i² = Var[X_i].
Covariance terms can be expressed as
c_ii = σ_i²   and   c_ik = ρ_ik σ_i σ_k
where ρ_ik is called the correlation coefficient.
(Figure: scatter plots of (X_i, X_k) for ρ_ik = −1, −1/2, 0, 1/2 and 1.)
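A short sketch that checks these properties on sampled data; np.cov and np.corrcoef return the sample covariance and correlation matrices respectively (the particular numbers below are only an illustration):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 3.0],
                  [3.0, 9.0]])          # sigma_1^2 = 4, sigma_2^2 = 9, c_12 = 3
X = rng.multivariate_normal(mu, Sigma, size=50_000)

C = np.cov(X, rowvar=False)             # sample covariance matrix
R = np.corrcoef(X, rowvar=False)        # sample correlation matrix
print(C)                                # close to Sigma
print(R[0, 1])                          # close to rho_12 = 3 / (2 * 3) = 0.5
print(abs(C[0, 1]) <= np.sqrt(C[0, 0] * C[1, 1]))   # |c_12| <= sigma_1 sigma_2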
Covariance matrix II
The covariance matrix can be reformulated as
Σ = E[(X − µ)(X − µ)^T] = E[X X^T] − µ µ^T = S − µ µ^T
with
S = E[X X^T] = [ E[X_1 X_1]   . . .   E[X_1 X_N]
                 . . .        . . .   . . .
                 E[X_N X_1]   . . .   E[X_N X_N] ]
S is called the auto-correlation matrix and contains the same
amount of information as the covariance matrix.
The covariance matrix can also be expressed as
Σ = Γ R Γ
  = [ σ_1   0     . . .  0
      0     σ_2   . . .  0
      . . .              . . .
      0     0     . . .  σ_N ]
    [ 1      ρ_12   . . .  ρ_1N
      ρ_12   1      . . .  ρ_2N
      . . .                . . .
      ρ_1N   ρ_2N   . . .  1 ]
    [ σ_1   0     . . .  0
      0     σ_2   . . .  0
      . . .              . . .
      0     0     . . .  σ_N ]
• A convenient formulation since Γ contains the scales of
the features and R retains the essential information of the
relationship between the features.
• R is the correlation matrix.
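A minimal numerical check of the Σ = Γ R Γ factorisation (the matrix below is an arbitrary example of my own, not from the course):

import numpy as np

Sigma = np.array([[4.0, 3.0, 1.0],
                  [3.0, 9.0, 2.0],
                  [1.0, 2.0, 1.0]])

Gamma = np.diag(np.sqrt(np.diag(Sigma)))                    # standard deviations on the diagonal
R = np.linalg.inv(Gamma) @ Sigma @ np.linalg.inv(Gamma)     # correlation matrix

print(R)                                      # unit diagonal, rho_ik off the diagonal
print(np.allclose(Gamma @ R @ Gamma, Sigma))  # True: Sigma = Gamma R Gamma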
Correlation Vs. Independence
• Two random variables X_i and X_k are uncorrelated if
E[X_i X_k] = E[X_i] E[X_k]. Uncorrelated variables are also called
linearly independent.
• Two random variables X_i and X_k are independent if
p(x_i, x_k) = p(x_i) p(x_k)  ∀x_i, x_k
Covariance intuition
(Figure: a scatter of 2-D data with its spread marked along each axis; σ_x = 0.2249, σ_y = 0.2588.)
Covariance intuition
Eigenvectors of Σ are the orthogonal directions where there is the
most spread in the underlying distribution.
The eigenvalues indicate the magnitude of the spread.
The Gaussian Distribution
Some Motivation
Where they’re used
• Modelling the class conditional pdfs. Frequently, a Gaussian
distribution is used directly or as a building block in modelling them.
Why they are so important
• They pop up everywhere.
• Need them to understand the optimal Bayes’ classifier
• Need them to understand neural networks.
• Need them to understand mixture models.
Unit variance Gaussian
The Gaussian distribution with expected value E[X] = 0 and variance
Var [X] = 1 has pdf:
p_X(x) = (1 / √(2π)) exp(−x² / 2)
General 1D Gaussian
The Gaussian distribution with expected value E[X] = µ and variance
Var[X] = σ² has pdf:
p_X(x) = (1 / (σ √(2π))) exp(−(x − µ)² / (2σ²))
Terminology:
Write X ∼ N(µ, σ²) to denote:
X is distributed as a Gaussian with mean µ and variance σ².
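A quick sketch comparing the formula above with sampled data (illustrative only; the parameters are chosen to match the IQ example used later):

import numpy as np

mu, sigma = 100.0, 15.0
x = np.linspace(40, 160, 7)

# The general 1-D Gaussian pdf from above.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Compare with a normalised histogram of samples from N(mu, sigma^2).
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=200_000)
hist, edges = np.histogram(samples, bins=60, range=(40, 160), density=True)

print(pdf.max())    # peak height 1 / (sigma * sqrt(2 pi)) ≈ 0.0266
print(hist.max())   # should be close to the same value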
The error function
If X ∼ N(0, 1) then erf is defined as
erf(x) = P(X < x) = ∫_{z=−∞}^{x} p(z) dz = (1 / √(2π)) ∫_{z=−∞}^{x} exp(−z² / 2) dz
i.e. the cumulative distribution of X.
Note: if X ∼ N(µ, σ²) then P(X < x) = erf((x − µ) / σ).
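Note that what these slides call erf(x) is the standard normal cumulative distribution function, usually written Φ(x); it differs from the erf function of numerical libraries by a change of variable. In SciPy it is available as norm.cdf, so the note above can be checked as follows (a sketch):

from scipy.stats import norm

# P(X < x) for X ~ N(mu, sigma^2), using the standard normal CDF ("erf" in the slides).
mu, sigma, x = 100.0, 15.0, 130.0
print(norm.cdf((x - mu) / sigma))        # P(X < 130) ≈ 0.977
print(norm.cdf(x, loc=mu, scale=sigma))  # same value, letting SciPy do the standardisation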
The Central Limit Theorem
Assume X_1, X_2, . . . , X_N are identically and independently distributed
(i.i.d.) continuous random variables.
Define
z = f(x_1, x_2, . . . , x_N) = (1/N) Σ_{i=1}^{N} x_i
then the distribution of z approaches N(E[X], Var[X] / N) as N, the sample size, increases.
Frequently used as a justification for assuming Gaussian noise.
Illustration
500 experiments were performed using a uniform distribution.
• For N = 1, one sample was drawn from the distribution and
its mean was recorded (for each of the 500 experiments). The
histogram of the result shows a uniform density.
• For N = 4, 4 samples were drawn from the distribution and
the mean of these 4 samples was recorded (for each of the 500
experiments). The histogram starts to show a Gaussian shape.
• Similarly for N = 7
• Similarly for N = 10
As N grows the histogram increasingly resembles a Gaussian.
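This illustration is easy to reproduce. A sketch matching the setup described above (500 experiments, samples from a uniform distribution); histogram counts are printed rather than plotted:

import numpy as np

rng = np.random.default_rng(0)
experiments = 500

for N in (1, 4, 7, 10):
    # For each experiment, draw N uniform samples and record their mean.
    means = rng.uniform(0.0, 1.0, size=(experiments, N)).mean(axis=1)
    counts, _ = np.histogram(means, bins=10, range=(0.0, 1.0))
    print(N, counts)   # roughly flat for N = 1, increasingly bell-shaped as N grows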
Bivariate Gaussian
Write the random variable X = (X, Y)^T. Define X ∼ N(µ, Σ) to mean
p(x) = (1 / (2π |Σ|^{1/2})) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
where
µ = (µ_x, µ_y)^T,   Σ = [ σ_x²   σ_xy
                          σ_xy   σ_y² ]
and Σ must be symmetric and non-negative definite.
It turns out for a Gaussian distribution E[X] = µ and Cov [X] = Σ.
General Gaussian
Have a random variable X = (X_1, X_2, . . . , X_m)^T.
Then define X ∼ N(µ, Σ) as
p_X(x) = (1 / ((2π)^{m/2} |Σ|^{1/2})) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
where
µ = (µ_1, µ_2, . . . , µ_m)^T,   Σ = [ σ_1²   σ_12   · · ·   σ_1m
                                       σ_12   σ_2²   · · ·   σ_2m
                                       . . .                 . . .
                                       σ_1m   σ_2m   · · ·   σ_m² ]
and Σ must be symmetric and non-negative definite.
General Gaussian
µ = (µ_1, µ_2, . . . , µ_m)^T,   Σ = [ σ_1²   σ_12   · · ·   σ_1m
                                       σ_12   σ_2²   · · ·   σ_2m
                                       . . .                 . . .
                                       σ_1m   σ_2m   · · ·   σ_m² ]
(Figure: contours of a 2-D Gaussian with a full covariance matrix; the elliptical contours are tilted relative to the coordinate axes.)
Axis-aligned Gaussian
µ = (µ_1, µ_2, . . . , µ_m)^T,   Σ = [ σ_1²   0      · · ·   0          0
                                       0      σ_2²   · · ·   0          0
                                       . . .                            . . .
                                       0      0      · · ·   σ_{m−1}²   0
                                       0      0      · · ·   0          σ_m² ]
(Figure: contours of a 2-D axis-aligned Gaussian; the axes of the elliptical contours are parallel to the coordinate axes.)
Spherical Gaussian
µ = (µ_1, µ_2, . . . , µ_m)^T,   Σ = [ σ²   0    · · ·   0    0
                                       0    σ²   · · ·   0    0
                                       . . .                  . . .
                                       0    0    · · ·   σ²   0
                                       0    0    · · ·   0    σ² ]
(Figure: contours of a 2-D spherical Gaussian; the contours are circles.)
Degenerate Gaussian
µ = (µ_1, µ_2, . . . , µ_m)^T,   |Σ| = 0
Recap
• Have seen the formulae for Gaussians
• You should have an intuition of how they behave
• Have some confidence in reading a Gaussian’s covariance matrix.
How can we transform a random variable with a non-diagonal
covariance matrix to one with a diagonal covariance matrix?
Eigenvectors and eigenvalues
Given an m × m matrix A.
Definition:
v is an eigenvector of A if there exists a scalar λ (an eigenvalue) such that
A v = λ v
How to compute them:
For an eigenvector–eigenvalue pair of A:
A v = λ v ⇒ A v − λ v = 0 ⇒ (A − λI) v = 0
⇒ either v = 0 (the trivial solution), or (A − λI) is rank deficient.
For the non-trivial solution, (A − λI) being rank deficient implies
det(A − λI) = 0 ⇒ λ^m + a_1 λ^{m−1} + · · · + a_{m−1} λ + a_m = 0   (the characteristic equation)
Solve this characteristic equation to obtain the possible values for λ and, given those,
compute their corresponding eigenvectors.
Some terminology:
The matrix formed by the column eigenvectors of A is called the modal matrix M:
M = [ v_1  v_2  v_3  . . .  v_m ]
Let Λ be the diagonal matrix with A’s eigenvalues on the main diagonal:
Λ = diag(λ_1, λ_2, λ_3, . . . , λ_m)
The matrix Λ is the canonical form of A.
Properties of A and implications for its eigenvalues:
A non-singular ⇒ all eigenvalues are non-zero.
A real and symmetric ⇒ all eigenvalues are real, and eigenvectors
associated with distinct eigenvalues are orthogonal.
A positive definite ⇒ all eigenvalues are positive.
Eigen-decomposition of A
If A has non-degenerate eigenvalues λ_1, λ_2, . . . , λ_m (no two λ_i, λ_j have the
same value) and M is A’s modal matrix, then
A M = A [v_1  v_2  . . .  v_m]
    = [A v_1  A v_2  . . .  A v_m]
    = [λ_1 v_1  λ_2 v_2  . . .  λ_m v_m] = M Λ
=⇒ A = M Λ M^{−1}   ← the eigen-decomposition of A
Consequently:
A² = M Λ² M^{−1},
A^n = M Λ^n M^{−1} for n = 1, 2, 3, . . .
A^{−1} = (M Λ M^{−1})^{−1} = M Λ^{−1} M^{−1}
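A numerical sketch of the eigen-decomposition and its consequences, using NumPy (the matrix is an arbitrary symmetric example of my own):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, M = np.linalg.eig(A)    # columns of M are the eigenvectors v_i
Lam = np.diag(eigvals)           # canonical form: eigenvalues on the diagonal
M_inv = np.linalg.inv(M)

print(np.allclose(A, M @ Lam @ M_inv))                                # A = M Lambda M^-1
print(np.allclose(A @ A, M @ Lam @ Lam @ M_inv))                      # A^2 = M Lambda^2 M^-1
print(np.allclose(np.linalg.inv(A), M @ np.linalg.inv(Lam) @ M_inv))  # A^-1 = M Lambda^-1 M^-1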
Interpretation
View matrix A as a linear transformation:
An eigenvector v represents an invariant direction of the
transformation.
When transformed by A, any point lying on the direction defined by
v will remain on that direction, and its magnitude will be multiplied
by the corresponding eigenvalue λ.
For example, the transformation that rotates 3-D vectors about the z-axis has
v = [0, 0, 1]^T as its only (real) eigenvector, with 1 as the corresponding eigenvalue.
Back to Covariance Matrices
Given Σ the covariance of a Gaussian distribution
Eigenvectors of Σ are the principal directions of the distribution.
Their eigenvalues are the variances in these principal directions.
The linear transformation defined by the eigenvectors of Σ
gives uncorrelated vectors regardless of the form of the distribution.
Let M be the modal matrix of the covariance matrix Σ  =⇒  ΣM = MΛ
Define the transformation: y = M^T x
If x ∼ N(µ, Σ) then y ∼ N(M^T µ, Λ), since Σ is a symmetric matrix with real values.
This in turn means y is a Gaussian random variable with a diagonal covariance
matrix.
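A sketch of this diagonalising transformation applied to sampled data; np.linalg.eigh returns the eigenvectors of a symmetric matrix as the columns of M (the particular µ and Σ are illustrative choices of mine):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 3.0],
                  [3.0, 9.0]])
X = rng.multivariate_normal(mu, Sigma, size=100_000)

lam, M = np.linalg.eigh(Sigma)     # Sigma M = M Lambda, with M orthogonal
Y = X @ M                          # y = M^T x applied to every sample (rows of X)

print(np.cov(Y, rowvar=False))     # close to diag(lambda_1, lambda_2): uncorrelated components
print(Y.mean(axis=0), M.T @ mu)    # sample mean of y is close to M^T mu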
Manipulations of Normally
Distributed random variables
Linear transforms remain Gaussian
Assume X is an m-dimensional Gaussian random variable
X ∼ N(µ, Σ)
Define Y to be a p-dimensional random variable with
Y = A X
where A is a p × m matrix. Then
Y ∼ N(Aµ, AΣA^T)
Gaussian marginals are Gaussian
Let X = (X_1, X_2, . . . , X_m)^T be a multivariate Gaussian random
variable and define subsets of its variables as follows
X = (U, V)^T with U = (X_1, . . . , X_p)^T and V = (X_{p+1}, . . . , X_m)^T
If
(U, V)^T ∼ N( (µ_u, µ_v)^T, [ Σ_uu     Σ_uv
                              Σ_uv^T   Σ_vv ] )
Then U is also distributed as a Gaussian with U ∼ N(µ_u, Σ_uu).
How do we prove this using the result on the previous slide?
Adding 2 independent Gaussians
If X ∼ N(µ_X, Σ_X) and Y ∼ N(µ_Y, Σ_Y) and X ⊥⊥ Y then
X + Y ∼ N(µ_X + µ_Y, Σ_X + Σ_Y)
If X and Y are not independent but uncorrelated the above result
does not hold.
Conditional of a Gaussian
Assume that
(U, V)^T ∼ N( (µ_u, µ_v)^T, [ Σ_uu     Σ_uv
                              Σ_uv^T   Σ_vv ] )
then U | V ∼ N(µ_{u|v}, Σ_{u|v}) where
µ_{u|v} = µ_u + Σ_uv Σ_vv^{−1} (V − µ_v)
Σ_{u|v} = Σ_uu − Σ_uv Σ_vv^{−1} Σ_uv^T
Conditional of a Gaussian
Consider the conditional mean:
µ_{u|v} = µ_u + Σ_uv Σ_vv^{−1} (V − µ_v)
When V = µ_v ⇒ the conditional mean of U is µ_u.
The conditional mean is a linear function of V.
Consider the conditional covariance:
Σ_{u|v} = Σ_uu − Σ_uv Σ_vv^{−1} Σ_uv^T
The conditional variance is ≤ the marginal variance.
The conditional variance is independent of the given value of V.
Gaussians and the chain rule
Let A be a constant matrix. If
U | V ∼ N(A V, Σ_{u|v}) and V ∼ N(µ_v, Σ_vv)
then
(U, V)^T ∼ N(µ, Σ)
with
µ = (A µ_v, µ_v)^T and Σ = [ A Σ_vv A^T + Σ_{u|v}   A Σ_vv
                             (A Σ_vv)^T             Σ_vv ]
Are these manipulations useful?
Bayesian inference
Consider the following problem:
In the world as a whole, IQs are drawn from a Gaussian N(100, 15²).
If you take an IQ test you’ll get a score that, on average (over many
tests), will be your IQ. Because of noise on any one test the score
will often be a few points lower or higher than your true IQ. Thus
assume we have the conditional distribution
Score | IQ ∼ N(IQ, 10²)
If you take the IQ test and get a score of 130, what is the most likely
value of your IQ given this piece of evidence?
IQ Example
Which distribution should we calculate?
How can we get an expression for this distribution from the
distributions of Score | IQ and IQ and the manipulations we have
described?
Plan
This we know:
IQ ∼ N(100, 15²),  Score | IQ ∼ N(IQ, 10²),  Score = 130
We want to find the distribution p(IQ | Score = 130).
Plan:
• Use the chain rule to compute the distribution of (Score, IQ)^T from the
distributions of IQ and Score | IQ
• Swap the order of the random variables to get the distribution of (IQ, Score)^T
• From (IQ, Score)^T compute the conditional distribution to get the
distribution of IQ | (Score = 130)
What is the best estimate for the test taker’s IQ?
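One possible numerical walk-through of this plan (a sketch, using the chain-rule and conditioning results above; everything is scalar here, so the transposes play no role, and the numbers follow the stated model IQ ∼ N(100, 15²), Score | IQ ∼ N(IQ, 10²)):

import numpy as np

# Known model: V = IQ ~ N(100, 15^2), U = Score with U | V ~ N(V, 10^2), i.e. A = 1.
mu_v, var_v = 100.0, 15.0 ** 2
var_u_given_v = 10.0 ** 2
A = 1.0

# Chain rule: joint of (Score, IQ).
var_score = A * var_v * A + var_u_given_v     # 325
cov_score_iq = A * var_v                      # 225
mu_score = A * mu_v                           # 100

# Condition on Score = 130: IQ | Score ~ N(mu_post, var_post).
score = 130.0
mu_post = mu_v + cov_score_iq / var_score * (score - mu_score)
var_post = var_v - cov_score_iq ** 2 / var_score

print(mu_post, np.sqrt(var_post))   # roughly 120.8 and 8.3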
Today’s assignment
Pen & Paper assignment
• Details available on the course website.
• You will be asked to perform some simple Bayesian reasoning.
• Mail me about any errors you spot in the Exercise notes.
• I will notify the class about errors spotted and corrections via the
course website and mailing list.
Consider this..
Mrs S. is found stabbed in the garden of her family home in the
USA. Her husband Mr. S. behaves strangely after her death and is
considered as a suspect.
On investigation the police discover that Mr S. had beaten up his
wife on at least nine previous occasions. The prosecution offers up
this information as evidence in favour of the hypothesis that Mr S is
guilty of the murder.
However, Mr S.’s lawyer disputes this by saying “Statistically, only
one in a thousand wife-beaters actually goes on to murder his wife.
So the wife beating is not strong evidence at all. In fact, given the
wife-beating evidence alone, it’s extremely unlikely that he would
be the murderer of his wife – only a 1/1000 chance. You should
therefore find him innocent.”
The prosecution replies with the following two empirical facts. In
the USA it is estimated that 2 million women are abused each year
by their partners. (Let’s assume there are 100 million adult women in
the USA). In 1994, 4739 women were victims of homicide; of those,
1326 women (28%) were killed by husbands and boyfriends.
Question:
Is the lawyer right to imply that the history of wife-beating does not
point to Mr S.’s being the murderer? Or is the lawyer’s reasoning
flawed? If the latter, what is wrong with his argument?
