Chapter 2: Probability

Notes on MLAPP

Wu Ziqing

School of Computer Science and Engineering


Nanyang Technological University

14/07/2018

Wu Ziqing (NTU) Chapter2: Probability 14/07/2018 1 / 45


Outline

1 Basic Concept of Probability Theory


Definition of Probability
Basic Notations
Discrete and Continuous Variable
Mean and Variance
Fundamental Rules
Bayes Rule
Independence

2 Discrete Distributions
Binomial and Bernoulli Distributions
Multinomial and Multinoulli Distributions
Poisson Distribution
Empirical Distribution

3 Continuous Distributions
Normal Distribution
Degenerate pdf
Laplace Distribution
Gamma Distribution
Beta Distribution
Pareto Distribution

4 Joint Probability Distributions


Covariance and Correlation
Multivariate Gaussian
Multivariate Student t Distribution
Dirichlet Distribution

5 Transformations of Random Variables


Linear Transformation
General Transformation
Multivariate Change of Variables
Central Limit Theorem

6 Monte Carlo approximation

7 Information Theory
Entropy
KL Divergence
Mutual Information



Basic Concept of Probability Theory
Definition of Probability

There are two common interpretations of probability:

Frequentist Interpretation: probability represents the long-run
frequencies of events.
Bayesian Interpretation: probability quantifies our uncertainty about
something happening.
The Bayesian view lets us model uncertainty about events that do not
have long-run frequencies, such as one-off events.


Basic Concept of Probability Theory
Basic Notations

$p(A)$: denotes the probability that event $A$ will happen, with $0 \le p(A) \le 1$.

$p(\bar{A})$: denotes the probability of the event not $A$ (the complement of $A$).
$p(\bar{A}) = 1 - p(A)$.


Basic Concept of Probability Theory
Discrete and Continuous Variable(Discrete)

Discrete Random Variable: a variable that can take on any value from a
finite or countably infinite set $\mathcal{X}$. The notation $p(X = x)$ denotes the
probability of the event $X = x$.
$p(\cdot)$ is called a Probability Mass Function (pmf). It satisfies $0 \le p(x) \le 1$ and
$\sum_{x \in \mathcal{X}} p(x) = 1$.


Basic Concept of Probability Theory
Discrete and Continuous Variable(Continuous)

If the variable is continuous, we define the Cumulative Distribution Function
(cdf) as:
$F(q) = p(X \le q)$
The cdf is always monotonically non-decreasing.
We define the Probability Density Function (pdf) as:
$f(x) = \frac{d}{dx} F(x)$
Thus the probability of a continuous variable lying in a finite interval is:
$P(a < X \le b) = \int_a^b f(x)\,dx$
The probability of a continuous variable lying in a small interval starting at $x$ is:
$P(x \le X \le x + dx) \approx p(x)\,dx$
Note that the density $p(x)$ (the same function as $f(x)$ above) is allowed to take values $> 1$ at a point, so long as the
density integrates to 1.

Basic Concept of Probability Theory
Discrete and Continuous Variable(Continuous)(Cont.)

Quantile: If the cdf $F$ is strictly increasing, it has an inverse $F^{-1}$. We write $F^{-1}(\alpha) = x_\alpha$ for the value such that
$P(X \le x_\alpha) = \alpha$. $x_\alpha$ is called the $\alpha$ quantile of $F$.
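
The quantile definition can be checked numerically. The sketch below assumes a standard normal distribution purely for illustration (the slide does not fix a distribution) and uses `statistics.NormalDist` from the Python standard library; the variable names are hypothetical.

```python
from statistics import NormalDist

# Assumption: a standard normal is used only as an example of a cdf F.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.975
x_alpha = standard_normal.inv_cdf(alpha)  # x_alpha = F^{-1}(alpha)
check = standard_normal.cdf(x_alpha)      # F(x_alpha) should recover alpha

print(x_alpha, check)
```

The value of `x_alpha` here is close to 1.96, the constant that appears later in the 95% interval for Monte Carlo estimates.
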


Basic Concept of Probability Theory
Mean and Variance

Mean/Expected Value: denoted by $\mu$.
For Discrete Variables:
$E[X] = \sum_{x \in \mathcal{X}} x\,p(x)$
For Continuous Variables:
$E[X] = \int_{\mathcal{X}} x\,p(x)\,dx$
Variance: measures the 'spread' of a distribution, denoted by $\sigma^2$.

$Var[X] = E[(X - \mu)^2] = E[X^2] - \mu^2$

Thus, $E[X^2] = \mu^2 + \sigma^2$

Standard Deviation: has the same units as the data.
It is denoted by $\sigma$:
$Std[X] = \sqrt{Var[X]}$

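
As a small worked example of these formulas, the sketch below computes the mean and variance of a fair six-sided die. The die and the helper name `mean_and_variance` are illustrative assumptions, not part of the notes; exact fractions are used so the identity $Var[X] = E[X^2] - \mu^2$ can be checked without rounding.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Return (E[X], Var[X]) for a pmf given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())  # E[X^2]
    return mu, second_moment - mu ** 2                      # Var = E[X^2] - mu^2

# Hypothetical example: a fair die with faces 1..6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var = mean_and_variance(die)
print(mu, var)  # 7/2 35/12
```
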
Basic Concept of Probability Theory
Fundamental Rules

Joint Probability of Two Events (Product Rule):
$p(A, B) = p(A \wedge B) = p(A|B)\,p(B)$
Probability of the Union of Two Events:

$p(A \vee B) = p(A) + p(B) - p(A \wedge B)$
$p(A \vee B) = p(A) + p(B)$ if $A$ and $B$ are mutually exclusive

Marginal Distribution (Sum Rule):
$p(A) = \sum_b p(A, B = b) = \sum_b p(A|B = b)\,p(B = b)$
Chain Rule (Product Rule applied several times):
$p(X_{1:D}) = p(X_1)\,p(X_2|X_1)\,p(X_3|X_2, X_1)\,p(X_4|X_3, X_2, X_1) \cdots p(X_D|X_{1:D-1})$
Conditional Probability:
$p(A|B) = \frac{p(A, B)}{p(B)}$, if $p(B) > 0$


Basic Concept of Probability Theory
Bayes Rule

Bayes Rule/Bayes Theorem is the combination of Conditional
Probability, the Product Rule and the Sum Rule:

$p(X = x|Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)} = \frac{p(Y = y|X = x)\,p(X = x)}{\sum_{x'} p(Y = y|X = x')\,p(X = x')}$

It is useful for obtaining the conditional probability $p(A|B)$ when we already know
$p(B|A)$ and $p(A)$.
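
A minimal sketch of Bayes Rule for a discrete variable follows. The diagnostic-test numbers (1% prior, 90% true-positive rate, 5% false-positive rate) and the function name `bayes_posterior` are made up for illustration and do not come from the notes.

```python
def bayes_posterior(prior, likelihood):
    """Posterior p(X=x | Y=y) for one observed y.

    prior:      {x: p(X=x)}
    likelihood: {x: p(Y=y | X=x)} evaluated at the observed y
    """
    joint = {x: likelihood[x] * prior[x] for x in prior}  # product rule
    evidence = sum(joint.values())                        # sum rule: p(Y=y)
    return {x: joint[x] / evidence for x in joint}

# Hypothetical numbers for a diagnostic test that came back positive.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.90, "healthy": 0.05}  # p(positive | x)
post = bayes_posterior(prior, likelihood)
print(post["disease"])
```

With these assumed numbers the posterior probability of disease is about 0.15: the low prior keeps it small even though the test is fairly accurate.
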


Basic Concept of Probability Theory
Independence

Unconditional Independence/Marginal Independence: denoted by
$X \perp Y$, satisfies:

$X \perp Y \iff p(X, Y) = p(X)\,p(Y)$

Conditional Independence: Unconditional independence is rare. In most cases, two variables are
independent only once we condition on other variables:

$X \perp Y | Z \iff p(X, Y|Z) = p(X|Z)\,p(Y|Z)$

Conditional independence has the following property:
$X \perp Y | Z$ iff there exist functions $g(\cdot)$ and $h(\cdot)$ such that:
$p(x, y|z) = g(x, z)\,h(y, z)$
for all $x, y, z$ with $p(z) > 0$.


Discrete Distributions
Binomial and Bernoulli Distributions

Binomial Distribution: When we toss a coin $n$ times, with the probability
of heads being $\theta$, the number of heads $X \in \{0, 1, ..., n\}$ has a binomial
distribution:
$Bin(k|n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$, where $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$
The Binomial Distribution has the following properties:
mean $= n\theta$, var $= n\theta(1 - \theta)$
Bernoulli Distribution: It is a special case of the binomial distribution where
$n = 1$. Thus,
$Ber(x|\theta) = \theta^{\mathbb{I}(x=1)} (1 - \theta)^{\mathbb{I}(x=0)}$
That is,
$Ber(x|\theta) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases}$

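
The binomial pmf and its moments can be verified directly. This is a sketch with assumed values $n = 10$, $\theta = 0.3$; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Bin(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

n, theta = 10, 0.3  # assumed example values
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]  # k = 0, 1, ..., n

total = sum(pmf)                                     # should be 1
mean = sum(k * p for k, p in enumerate(pmf))         # should be n * theta
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # n * theta * (1 - theta)
print(total, mean, var)
```

Setting `n = 1` gives the Bernoulli case: `binomial_pmf(1, 1, theta)` is $\theta$ and `binomial_pmf(0, 1, theta)` is $1 - \theta$.
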
Discrete Distributions
Multinomial and Multinoulli Distributions

Multinomial Distribution: If we instead toss a $K$-sided die $n$ times, with the
probability of landing on each side given by a vector
$\theta = (\theta_1, \theta_2, ..., \theta_K)$, let $x = (x_1, x_2, ..., x_K)$, where $x_j$ is the number
of times side $j$ occurs. Then $x$ follows a Multinomial Distribution:
$Mu(x|n, \theta) = \binom{n}{x_1, ..., x_K} \prod_{j=1}^K \theta_j^{x_j}$, where $\binom{n}{x_1, ..., x_K} = \frac{n!}{x_1!\,x_2! \cdots x_K!}$

Multinoulli Distribution: It is a special case of the Multinomial distribution
where $n = 1$. It is also called the Categorical Distribution/Discrete
Distribution.

$Cat(x|\theta) = Mu(x|1, \theta) = \prod_{j=1}^K \theta_j^{\mathbb{I}(x_j = 1)}$, here $p(x = j|\theta) = \theta_j$, with $x = j$ denoting that side $j$ occurred

Note that here $x$ is a binary vector whose elements are all 0 or 1,
and exactly one element is 1 (since only one side can occur). This representation is
also known as dummy encoding or one-hot encoding.


Discrete Distributions
Poisson Distribution

The Poisson Distribution models the number of times a certain event occurs
in a specified time interval, $X \in \{0, 1, 2, ...\}$. It has a
parameter $\lambda > 0$, which is the average number of events occurring in the
interval.
$Poi(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$
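
The following sketch evaluates the Poisson pmf and checks numerically that it sums to 1 and has mean $\lambda$. The rate $\lambda = 4$ and the truncation of the infinite sum at 100 terms are assumptions made for the example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poi(x | lam) = exp(-lam) * lam^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 4.0  # assumed average number of events per interval
# The support is infinite; 100 terms is more than enough for lam = 4.
pmf = [poisson_pmf(x, lam) for x in range(100)]

total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
print(total, mean)
```
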


Discrete Distributions
Empirical Distribution

The Empirical Distribution is obtained from observed data. For a dataset
$D = \{x_1, x_2, ..., x_N\}$:

$p_{emp}(A) = \frac{1}{N} \sum_{i=1}^N \delta_{x_i}(A)$, where $\delta_x(A) = 1$ if $x \in A$ and $0$ otherwise

More generally, each sample can be associated with a weight $w_i$:

$p(x) = \sum_{i=1}^N w_i \delta_{x_i}(x)$, where $0 \le w_i \le 1$ and $\sum_{i=1}^N w_i = 1$.
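
In code, the empirical measure of a set $A$ is simply the fraction of data points that fall in $A$. The sketch below uses a small made-up dataset and represents the set $A$ as a predicate; both are assumptions for illustration.

```python
def empirical_prob(data, event):
    """p_emp(A): fraction of data points x for which event(x) is true."""
    return sum(1 for x in data if event(x)) / len(data)

data = [2, 3, 3, 5, 7, 7, 7, 8]  # hypothetical dataset D, N = 8

p_odd = empirical_prob(data, lambda x: x % 2 == 1)  # A = odd numbers
p_le_3 = empirical_prob(data, lambda x: x <= 3)     # A = (-inf, 3]
print(p_odd, p_le_3)  # 0.75 0.375
```
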


Continuous Distributions
Normal Distribution

The Normal Distribution/Gaussian Distribution has the pdf:

$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$

Its cdf is:

$\Phi(x; \mu, \sigma^2) = \int_{-\infty}^x \mathcal{N}(z|\mu, \sigma^2)\,dz$

which is usually implemented as:

$\Phi(x; \mu, \sigma^2) = \frac{1}{2}[1 + erf(z/\sqrt{2})]$, where $z = (x - \mu)/\sigma$ and $erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$

Standard Normal Distribution: the Normal distribution $\mathcal{N}(0, 1)$.
Precision: precision $\lambda = 1/\sigma^2$. The higher the precision, the smaller the
variance and the narrower the distribution.
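
The erf-based form of the cdf translates directly into code. This is a minimal sketch using `math.erf` from the Python standard library; the function name `normal_cdf` and its default arguments are illustrative choices.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma2=1.0):
    """Phi(x; mu, sigma^2) = 0.5 * (1 + erf(z / sqrt(2))), z = (x - mu) / sigma."""
    z = (x - mu) / sqrt(sigma2)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(normal_cdf(0.0))                       # half of the mass lies below the mean
print(normal_cdf(1.96))                      # roughly 0.975
print(normal_cdf(1.96) - normal_cdf(-1.96))  # roughly 0.95
```
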


Continuous Distributions
Normal Distribution(Cont.)

Advantages of the Normal Distribution:
1 It has 2 parameters, the mean and the variance, which capture the most basic properties of a
distribution.
2 The sum of independent random variables has an approximately
Normal Distribution (Central Limit Theorem).
3 It makes the least number of assumptions (it has maximum entropy) for a given mean and variance.
4 It has a simple mathematical form, which is easy to implement.


Continuous Distributions
Degenerate pdf (Dirac Delta Function)

Dirac Delta Function: When the variance of a Normal Distribution
approaches 0, its pdf becomes infinitely thin and tall:

$\lim_{\sigma^2 \to 0} \mathcal{N}(x|\mu, \sigma^2) = \delta(x - \mu)$

Here $\delta$ is called the Dirac Delta Function. It is defined as:

$\delta(x) = \begin{cases} \infty, & \text{if } x = 0 \\ 0, & \text{if } x \ne 0 \end{cases}$, where $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$

The Dirac Delta Function has the sifting property: it selects out a single term
from a sum or integral:
$\int_{-\infty}^{\infty} f(x)\,\delta(x - \mu)\,dx = f(\mu)$


Continuous Distributions
Student t Distribution

Student t Distribution: Compared to the Normal Distribution, the Student t
Distribution is less affected by outliers. It has the following pdf:
$T(x|\mu, \sigma^2, v) \propto [1 + \frac{1}{v} (\frac{x - \mu}{\sigma})^2]^{-\frac{v+1}{2}}$

$v > 0$ is called the degrees of freedom.
The Student t Distribution has the following properties:
mode $= \mu$
mean $= \mu$, defined only if $v > 1$
variance $= \frac{v \sigma^2}{v - 2}$, defined only if $v > 2$


Continuous Distributions
Laplace Distribution

The Laplace Distribution/Double Sided Exponential Distribution is also a
distribution that is more robust to outliers than the Normal Distribution. It has
the following pdf:
$Lap(x|\mu, b) = \frac{1}{2b} \exp(-\frac{|x - \mu|}{b})$

It has the following properties:
mode $= \mu$
mean $= \mu$
variance $= 2b^2$


Continuous Distributions
Gamma Distribution

The Gamma Distribution is a flexible distribution for positive real-valued
random variables, $T > 0$. It is defined as follows:
$Ga(T|\text{shape} = a, \text{rate} = b) = \frac{b^a}{\Gamma(a)} T^{a-1} e^{-Tb}$,
where $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\,du$ is the gamma function and $a, b > 0$

The Gamma Distribution has the following properties:
mode $= \frac{a-1}{b}$ (for $a \ge 1$)
mean $= \frac{a}{b}$
variance $= \frac{a}{b^2}$


Continuous Distributions
Gamma Distribution(Cont.)

Special cases of the Gamma Distribution:
Exponential Distribution: is defined as $Expon(x|\lambda) = Ga(x|1, \lambda)$,
where $\lambda$ is the rate parameter. It describes the
time between two consecutive events in a Poisson process with rate $\lambda$.
Erlang Distribution: is a Gamma Distribution whose shape $a$ is an integer. It is commonly fixed to $a = 2$, giving $Erlang(x|\lambda) = Ga(x|2, \lambda)$.
Chi-squared Distribution: is defined by $\chi^2(x|v) = Ga(x|\frac{v}{2}, \frac{1}{2})$. It is
the distribution of the sum of squared standard Gaussian variables, i.e., if
$Z_i \sim \mathcal{N}(0, 1)$ independently, and $S = \sum_{i=1}^v Z_i^2$, then $S \sim \chi^2_v$.


Continuous Distributions
Gamma Distribution(Cont.)

Inverse Gamma Distribution: if $X \sim Ga(a, b)$, then $\frac{1}{X} \sim IG(a, b)$,
which is defined by:
$IG(x|a, b) = \frac{b^a}{\Gamma(a)} x^{-(a+1)} e^{-b/x}$

IG has the following properties:
mode $= \frac{b}{a+1}$
mean $= \frac{b}{a-1}$, only if $a > 1$
variance $= \frac{b^2}{(a-1)^2 (a-2)}$, only if $a > 2$


Continuous Distributions
Beta Distribution

The Beta Distribution has support on the interval [0,1] and is defined as:

$Beta(x|a, b) = \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1}$, where $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$

The Beta Distribution is only defined (integrable) when $a, b > 0$.
If $a = b = 1$, the distribution is the Uniform Distribution.
If $a, b < 1$, the distribution is bimodal, with
spikes at 0 and 1.
If $a, b > 1$, the distribution is unimodal, with
a single hump.
The Beta Distribution has the following properties:
mode $= \frac{a-1}{a+b-2}$ (for $a, b > 1$)
mean $= \frac{a}{a+b}$
variance $= \frac{ab}{(a+b)^2 (a+b+1)}$

Continuous Distributions
Pareto Distribution

The Pareto Distribution is used to model quantities that exhibit long
tails (heavy tails). It is defined as:

$Pareto(x|k, m) = k m^k x^{-(k+1)}\,\mathbb{I}(x \ge m)$

The Pareto Distribution has the following properties:
mode $= m$
mean $= \frac{km}{k-1}$, only if $k > 1$
variance $= \frac{m^2 k}{(k-1)^2 (k-2)}$, only if $k > 2$.


Joint Probability Distribution

A Joint Probability Distribution is defined over multiple variables and has the form
$p(x_1, x_2, ..., x_D)$ for a set of $D > 1$ variables.
If all variables are discrete, we can represent the Joint Probability as a
multi-dimensional array, with one variable per dimension. With $K$ states per variable, this array has $O(K^D)$ entries.
The number of parameters can be reduced by making Conditional
Independence assumptions, or, for continuous distributions, by restricting the pdf to a certain
functional form.


Joint Probability Distribution
Covariance

Covariance describes the degree to which two variables are linearly related:

$cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$

If $x$ is a $d$-dimensional random vector, its Covariance
Matrix is defined as:

$cov[x] = E[(x - E[x])(x - E[x])^T]$

$= \begin{pmatrix} var[X_1] & cov[X_1, X_2] & \cdots & cov[X_1, X_d] \\ cov[X_2, X_1] & var[X_2] & \cdots & cov[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ cov[X_d, X_1] & cov[X_d, X_2] & \cdots & var[X_d] \end{pmatrix}$


Joint Probability Distribution
Correlation

The Correlation Coefficient normalises the Covariance so that it is bounded, $-1 \le corr[X, Y] \le 1$:
$corr[X, Y] = \frac{cov[X, Y]}{\sqrt{var[X]\,var[Y]}}$

A Correlation Matrix $R$ is thus:

$R = \begin{pmatrix} corr[X_1, X_1] & corr[X_1, X_2] & \cdots & corr[X_1, X_d] \\ corr[X_2, X_1] & corr[X_2, X_2] & \cdots & corr[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ corr[X_d, X_1] & corr[X_d, X_2] & \cdots & corr[X_d, X_d] \end{pmatrix}$
The Correlation Coefficient measures the degree of linearity:
$corr[X, Y] = 1$ iff $Y = aX + b$ for some $a > 0$, i.e., X and Y have an exact increasing linear relationship ($corr[X, Y] = -1$ for $a < 0$).
$corr[X, Y] = 0$ means X and Y are uncorrelated.
If X, Y are independent (i.e., $p(X, Y) = p(X)p(Y)$), then $corr[X, Y] = 0$,
but not vice versa.

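
The claim that zero correlation does not imply independence can be illustrated with a standard counterexample, sketched below. The choice of $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$ is an assumption made for the example; exact fractions avoid rounding.

```python
from fractions import Fraction

third = Fraction(1, 3)
pmf_x = {-1: third, 0: third, 1: third}  # X uniform on {-1, 0, 1}, Y = X^2

def expectation(g):
    """E[g(X)] under pmf_x."""
    return sum(g(x) * p for x, p in pmf_x.items())

e_x = expectation(lambda x: x)        # E[X]
e_y = expectation(lambda x: x ** 2)   # E[Y]
e_xy = expectation(lambda x: x ** 3)  # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y             # cov[X, Y] = E[XY] - E[X]E[Y]

# Y = 0 happens exactly when X = 0, so p(X=0, Y=0) = p(X=0) = p(Y=0) = 1/3.
p_joint = pmf_x[0]
p_product = pmf_x[0] * pmf_x[0]
dependent = p_joint != p_product      # 1/3 != 1/9, so X and Y are dependent

print(cov_xy, dependent)  # 0 True
```

Here $Y$ is a deterministic function of $X$, yet the covariance (and hence the correlation) is exactly zero, because the relationship is not linear.
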
Joint Probability Distribution
Multivariate Gaussian

The Multivariate Gaussian in $D$ dimensions is defined as:

$\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)]$,

where $\mu$ is the mean vector of $x$, and $\Sigma$ is the $D \times D$ Covariance Matrix.


Joint Probability Distribution
Multivariate Student t Distribution

The Multivariate Student t Distribution is defined as:

$T(x|\mu, \Sigma, v) = \frac{\Gamma(v/2 + D/2)}{\Gamma(v/2)} \frac{|\Sigma|^{-1/2}}{v^{D/2} \pi^{D/2}} \times [1 + \frac{1}{v} (x - \mu)^T \Sigma^{-1} (x - \mu)]^{-\frac{v+D}{2}}$
$= \frac{\Gamma(v/2 + D/2)}{\Gamma(v/2)} |\pi V|^{-1/2} \times [1 + (x - \mu)^T V^{-1} (x - \mu)]^{-\frac{v+D}{2}}$
where $V = v\Sigma$

The Multivariate Student t Distribution has the following properties:
mode $= \mu$
mean $= \mu$ (for $v > 1$)
covariance $= \frac{v}{v - 2} \Sigma$ (for $v > 2$)


Joint Probability Distribution
Dirichlet Distribution

The Dirichlet Distribution has support over the probability simplex
$S_K = \{x : 0 \le x_k \le 1, \sum_{k=1}^K x_k = 1\}$. It is defined by:

$Dir(x|\alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^K x_k^{\alpha_k - 1}\,\mathbb{I}(x \in S_K)$,
where $B(\alpha) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)}$ and $\alpha_0 = \sum_{k=1}^K \alpha_k$

The Dirichlet Distribution has the following properties:
mode$[x_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$
mean$[x_k] = \frac{\alpha_k}{\alpha_0}$
variance$[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}$


Transformations of Random Variables
Linear Transformation

If $y = f(x) = Ax + b$, where $x$ has mean $\mu = E[x]$ and covariance $\Sigma = cov[x]$, then:
$E[y] = E[Ax + b] = A\mu + b$
$cov[y] = cov[Ax + b] = A \Sigma A^T$
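
A small numeric instance of these two identities is sketched below in plain Python. The matrices `A`, `b`, `mu` and `Sigma` are made-up values, and `matmul` and `transpose` are hypothetical helpers written for this example.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Assumed example: y1 = x1 + x2 + 0.5, y2 = x1 - x2.
A = [[1.0, 1.0], [1.0, -1.0]]
b = [0.5, 0.0]
mu = [1.0, 2.0]                   # E[x]
Sigma = [[2.0, 0.5], [0.5, 1.0]]  # cov[x]

mean_y = [sum(a * m for a, m in zip(row, mu)) + b_i for row, b_i in zip(A, b)]
cov_y = matmul(matmul(A, Sigma), transpose(A))

print(mean_y)  # [3.5, -1.0]
print(cov_y)   # [[4.0, 1.0], [1.0, 2.0]]
```

The top-left entry agrees with the scalar rule $var[X_1 + X_2] = var[X_1] + var[X_2] + 2\,cov[X_1, X_2] = 2 + 1 + 1 = 4$. The offset $b$ shifts the mean but does not affect the covariance.
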


Transformations of Random Variables
General Transformation

General Transformation: If we transform $x$ into $y = f(x)$:
if $x$ is discrete,
$p_y(y) = \sum_{x : f(x) = y} p_x(x)$
if $x$ is continuous, we work with the cdf first:
$P_y(y) = P(Y \le y) = P(f(X) \le y) = P(X \in \{x \,|\, f(x) \le y\})$
If $f$ is monotonically increasing, it can be inverted:
$P_y(y) = P(f(X) \le y) = P(X \le f^{-1}(y)) = P_x(f^{-1}(y))$
To obtain the pdf, we differentiate the cdf, writing $x = f^{-1}(y)$:
$p_y(y) = \frac{d}{dy} P_y(y) = \frac{d}{dy} P_x(f^{-1}(y)) = \frac{dx}{dy} \frac{d}{dx} P_x(x) = \frac{dx}{dy} p_x(x)$
If $f$ is monotonically decreasing, the same argument gives $-\frac{dx}{dy} p_x(x)$. Taking the absolute value covers both cases:
$p_y(y) = |\frac{dx}{dy}|\,p_x(x)$ (Change of Variables Formula)
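
The Change of Variables Formula can be sanity-checked by simulation. The sketch below assumes $X \sim Uniform(0, 1)$ and $Y = X^2$ (an example chosen here, not taken from the notes), so that $x = \sqrt{y}$, $|dx/dy| = 1/(2\sqrt{y})$ and $p_y(y) = 1/(2\sqrt{y})$ on $(0, 1]$. It compares this formula with a histogram-style density estimate around one point.

```python
import random
from math import sqrt

def density_y(y):
    """p_y(y) = |dx/dy| * p_x(x) for Y = X^2 with X ~ Uniform(0, 1)."""
    return 1.0 / (2.0 * sqrt(y))  # p_x(x) = 1 on (0, 1), dx/dy = 1 / (2 sqrt(y))

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [rng.random() ** 2 for _ in range(100_000)]

# Estimate the density near y0 as (fraction of samples in a small bin) / (bin width).
y0, half_width = 0.5, 0.01
in_bin = sum(y0 - half_width < y <= y0 + half_width for y in samples)
estimate = in_bin / len(samples) / (2 * half_width)

print(estimate, density_y(y0))  # both should be close to 0.707
```
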


Transformations of Random Variables
Multivariate Change of Variables

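If $f(\cdot)$ is a function that maps $x$ to $y$, where both are
$n$-dimensional vectors, then the role of $\frac{dy}{dx}$ is played by $|\det J_{x \to y}|$, where $J_{x \to y}$ is the
Jacobian Matrix:

$J_{x \to y} = \frac{\partial(y_1, y_2, ..., y_n)}{\partial(x_1, x_2, ..., x_n)} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n} \end{pmatrix}$

If $f(\cdot)$ is an invertible mapping, then according to the Change of Variables
Formula,

$p_y(y) = p_x(x)\,|\det J_{y \to x}|$, where $J_{y \to x} = \frac{\partial x}{\partial y}$ is the Jacobian of the inverse mapping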


Transformations of Random Variables
Central Limit Theorem

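Consider $N$ random variables $X_1, ..., X_N$ with pdfs (not necessarily Gaussian), each with the same mean $\mu$ and variance $\sigma^2$, where the
variables are independent and identically distributed (iid).
Let $S_N = \sum_{i=1}^N X_i$, i.e., the sum of all variables. As $N$ increases, the distribution of this sum approaches:
$p(S_N = s) = \frac{1}{\sqrt{2\pi N \sigma^2}} \exp(-\frac{(s - N\mu)^2}{2N\sigma^2})$

Hence the distribution of the standardised sum $Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}}$ converges to the standard normal.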
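
A quick simulation illustrates the theorem. The sketch below sums $N = 30$ iid $Uniform(0, 1)$ variables (an assumed example with $\mu = 1/2$, $\sigma^2 = 1/12$), standardises the sum, and checks how often it lands in $[-1.96, 1.96]$, which would be about 95% of the time for a standard normal.

```python
import random
from math import sqrt

rng = random.Random(0)          # fixed seed for reproducibility
N, trials = 30, 20_000          # assumed sizes for the experiment
mu, sigma2 = 0.5, 1.0 / 12.0    # mean and variance of Uniform(0, 1)

def standardised_sum():
    """Draw S_N and return Z_N = (S_N - N mu) / (sigma sqrt(N))."""
    s = sum(rng.random() for _ in range(N))
    return (s - N * mu) / sqrt(N * sigma2)

z = [standardised_sum() for _ in range(trials)]
coverage = sum(abs(v) <= 1.96 for v in z) / trials
z_mean = sum(z) / trials

print(coverage, z_mean)  # coverage near 0.95, mean near 0
```
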


Monte Carlo approximation

Monte Carlo approximation computes the distribution of a function $f(X)$ of a random variable by:
1 generating $S$ samples from the distribution of $X$, $x_1, x_2, ..., x_S$
2 approximating the distribution of $f(X)$ using the empirical distribution of
$\{f(x_s)\}_{s=1}^S$. For example, the expectation is approximated by the arithmetic mean of the function applied
to the samples:
$E[f(X)] = \int f(x)\,p(x)\,dx \approx \frac{1}{S} \sum_{s=1}^S f(x_s)$

From the samples drawn, we can approximate the following quantities:
$E[X] \approx \bar{x} = \frac{1}{S} \sum_{s=1}^S x_s$
$var[X] \approx \frac{1}{S} \sum_{s=1}^S (x_s - \bar{x})^2$
$median(X) \approx median\{x_1, x_2, ..., x_S\}$
$P(X \le c) \approx \frac{1}{S} \#\{x_s \le c\}$


Monte Carlo approximation (Cont.)

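The accuracy of a Monte Carlo approximation depends on the number of
samples drawn. By the Central Limit Theorem, the error of MC, i.e., the
difference between the sample mean $\hat{\mu}$ and the actual mean $\mu = E[f(X)]$, is approximately distributed as:
$(\hat{\mu} - \mu) \to \mathcal{N}(0, \frac{\sigma^2}{S})$,

where $\sigma^2 = var[f(X)]$, although unknown, can be estimated by:

$\hat{\sigma}^2 = \frac{1}{S} \sum_{s=1}^S (f(x_s) - \hat{\mu})^2$

Since for a Normal Distribution about 95% of the mass lies within 1.96 standard deviations of the mean,

$P\{\mu - 1.96 \frac{\hat{\sigma}}{\sqrt{S}} \le \hat{\mu} \le \mu + 1.96 \frac{\hat{\sigma}}{\sqrt{S}}\} \approx 0.95$,

where $\frac{\hat{\sigma}}{\sqrt{S}}$ is called the Standard Error.

We can obtain an estimate accurate to within $\pm\epsilon$ with probability at
least 95% by choosing a sample size $S \ge \frac{4\hat{\sigma}^2}{\epsilon^2}$ (this follows from requiring $1.96\,\hat{\sigma}/\sqrt{S} \le \epsilon$ and rounding 1.96 up to 2).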
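
The estimator and its standard error can be sketched as follows. The target $E[X^2]$ with $X \sim Uniform(0, 1)$ (true value $1/3$) is an assumed example, and `monte_carlo_mean` is a hypothetical helper name.

```python
import random
from math import sqrt

def monte_carlo_mean(f, sampler, S):
    """Return (mu_hat, standard_error) for E[f(X)] from S samples of X."""
    values = [f(sampler()) for _ in range(S)]
    mu_hat = sum(values) / S
    sigma2_hat = sum((v - mu_hat) ** 2 for v in values) / S  # estimate of var[f(X)]
    return mu_hat, sqrt(sigma2_hat / S)                      # sigma_hat / sqrt(S)

rng = random.Random(0)  # fixed seed for reproducibility
mu_hat, se = monte_carlo_mean(lambda x: x * x, rng.random, 50_000)

# Approximate 95% interval for the true mean.
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(mu_hat, se, (lower, upper))
```

Because the standard error shrinks like $1/\sqrt{S}$, halving the error requires roughly four times as many samples.
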


Information Theory

Definition: Information Theory is concerned with:


Data Compression/Source Coding: represent data in a compressed
fashion.
Error Correction/Channel Coding: transmit and store data in a
way that is robust to errors.

Relation to Probability Model:


Compressing data requires representing frequent messages with
short code words, and reserving longer code words for rarely used messages.
A good probability model is required for decoding messages sent over
noisy channels.


Information Theory
Entropy

The Entropy of a random variable $X$ with distribution $p$ is a measure of its
uncertainty. For a discrete variable with $K$ states, it is denoted by $H(X)$:

$H(X) = -\sum_{k=1}^K p(X = k) \log_2 p(X = k)$

Entropy has the following properties:
The unit when using $\log_2$ is called bits (binary digits). The unit
when using $\ln$ is called nats (natural units).
The Uniform distribution has the maximum Entropy, $H(X) = \log_2 K$ for a
$K$-ary random variable.
A Deterministic Distribution (all mass is on one state) has the minimum
Entropy, $H(X) = 0$.
If $X$ is a binary variable, and we denote $p(X = 1) = \theta$, we have the
Binary Entropy Function:
$H(X) = -[\theta \log_2 \theta + (1 - \theta) \log_2 (1 - \theta)]$

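
The entropy formula and its extreme cases can be checked with a few lines of code. In this sketch, `entropy` is an illustrative helper, and terms with zero probability are skipped using the usual convention $0 \log 0 = 0$ (a convention assumed here, not stated on the slide).

```python
from math import log2

def entropy(p):
    """H(X) in bits for a distribution given as a list of probabilities."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)  # convention: 0 log 0 = 0

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximum for K = 4: log2(4) = 2 bits
h_deterministic = entropy([1.0, 0.0, 0.0, 0.0])  # minimum: 0 bits
h_binary = entropy([0.1, 0.9])                 # binary entropy with theta = 0.1

print(h_uniform, abs(h_deterministic), h_binary)
```
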
Information Theory
KL Divergence

The Kullback-Leibler Divergence (KL Divergence) is a measure of
dissimilarity between two probability distributions $p$ and $q$:
$KL(p||q) = \sum_{k=1}^K p_k \log \frac{p_k}{q_k}$
$= \sum_{k=1}^K p_k \log p_k - \sum_{k=1}^K p_k \log q_k$
$= -H(p) + H(p, q)$
$H(p, q) = -\sum_{k=1}^K p_k \log q_k$ is called the cross entropy. It represents the average number of bits
needed to encode data coming from a source with distribution $p$ when we use model $q$ to define the code. Thus, the KL
Divergence is the average number of extra bits needed to encode the data because we used $q$ instead of the true distribution $p$.
The Information Inequality Theorem states that:
$KL(p||q) \ge 0$, with equality iff $p = q$.

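
The decomposition $KL(p||q) = -H(p) + H(p, q)$ can be verified numerically. The distributions `p` and `q` below are made-up examples, the helper names are illustrative, and base-2 logarithms are used so that all quantities are in bits.

```python
from math import log2

def entropy(p):
    return -sum(pk * log2(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k."""
    return -sum(pk * log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]    # assumed true distribution
q = [0.25, 0.75]  # assumed model

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, cross_entropy(p, q) - entropy(p))  # the two values agree
print(kl_qp)                                    # differs from kl_pq
```

The second printed line shows that the KL Divergence is not symmetric in its arguments, which is why it is called a divergence rather than a distance.
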
Information Theory
Mutual Information

Mutual Information (MI) shows how much knowing one variable $X$
tells us about another variable $Y$. It is defined as the KL Divergence between the
Joint Probability $p(X, Y)$ and the factored probability $p(X)p(Y)$:

$\mathbb{I}(X; Y) = KL(p(X, Y)||p(X)p(Y))$
$= \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$
$= H(X) - H(X|Y)$
$= H(Y) - H(Y|X)$
$H(Y|X)$ is called the Conditional Entropy, defined as $H(Y|X) = \sum_x p(x) H(Y|X = x)$.


Information Theory
Mutual Information (Cont.)

From the previous equation, we know:
$\mathbb{I}(X; Y) \ge 0$, and $\mathbb{I}(X; Y) = 0$ iff $p(X, Y) = p(X)p(Y)$, which means X and Y are
independent.
From the identities $\mathbb{I}(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$, we can
interpret MI as the reduction in uncertainty about X after observing
Y, or vice versa.

Pointwise Mutual Information (PMI) measures the discrepancy between two
events occurring together and what would be expected by
chance. It is defined as:
$PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)}$

Comparing this with the definition of MI, we see that MI is the expected value of PMI under $p(x, y)$.
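
The relationship between PMI and MI can be sketched for a small joint table. The joint distribution over two binary variables below is invented for illustration, and the helper names are hypothetical; logarithms are base 2, so MI is in bits.

```python
from math import log2

def marginals(joint):
    """Sum rule: return ({x: p(x)}, {y: p(y)}) from {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def pmi(joint, x, y):
    """PMI(x, y) = log2( p(x, y) / (p(x) p(y)) )."""
    px, py = marginals(joint)
    return log2(joint[(x, y)] / (px[x] * py[y]))

def mutual_information(joint):
    """MI as the expectation of PMI under p(x, y)."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)

# Assumed joint: X and Y agree 80% of the time.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

mi = mutual_information(joint)
mi_independent = mutual_information(independent)
print(mi, mi_independent)  # about 0.278 bits, and 0 for the independent table
```

Pairs that co-occur more often than chance, such as `(0, 0)` here, have positive PMI, while pairs that co-occur less often than chance have negative PMI. Their probability-weighted average is the non-negative MI.
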


Information Theory
Mutual Information (Cont.)

MI for continuous variables: We first need to discretise the variables by
dividing their range into bins.
The number and boundaries of the bins can be selected by trying many
different grids and taking the largest MI achieved among them. This statistic, suitably normalised,
is called the Maximal Information Coefficient (MIC):
$MIC = \max_{x, y : xy < B} \left[ \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}(X(G); Y(G))}{\log \min(x, y)} \right]$

$\mathcal{G}(x, y)$ is the set of 2d grids of size $x \times y$; $X(G), Y(G)$ are the
discretisations of the variables on the grid $G$; $B$ is a sample-size
dependent bound on the number of bins, usually set to $N^{0.6}$.
$MIC \in [0, 1]$, where 0 represents no relationship and 1 represents a noise-free
relationship (not limited to linear relationships).