
Notes on MLAPP

Wu Ziqing

Nanyang Technological University

14/07/2018

Outline

1 Basic Concept of Probability Theory
    Definition of Probability
    Basic Notations
    Discrete and Continuous Variables
    Mean and Variance
    Fundamental Rules
    Bayes Rule
    Independence

2 Discrete Distributions
    Binomial and Bernoulli Distributions
    Multinomial and Multinoulli Distributions
    Poisson Distribution
    Empirical Distribution

3 Continuous Distributions
    Normal Distribution
    Degenerate pdf
    Student t Distribution
    Laplace Distribution
    Gamma Distribution
    Beta Distribution
    Pareto Distribution

4 Joint Probability Distribution
    Covariance and Correlation
    Multivariate Gaussian
    Multivariate Student t Distribution
    Dirichlet Distribution

5 Transformations of Random Variables
    Linear Transformation
    General Transformation
    Multivariate Change of Variables
    Central Limit Theorem

6 Monte Carlo Approximation

7 Information Theory
    Entropy
    KL Divergence
    Mutual Information

Basic Concept of Probability Theory

Definition of Probability

Frequentist Interpretation: probability represents long-run frequencies of events.

Bayesian Interpretation: probability quantifies our uncertainty about something happening.

The Bayesian view lets us model uncertainty about events that do not have long-run frequencies.

Basic Concept of Probability Theory

Basic Notations

p(A) denotes the probability of event A. p(¬A) denotes the probability of the complement (the event "not A"):

p(¬A) = 1 − p(A)

Basic Concept of Probability Theory

Discrete and Continuous Variable(Discrete)

Discrete Random Variable: a variable which can take on any value from a finite or countably infinite set X. The notation p(X = x) denotes the probability of the event X = x.

p() is called a Probability Mass Function (pmf): 0 ≤ p(x) ≤ 1 and Σ_{x∈X} p(x) = 1.

Basic Concept of Probability Theory

Discrete and Continuous Variable(Continuous)

We define the Cumulative Distribution Function (cdf) as:

F(q) = p(X ≤ q)

The cdf is monotonically non-decreasing.

We define the Probability Density Function (pdf) as:

f(x) = (d/dx) F(x)

Thus the probability of a continuous variable being in a finite interval is:

P(a < X ≤ b) = ∫_a^b f(x) dx

The probability of a continuous variable being near one value x is:

P(x ≤ X ≤ x + dx) ≈ p(x) dx

Note that p(x) is allowed to take values > 1, so long as the density integrates to 1.
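As a quick numerical check of these definitions, here is a minimal sketch, assuming NumPy and SciPy are available (the standard normal is just an illustrative choice):

```python
import numpy as np
from scipy.stats import norm

# P(a < X <= b) = F(b) - F(a), i.e. the integral of the pdf over (a, b]
a, b = -1.0, 1.0
via_cdf = norm.cdf(b) - norm.cdf(a)

# Numerical integration of the pdf should agree
xs = np.linspace(a, b, 100_001)
via_pdf = np.trapz(norm.pdf(xs), xs)

print(via_cdf, via_pdf)  # both ~0.6827 for a standard normal
```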


Basic Concept of Probability Theory

Discrete and Continuous Variable(Continuous)(Cont.)

The value x_α satisfying P(X ≤ x_α) = α is called the α quantile of F.

Basic Concept of Probability Theory

Mean and Variance

Mean (Expected Value), denoted by µ:

For discrete variables: E[X] = Σ_{x∈X} x p(x)

For continuous variables: E[X] = ∫_X x p(x) dx

Variance: measures the 'spread' of the data, denoted by σ²:

var[X] = E[(X − µ)²] = E[X²] − µ²

Thus, E[X²] = µ² + σ².

Standard Deviation: the standard deviation has the same units as the data. It is denoted by σ:

Std[X] = √(var[X])
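A small sketch checking E[X²] = µ² + σ² on a made-up discrete pmf (the values are illustrative only):

```python
import numpy as np

x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])   # a valid pmf: sums to 1

mu = np.sum(x * p)                   # E[X]
var = np.sum((x - mu) ** 2 * p)      # E[(X - mu)^2]
ex2 = np.sum(x ** 2 * p)             # E[X^2]

print(np.isclose(ex2, mu**2 + var))  # True
print(np.sqrt(var))                  # Std[X]
```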


Basic Concept of Probability Theory

Fundamental Rules

Probability of the Joint of Two Events (Product Rule):

p(A, B) = p(A ∧ B) = p(A|B)p(B)

Probability of the Union of Two Events:

p(A ∨ B) = p(A) + p(B) − p(A ∧ B)
         = p(A) + p(B) if A and B are mutually exclusive

Marginal Distribution (Sum Rule):

p(A) = Σ_b p(A, B = b) = Σ_b p(A|B = b)p(B = b)

Chain Rule (Product Rule applied several times):

p(X1:D) = p(X1)p(X2|X1)p(X3|X2, X1)p(X4|X3, X2, X1)...p(XD|X1:D−1)

Conditional Probability:

p(A|B) = p(A, B) / p(B), if p(B) > 0

Basic Concept of Probability Theory

Bayes Rule

Bayes rule follows from the definition of Conditional Probability, the Product Rule and the Sum Rule:

p(X = x|Y = y) = p(X = x, Y = y) / p(Y = y)
              = p(Y = y|X = x)p(X = x) / Σ_{x′} p(Y = y|X = x′)p(X = x′)

It lets us compute p(A|B) from p(B|A) and p(A).
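A minimal numerical sketch of Bayes rule; the prior and test accuracies below are hypothetical numbers chosen for illustration, not values from the text:

```python
# Hypothetical diagnostic-test numbers, chosen for illustration only.
prior = 0.004            # p(disease)
sensitivity = 0.80       # p(positive | disease)
false_pos = 0.10         # p(positive | no disease)

# Sum rule: p(positive) = sum over both disease states
evidence = sensitivity * prior + false_pos * (1 - prior)

# Bayes rule: p(disease | positive)
posterior = sensitivity * prior / evidence
print(posterior)  # ~0.031: a positive test still leaves a low probability
```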

Basic Concept of Probability Theory

Independence

Unconditional Independence, denoted X ⊥ Y, satisfies:

p(X, Y) = p(X)p(Y)

Conditional Independence, X ⊥ Y | Z, lets two variables depend on each other only via other variables:

X ⊥ Y | Z iff there exist functions g() and h() such that:

p(x, y|z) = g(x, z)h(y, z)

Discrete Distributions

Binomial and Bernoulli Distributions

Binomial Distribution: toss a coin n times, with the probability of heads being θ; the number of heads k ∈ {0, 1, ..., n} has a binomial distribution:

Bin(k|n, θ) = C(n, k) θ^k (1 − θ)^(n−k), where C(n, k) = n! / (k!(n − k)!)

mean = nθ, var = nθ(1 − θ)

Bernoulli Distribution: a special case of the binomial distribution where n = 1. Thus,

Ber(x|θ) = θ^I(x=1) (1 − θ)^I(x=0)

That is,

Ber(x|θ) = θ if x = 1, and 1 − θ if x = 0
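A sketch using scipy.stats, recovering Bernoulli as the n = 1 case:

```python
from scipy.stats import binom, bernoulli

n, theta = 10, 0.3
print(binom.pmf(3, n, theta))       # P(X = 3) for Bin(n=10, theta=0.3)
print(binom.mean(n, theta))         # n * theta = 3.0
print(binom.var(n, theta))          # n * theta * (1 - theta) = 2.1

# Bernoulli = Binomial with n = 1
print(bernoulli.pmf(1, theta), binom.pmf(1, 1, theta))  # both 0.3
```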


Discrete Distributions

Multinomial and Multinoulli Distributions

Multinomial Distribution: toss a K-sided die n times, with the probability of landing on each side represented by a vector θ = (θ1, θ2, ..., θK). Let x = (x1, x2, ..., xK), where xj is the number of times the jth side occurs; then x follows a Multinomial Distribution:

Mu(x|n, θ) = C(n; x1, ..., xK) Π_{j=1}^K θj^xj, where C(n; x1, ..., xK) = n! / (x1! x2! ... xK!)

Multinoulli Distribution: the special case where n = 1. It is also called the Categorical Distribution / Discrete Distribution:

Mu(x|1, θ) = Π_{j=1}^K θj^I(xj=1), where p(xj = 1|θ) = θj

Here x is a binary vector in which one and only one dimension can be 1 (since only one side will occur). This representation is also known as dummy encoding or one-hot encoding.
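A sketch using scipy.stats.multinomial (the counts in x must sum to n; the numbers are illustrative):

```python
import numpy as np
from scipy.stats import multinomial

n = 6
theta = np.array([0.2, 0.3, 0.5])      # probabilities of the K = 3 sides
x = np.array([1, 2, 3])                # counts per side, summing to n

print(multinomial.pmf(x, n, theta))    # Mu(x | n, theta)

# Multinoulli / categorical = one die roll (n = 1), one-hot outcome
print(multinomial.pmf([0, 1, 0], 1, theta))  # = theta[1] = 0.3
```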

Discrete Distributions

Poisson Distribution

Poisson Distribution: describes the number of occurrences of a certain event in a specified time interval. It has a parameter λ, the average number of events occurring in the interval:

Poi(x|λ) = e^(−λ) λ^x / x!, for x ∈ {0, 1, 2, ...}
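A sketch with scipy.stats.poisson (λ = 4 is illustrative); note that the Poisson mean and variance both equal λ:

```python
from scipy.stats import poisson

lam = 4.0                      # average number of events per interval
print(poisson.pmf(2, lam))     # P(X = 2) = exp(-4) * 4^2 / 2!
print(poisson.mean(lam), poisson.var(lam))  # both equal lambda
```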

Discrete Distributions

Empirical Distribution

Empirical Distribution: given a data set D = {x1, x2, ..., xN}:

p_emp(A) = (1/N) Σ_{i=1}^N δ_{xi}(A), where δ_x(A) = 1 if x ∈ A, and 0 otherwise

Each sample can also be associated with a weight:

p(x) = Σ_{i=1}^N wi δ_{xi}(x), where 0 ≤ wi ≤ 1 and Σ_{i=1}^N wi = 1
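A minimal sketch of the unweighted empirical distribution as the fraction of data points falling in a set A (the data values are made up):

```python
import numpy as np

data = np.array([2.0, 3.5, 3.5, 7.0, 9.0])   # D = {x_1, ..., x_N}

def p_emp(lo, hi):
    """p_emp(A) for an interval A = (lo, hi]: fraction of points in A."""
    return np.mean((data > lo) & (data <= hi))

print(p_emp(3.0, 8.0))   # 3 of 5 points fall in (3, 8] -> 0.6
```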

Continuous Distributions

Normal Distribution

The Normal (Gaussian) Distribution has pdf:

N(x|µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

Its cdf is:

Φ(x; µ, σ²) = ∫_{−∞}^x N(z|µ, σ²) dz

which can be computed via the error function:

Φ(x; µ, σ²) = (1/2)[1 + erf(z/√2)], where z = (x − µ)/σ and erf(x) = (2/√π) ∫_0^x e^(−t²) dt

Precision: the precision is λ = 1/σ². The higher the precision, the smaller the variance and the narrower the distribution.
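A sketch relating the cdf and erf forms, and illustrating precision, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erf

mu, sigma = 1.0, 2.0
x = 2.5
z = (x - mu) / sigma

# Phi(x; mu, sigma^2) two ways: via the cdf and via erf
print(norm.cdf(x, loc=mu, scale=sigma))
print(0.5 * (1 + erf(z / np.sqrt(2))))   # same value

# precision = 1 / sigma^2: higher precision -> taller, narrower pdf
print(norm.pdf(mu, mu, 0.5) > norm.pdf(mu, mu, 2.0))  # True
```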

Continuous Distributions

Normal Distribution(Cont.)

The Normal Distribution is the most widely used distribution because:

1 It has two parameters which capture the most basic properties of a distribution (mean and variance).

2 The sum of independent random variables has an approximately Normal distribution (Central Limit Theorem).

3 It makes the least number of assumptions (it has maximum entropy, subject to the given mean and variance).

4 It has a simple mathematical form, which is easy to implement.

Continuous Distributions

Degenerate pdf (Dirac Delta Function)

As the variance of a Gaussian approaches 0, its pdf becomes infinitely thin and tall:

lim_{σ²→0} N(x|µ, σ²) = δ(x − µ)

δ(x) = ∞ if x = 0, and 0 if x ≠ 0, where ∫_{−∞}^∞ δ(x) dx = 1

The Dirac Delta Function has a sifting property: it selects out a single term from a sum or integral:

∫_{−∞}^∞ f(x) δ(x − µ) dx = f(µ)

Continuous Distributions

Student t Distribution

Because it has heavier tails than the Gaussian, the Student t Distribution is less affected by outliers. It has the following pdf:

T(x|µ, σ², ν) ∝ [1 + (1/ν)((x − µ)/σ)²]^(−(ν+1)/2)

The Student t Distribution has the following properties:

mode = µ

mean = µ, only if ν > 1

variance = νσ²/(ν − 2), only if ν > 2
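A small sketch comparing tail mass under the Gaussian and the Student t, which is what makes the t robust to outliers:

```python
from scipy.stats import t, norm

# P(|X| > 4) under a standard normal vs a t with nu = 2
print(2 * norm.sf(4))        # ~6.3e-05
print(2 * t.sf(4, df=2))     # ~0.057: far more mass in the tails
```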

Continuous Distributions

Laplace Distribution

Laplace Distribution: its heavy tails make the distribution more robust to outliers compared to the Normal Distribution. It has the following pdf:

Lap(x|µ, b) = (1/(2b)) exp(−|x − µ|/b)

mode = µ

mean = µ

variance = 2b²

Continuous Distributions

Gamma Distribution

Gamma Distribution: a flexible distribution for positive real-valued random variables. It is defined as follows:

Ga(T|shape = a, rate = b) = (b^a/Γ(a)) T^(a−1) e^(−Tb), where Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du

mode = (a − 1)/b

mean = a/b

variance = a/b²

Continuous Distributions

Gamma Distribution(Cont.)

Exponential Distribution: defined as Expon(x|λ) = Ga(x|1, λ), where λ is the rate parameter of the corresponding Poisson process. It describes the time between two consecutive events in a Poisson process.

Erlang Distribution: defined as Erlang(x|λ) = Ga(x|2, λ), with the shape parameter fixed to 2 here.

Chi-squared Distribution: defined as χ²(x|ν) = Ga(x|ν/2, 1/2). It is the distribution of the sum of squared Gaussian variables, i.e., if Zi ∼ N(0, 1) and S = Σ_{i=1}^ν Zi², then S ∼ χ²_ν.
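A simulation sketch of the chi-squared connection (assuming NumPy/SciPy; the seed and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
nu = 5
Z = rng.standard_normal((100_000, nu))
S = (Z ** 2).sum(axis=1)            # S = sum_i Z_i^2

print(S.mean(), chi2.mean(nu))      # both ~5 (mean of chi^2_nu is nu)
print(S.var(), chi2.var(nu))        # both ~10 (variance is 2 * nu)
```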

Continuous Distributions

Gamma Distribution(Cont.)

Inverse Gamma Distribution: if X ∼ Ga(a, b), then 1/X ∼ IG(a, b), which is defined by:

IG(x|a, b) = (b^a/Γ(a)) x^(−(a+1)) e^(−b/x)

mode = b/(a + 1)

mean = b/(a − 1), only if a > 1

variance = b²/((a − 1)²(a − 2)), only if a > 2

Continuous Distributions

Beta Distribution

The Beta Distribution has support over the interval [0, 1]:

Beta(x|a, b) = (1/B(a, b)) x^(a−1) (1 − x)^(b−1), where B(a, b) = Γ(a)Γ(b)/Γ(a + b)

If a = b = 1, the distribution reduces to a Uniform Distribution.

If a, b < 1, the distribution is bimodal, with spikes at 0 and 1.

If a, b > 1, the distribution is unimodal, with a hill shape.

The Beta Distribution has the following properties:

mode = (a − 1)/(a + b − 2)

mean = a/(a + b)

variance = ab/((a + b)²(a + b + 1))
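A quick check of the Beta moment formulas with scipy.stats (a = 2, b = 5 is an arbitrary choice):

```python
from scipy.stats import beta

a, b = 2.0, 5.0
print(beta.mean(a, b))     # a / (a + b) = 2/7
print(beta.var(a, b))      # ab / ((a + b)^2 (a + b + 1)) = 10/392
```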


Continuous Distributions

Pareto Distribution

Pareto Distribution: used to model the distribution of quantities that exhibit long (heavy) tails. It is defined as:

Pareto(x|k, m) = k m^k x^(−(k+1)) I(x ≥ m)

mode = m

mean = km/(k − 1), only if k > 1

variance = m²k/((k − 1)²(k − 2)), only if k > 2

Joint Probability Distribution

A Joint Probability Distribution has multiple variables and takes the form p(x1, x2, ..., xD) for a set of D variables.

If all variables are discrete, we can represent the joint distribution as a multi-dimensional array, with one variable per dimension.

The size of this high-dimensional array can be reduced by making Conditional Independence assumptions or, for continuous distributions, by restricting the pdf to certain functional forms.

Joint Probability Distribution

Covariance

Covariance describes the degree to which two variables are linearly related:

cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

For a d-dimensional random vector x, the Covariance Matrix is defined as:

cov[x] = | var[X1]      cov[X1, X2]  ...  cov[X1, Xd] |
         | cov[X2, X1]  var[X2]      ...  cov[X2, Xd] |
         | ...          ...          ...  ...         |
         | cov[Xd, X1]  cov[Xd, X2]  ...  var[Xd]     |

Joint Probability Distribution

Correlation

The (Pearson) Correlation Coefficient is a normalised measure of covariance with a finite bound:

corr[X, Y] = cov[X, Y] / √(var[X] var[Y])

The Correlation Matrix is defined as:

R = | corr[X1, X1]  corr[X1, X2]  ...  corr[X1, Xd] |
    | corr[X2, X1]  corr[X2, X2]  ...  corr[X2, Xd] |
    | ...           ...           ...  ...          |
    | corr[Xd, X1]  corr[Xd, X2]  ...  corr[Xd, Xd] |

The Correlation Coefficient measures the degree of linearity:

corr[X, Y] = 1 iff X and Y have a linear relationship (Y = aX + b).

corr[X, Y] = 0 means X and Y are uncorrelated.

If X and Y are independent (i.e., p(X, Y) = p(X)p(Y)), then corr[X, Y] = 0, but not vice versa; see the sketch below.
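A sketch of that last point: Y = X² is fully determined by X, yet for X ~ N(0, 1) their correlation is approximately zero (the construction is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x ** 2                      # fully dependent on x, but not linearly

print(np.corrcoef(x, y)[0, 1])  # ~0: uncorrelated despite dependence

# Covariance and correlation matrices for a 2-variable data set
data = np.stack([x, y])         # rows are variables
print(np.cov(data))             # covariance matrix
print(np.corrcoef(data))        # correlation matrix R
```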


Joint Probability Distribution

Multivariate Gaussian

N(x|µ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp[−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)],

where µ = E[x] is the D-dimensional mean vector and Σ = cov[x] is the D × D covariance matrix.
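A sketch evaluating this density two ways, via scipy.stats.multivariate_normal and via the explicit formula (µ and Σ are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

mvn = multivariate_normal(mean=mu, cov=Sigma)
x = np.array([0.5, 0.5])
print(mvn.pdf(x))

# Agrees with the explicit formula
D = 2
quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
norm_const = 1 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(norm_const * np.exp(-0.5 * quad))  # same value
```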

Joint Probability Distribution

Multivariate Student t Distribution

T(x|µ, Σ, ν) = (Γ(ν/2 + D/2)/Γ(ν/2)) · (|Σ|^(−1/2)/(ν^(D/2) π^(D/2))) · [1 + (1/ν)(x − µ)ᵀ Σ⁻¹ (x − µ)]^(−(ν+D)/2)

            = (Γ(ν/2 + D/2)/Γ(ν/2)) · |πV|^(−1/2) · [1 + (x − µ)ᵀ V⁻¹ (x − µ)]^(−(ν+D)/2)

where V = νΣ

mode = µ

mean = µ

variance = ν/(ν − 2) Σ, only if ν > 2

Joint Probability Distribution

Dirichlet Distribution

The Dirichlet Distribution is a multivariate generalisation of the Beta distribution, with support over the probability simplex SK = {x : 0 ≤ xk ≤ 1, Σ_{k=1}^K xk = 1}. It is defined by:

Dir(x|α) = (1/B(α)) Π_{k=1}^K xk^(αk−1) I(x ∈ SK),

where B(α) = Π_{k=1}^K Γ(αk) / Γ(α0) and α0 = Σ_{k=1}^K αk

mode[xk] = (αk − 1)/(α0 − K)

mean[xk] = αk/α0

variance[xk] = αk(α0 − αk)/(α0²(α0 + 1))
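A sampling sketch checking the Dirichlet mean (the α vector is arbitrary; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=100_000)  # each row lies on the simplex

print(samples.sum(axis=1)[:3])        # each sample sums to 1
print(samples.mean(axis=0))           # ~ alpha / alpha_0 = [0.2, 0.3, 0.5]
```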

Transformations of Random Variables

Linear Transformation

If y = f(x) = Ax + b, then:

E[y] = E[Ax + b] = Aµ + b, where µ = E[x]

cov[y] = cov[Ax + b] = AΣAᵀ, where Σ = cov[x]
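A Monte Carlo sketch verifying both rules (A, b, µ, Σ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, 0.5])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b                      # y = A x + b, applied row-wise

print(y.mean(axis=0), A @ mu + b)    # E[y] ~= A mu + b
print(np.cov(y.T))                   # ~= A Sigma A^T
print(A @ Sigma @ A.T)
```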

Transformations of Random Variables

General Transformation

If x is discrete:

py(y) = Σ_{x : f(x) = y} px(x)

If x is continuous, we work with the cdf first:

Py(y) = P(Y ≤ y) = P(f(X) ≤ y) = P(X ∈ {x | f(x) ≤ y})

If f is monotonic, it is invertible, so:

Py(y) = P(f(X) ≤ y) = P(X ≤ f⁻¹(y)) = Px(f⁻¹(y))

To obtain the pdf, we differentiate the cdf (writing x = f⁻¹(y)):

py(y) = (d/dy) Py(y) = (d/dy) Px(f⁻¹(y)) = (dx/dy) px(x)

Since the sign of the derivative does not matter, we take the absolute value:

py(y) = |dx/dy| px(x) (Change of Variables Formula)
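A worked sketch for the monotonic map y = eˣ with x ~ N(0, 1): here x = ln y and dx/dy = 1/y, so the formula gives py(y) = px(ln y)/y, which is the standard log-normal density:

```python
import numpy as np
from scipy.stats import norm, lognorm

y = 2.0
x = np.log(y)                 # x = f^{-1}(y), with dx/dy = 1/y

# p_y(y) = |dx/dy| * p_x(x) = p_x(log y) / y
manual = norm.pdf(x) / y
print(manual)
print(lognorm.pdf(y, s=1.0))  # SciPy's log-normal agrees
```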

Transformations of Random Variables

Multivariate Change of Variables

If f maps n-dimensional vectors to n-dimensional vectors, then |dy/dx| generalises to |det J_{x→y}|, where J is the Jacobian Matrix:

J_{x→y} = ∂(y1, y2, ..., yn)/∂(x1, x2, ..., xn) = | ∂y1/∂x1  ∂y1/∂x2  ...  ∂y1/∂xn |
                                                  | ∂y2/∂x1  ∂y2/∂x2  ...  ∂y2/∂xn |
                                                  | ...      ...      ...  ...     |
                                                  | ∂yn/∂x1  ∂yn/∂x2  ...  ∂yn/∂xn |

Thus the multivariate Change of Variables Formula is:

py(y) = px(x) |det J_{y→x}|

Transformations of Random Variables

Central Limit Theorem

Consider N random variables with pdfs p(xi), each with the same mean µ and variance σ², i.e., the variables are independent and identically distributed (iid).

Let SN = Σ_{i=1}^N Xi, the sum of all the variables. As N increases, the distribution of the sum approaches:

p(SN = s) = (1/√(2πNσ²)) exp(−(s − Nµ)²/(2Nσ²))
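A simulation sketch of the CLT using uniform (clearly non-Gaussian) summands:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
mu, sigma2 = 0.5, 1.0 / 12.0          # mean and variance of Uniform(0, 1)

S = rng.random((100_000, N)).sum(axis=1)   # S_N = sum of N iid uniforms

print(S.mean(), N * mu)                # ~= N * mu
print(S.var(), N * sigma2)             # ~= N * sigma^2
# A histogram of S looks approximately N(N*mu, N*sigma^2),
# even though each summand is uniform.
```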

Monte Carlo approximation

Computing the distribution of a function of a random variable with the change of variables formula can be difficult. Monte Carlo approximation offers a simple alternative:

1 Generate S samples x1, x2, ..., xS from the distribution.

2 Approximate the distribution of f(X) using the empirical distribution of {f(xs)}, s = 1, ..., S, e.g. by calculating the arithmetic mean of the function applied to the samples:

E[f(x)] = ∫ f(x)p(x)dx ≈ (1/S) Σ_{s=1}^S f(xs)

In particular:

E[X] ≈ x̄ = (1/S) Σ_{s=1}^S xs

P(X ≤ c) ≈ (1/S) #{xs ≤ c}
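A minimal sketch of both approximations for f(x) = x² with x ~ N(0, 1), where the true values are E[x²] = 1 and P(X ≤ 1) ≈ 0.841:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000
xs = rng.standard_normal(S)            # samples from p(x)

fx = xs ** 2                           # f applied to each sample
print(fx.mean())                       # ~1.0 = E[x^2] for N(0, 1)
print(np.mean(xs <= 1.0))              # ~0.841 = P(X <= 1)
```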

Monte Carlo approximation (Cont.)

The accuracy of the MC approximation increases with the number of samples drawn. By the Central Limit Theorem, the error of MC, i.e., the difference between the actual mean and the sample mean, satisfies:

µ̂ − µ → N(0, σ²/S),

where σ̂/√S is called the (numerical) Standard Error.

We can ensure the error is within ε with probability at least 95% by choosing the sample size S ≥ 4σ̂²/ε².

Information Theory

Data Compression/Source Coding: represent data in a compressed

fashion.

Error Correction/Channel Coding: transmit and store data in a

way that is robust to errors.

Compressing data requires representing frequent messages with short code words, reserving longer code words for rarely used messages.

A good probability model is required for decoding messages sent over

noisy channels.

Information Theory

Entropy

Entropy measures a random variable's uncertainty. It is denoted by H(X):

H(X) = −Σ_{k=1}^K p(X = k) log2 p(X = k)

The unit when using log2 is called bits (binary digits). The unit when using ln is called nats.

The uniform distribution has the maximum entropy: H(X) = log2 K for a K-ary random variable.

A deterministic distribution (all mass on one state) has the minimum entropy: H(X) = 0.

If X is a binary variable and we denote p(X = 1) = θ, we have the Binary Entropy Function:

H(X) = −[θ log2 θ + (1 − θ) log2(1 − θ)]
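A minimal sketch of these entropy facts (the helper function below is ad hoc, treating 0 · log 0 as 0):

```python
import numpy as np

def entropy(p):
    """H(X) in bits; 0 * log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

K = 4
print(entropy(np.full(K, 1 / K)))   # uniform: log2(K) = 2.0 bits
print(entropy([1.0, 0.0, 0.0]))     # deterministic: 0.0 bits
theta = 0.5
print(entropy([theta, 1 - theta]))  # binary entropy, maximal 1.0 at 0.5
```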


Information Theory

KL Divergence

KL Divergence (relative entropy) measures the dissimilarity between two probability distributions p and q:

KL(p||q) = Σ_{k=1}^K pk log(pk/qk)
         = Σ_{k=1}^K pk log pk − Σ_{k=1}^K pk log qk
         = −H(p) + H(p, q)

H(p, q) is called the cross entropy. It is the average number of bits needed to encode data from a source with distribution p using a code based on model q. Thus, KL Divergence is the average number of extra bits needed.

The Information Inequality Theorem states that:

KL(p||q) ≥ 0, with equality iff p = q.
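A sketch of the decomposition KL(p||q) = H(p, q) − H(p), using bits and made-up distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in bits for discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])

H_p = -np.sum(p * np.log2(p))             # entropy H(p)
H_pq = -np.sum(p * np.log2(q))            # cross entropy H(p, q)

print(kl(p, q), H_pq - H_p)               # same value, and >= 0
print(kl(p, p))                           # 0: equality iff p = q
```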


Information Theory

Mutual Information

Mutual Information (MI) measures how much knowing one variable X tells us about another variable Y. It is defined as the KL Divergence between the joint distribution p(X, Y) and the factored distribution p(X)p(Y):

I(X; Y) = KL(p(X, Y) || p(X)p(Y))
        = Σ_x Σ_y p(x, y) log(p(x, y)/(p(x)p(y)))
        = H(X) − H(X|Y)
        = H(Y) − H(Y|X)

H(Y|X) is called the Conditional Entropy: H(Y|X) = Σ_x p(x) H(Y|X = x).
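A sketch computing MI directly from a (hypothetical) joint table as KL(p(X, Y) || p(X)p(Y)):

```python
import numpy as np

# A hypothetical 2x2 joint distribution p(x, y), chosen for illustration
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)

mi = np.sum(pxy * np.log2(pxy / (px * py)))
print(mi)                             # > 0: X tells us something about Y

indep = px * py                       # a factored joint has MI = 0
print(np.sum(indep * np.log2(indep / (px * py))))  # 0.0
```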

Information Theory

Mutual Information (Cont.)

I(X; Y) = 0 if and only if p(X, Y) = p(X)p(Y), i.e., X and Y are independent.

According to the last two lines of the previous equation, we can interpret MI as the reduction in uncertainty about X after observing Y, or vice versa.

Pointwise Mutual Information (PMI) measures how much more likely two events occur together than would be expected by chance. It is defined as:

PMI(x, y) = log(p(x, y)/(p(x)p(y))) = log(p(x|y)/p(x)) = log(p(y|x)/p(y))

From these equations, MI is the expected value of PMI.

Information Theory

Mutual Information (Cont.)

For continuous variables, MI can be approximated by first discretising (binning) the variables into different bins.

The sizes and boundaries of the bins can be selected by trying many combinations and taking the largest normalised MI among them. The resulting statistic is called the Maximal Information Coefficient (MIC):

MIC = max_{x,y : xy < B} [ max_{G∈G(x,y)} I(X(G); Y(G)) / log min(x, y) ]

where G(x, y) is the set of 2d grids of size x × y, X(G) and Y(G) represent a discretisation of the variables onto this grid, and B is a sample-size dependent bound, usually set to N^0.6.

MIC ∈ [0, 1], where 0 represents no relationship and 1 represents a noise-free relationship of any form (not limited to linear relationships).
