
Introduction to Probability and Statistics (Continued)
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 18, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Random Variables, CDF, Mean and Variance
 Uniform Distribution, Gaussian, The binomial and Bernoulli distributions
 The multinomial and multinoulli distributions, Poisson distribution
 Empirical distribution

 The goals for today's lecture:
 Understand the definition of random variables, CDFs and PDFs
 Learn about the mean, variance and higher moments of a distribution
 Familiarize ourselves with the Uniform, Gaussian, Bernoulli, Binomial, Multinomial, Multinoulli, Poisson and Empirical distributions
 Get a first look at the concept of maximum likelihood for parameter estimation
References
• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.



Random Variables
 Define Ω to be a probability space equipped with a probability measure 𝑃 that measures the probability of events 𝐸 ⊂ Ω. Ω contains all possible events in the form of its own subsets.

 A real valued random variable 𝑋 is a mapping 𝑋: Ω → ℝ.

 We call 𝑥 = 𝑋(𝜔), 𝜔 ∈ Ω, a realization of 𝑋.

 Probability distribution of 𝑋: For 𝐵 ⊂ ℝ,

$$P_X(B) = P\left(X^{-1}(B)\right) = P\left(\{\omega : X(\omega) \in B\}\right)$$

 Probability density:

$$P_X(B) = \int_B p_X(x)\,dx$$

 We often write: $p_X(x) \equiv p(x)$
Cumulative Distribution Function (CDF)
$$p(x) \ge 0, \qquad F(z) = \int_{-\infty}^{z} p(x)\,dx \quad \text{(cumulative distribution function)}$$

$$\int_{-\infty}^{\infty} p(x)\,dx = 1$$

$$P\left(x \in (a,b)\right) = \int_a^b p(x)\,dx = F(b) - F(a)$$

 The CDF for a random variable 𝑋 is the function 𝐹(𝑥) that returns the probability that 𝑋 is less than or equal to 𝑥.



Mean and Variance
 The expected (mean) value of 𝑋 is:

$$\mathbb{E}[X] = \int x\, p_X(x)\,dx$$

 The variance of 𝑋 is defined as:

$$\mathrm{var}[X] = \mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^2\right] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$$

 The standard deviation, often useful in practice, is defined as

$$\mathrm{std}[X] = \sqrt{\mathrm{var}[X]}$$
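As a quick numerical check of these definitions, here is a minimal Python/SciPy sketch (the course demos are in MATLAB; the exponential density below is an assumption chosen purely for illustration):

```python
import numpy as np
from scipy.integrate import quad

# Illustrative density (an assumption, not from the slides): Exponential(1), p(x) = exp(-x), x >= 0
p = lambda x: np.exp(-x)

mean, _ = quad(lambda x: x * p(x), 0, np.inf)              # E[X]
second_moment, _ = quad(lambda x: x**2 * p(x), 0, np.inf)  # E[X^2]
var = second_moment - mean**2                              # var[X] = E[X^2] - (E[X])^2
std = np.sqrt(var)                                         # std[X] = sqrt(var[X])

print(mean, var, std)   # 1.0, 1.0, 1.0 for Exponential(1)
```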



Mean and Variance
 If 𝑇(𝑋) is a function of 𝑋 (called a statistic), the mean of 𝑇(𝑋) is given by

$$\mathbb{E}[T(X)] = \int T(x)\, p_X(x)\,dx$$

 The variance of 𝑇(𝑋) is defined analogously:

$$\mathrm{var}[T(X)] = \mathbb{E}\left[\left(T(X) - \mathbb{E}[T(X)]\right)^2\right] = \mathbb{E}[T(X)^2] - \left(\mathbb{E}[T(X)]\right)^2$$

 The 𝑘th central moment is defined as (𝑘 = 3 relates to skewness, 𝑘 = 4 to kurtosis):

$$\mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^k\right] = \int \left(x - \mathbb{E}[X]\right)^k p_X(x)\,dx$$


Expectation
 For the discrete case:

$$\mathbb{E}[f] = \sum_x p(x) f(x)$$

 For the continuous case:

$$\mathbb{E}[f] = \int p(x) f(x)\,dx$$

 Conditional expectation:

$$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y) f(x), \qquad \mathbb{E}_x[f \mid y] = \int p(x \mid y) f(x)\,dx$$
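A small sketch of the discrete and conditional expectations above, using a made-up joint table p(x, y) purely for illustration:

```python
import numpy as np

# Hypothetical joint table p(x, y): rows are x in {0, 1, 2}, columns are y in {0, 1}; entries sum to 1
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.15, 0.10]])
x_vals = np.array([0.0, 1.0, 2.0])
f = lambda x: x**2                                   # function whose expectation we take

p_x = p_xy.sum(axis=1)                               # marginal p(x)
E_f = np.sum(p_x * f(x_vals))                        # E[f] = sum_x p(x) f(x)

p_y = p_xy.sum(axis=0)                               # marginal p(y)
p_x_given_y0 = p_xy[:, 0] / p_y[0]                   # p(x | y = 0)
E_f_given_y0 = np.sum(p_x_given_y0 * f(x_vals))      # E_x[f | y = 0] = sum_x p(x|y) f(x)

print(E_f, E_f_given_y0)
```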



Expectation
 The expectation of a random variable is not necessarily the value that we should expect a realization to have. Let Ω = [−1,1] with the probability 𝑃 being uniform:

$$P(I) = \int_I \frac{1}{2}\,dx = \frac{|I|}{2}, \qquad I \subseteq [-1,1]$$

 Consider the following two random variables:

$$X_1 : [-1,1] \to \mathbb{R}, \quad X_1(\omega) = 1, \qquad\qquad X_2 : [-1,1] \to \mathbb{R}, \quad X_2(\omega) = \begin{cases} 2 & \omega \ge 0 \\ 0 & \omega < 0 \end{cases}$$

 We can see that

$$\mathbb{E}[X_1] = \int_{-1}^{1} X_1(\omega)\,\frac{1}{2}\,d\omega = \int_{-1}^{1} 1 \cdot \frac{1}{2}\,d\omega = 1, \qquad \mathbb{E}[X_2] = \int_{-1}^{1} X_2(\omega)\,\frac{1}{2}\,d\omega = \int_{0}^{1} 2 \cdot \frac{1}{2}\,d\omega = 1,$$

although $X_2(\omega) \ne 1 \;\; \forall\, \omega$.
Uniform Random Variable
 Consider the Uniform random variable 𝒰(𝑥 | 𝑎, 𝑏):

$$\mathcal{U}(x \mid a,b) = \frac{1}{b-a}\,\mathbb{I}(a \le x \le b)$$

 What is the CDF of a uniform random variable 𝒰(𝑥 | 0,1)?

$$F_U(x) = \begin{cases} x & \text{for } 0 \le x \le 1, \\ 1 & \text{for } x > 1, \\ 0 & \text{otherwise.} \end{cases}$$

 You can show that the mean, 2nd moment and variance of 𝒰(𝑥 | 𝑎, 𝑏) are:

$$\mathbb{E}[x \mid a,b] = \frac{a+b}{2}, \qquad \mathbb{E}[x^2 \mid a,b] = \frac{a^2 + ab + b^2}{3}, \qquad \mathrm{var}[x \mid a,b] = \frac{(b-a)^2}{12}$$

 Note that it is possible for 𝑝(𝑥) > 1, but the density still needs to integrate to 1. For example,

$$\mathcal{U}(x \mid 0, 1/2) = 2\,\mathbb{I}\!\left(x \in \left[0, \tfrac{1}{2}\right]\right)$$
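A minimal NumPy/SciPy sketch checking the uniform results above numerically (the endpoints a, b are arbitrary illustrative choices):

```python
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 5.0                                            # arbitrary illustrative endpoints
pdf = lambda x: 1.0 / (b - a) if a <= x <= b else 0.0      # U(x | a, b)

mean, _ = quad(lambda x: x * pdf(x), a, b)                 # (a + b)/2 = 3.5
m2, _ = quad(lambda x: x**2 * pdf(x), a, b)                # (a^2 + a*b + b^2)/3 = 13
var = m2 - mean**2                                         # (b - a)^2 / 12 = 0.75

# A density may exceed 1 and still integrate to 1: U(x | 0, 1/2) = 2 on [0, 1/2]
total, _ = quad(lambda x: 2.0, 0.0, 0.5)
print(mean, m2, var, total)                                # 3.5, 13.0, 0.75, 1.0
```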
The Gaussian Distribution
 A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝜇, 𝜎²), if:

$$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx$$

 The probability density is:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$

 To show that this density is normalized, note the following trick:

$$\text{Let } I = \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx, \text{ then: } I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(y-\mu)^2\right) \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx\,dy$$

$$\text{Set } r^2 = (x-\mu)^2 + (y-\mu)^2, \text{ then: } I^2 = \int_0^{2\pi}\!\!\int_0^{\infty} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\,dr\,d\theta = 2\pi \int_0^{\infty} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\,dr$$

$$\text{Thus (with } u = r^2\text{): } I^2 = \pi \int_0^{\infty} \exp\left(-\frac{u}{2\sigma^2}\right) du = \pi \cdot 2\sigma^2 = 2\pi\sigma^2, \qquad \text{so } I = \sqrt{2\pi\sigma^2}$$
The Gaussian Distribution
 A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝜇, 𝜎²), if:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$

 The following can be shown easily with direct integration:

$$\mathbb{E}[X] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) x\,dx = \mu,$$

$$\mathbb{E}[X^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) x^2\,dx = \mu^2 + \sigma^2, \qquad \mathrm{var}[X] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2 = \sigma^2$$

 The following integrals are useful in these derivations:

$$\int_{-\infty}^{\infty} \exp(-u^2)\,du = \sqrt{\pi}, \qquad \int_{-\infty}^{\infty} u \exp(-u^2)\,du = 0, \qquad \int_{-\infty}^{\infty} u^2 \exp(-u^2)\,du = \frac{\sqrt{\pi}}{2}$$

 We often work with the precision of a Gaussian, 𝜆 = 1/𝜎². The higher 𝜆 is, the narrower the distribution.
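These moment identities are easy to confirm numerically; a minimal SciPy sketch with arbitrarily chosen 𝜇 and 𝜎:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.3, 0.7                                       # arbitrary illustrative parameters
N = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

Z, _ = quad(N, -np.inf, np.inf)                            # normalization: should be 1
mean, _ = quad(lambda x: x * N(x), -np.inf, np.inf)        # should equal mu
m2, _ = quad(lambda x: x**2 * N(x), -np.inf, np.inf)       # should equal mu^2 + sigma^2
print(Z, mean, m2 - mean**2)                               # ~1.0, ~1.3, ~0.49 (= sigma^2)
```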
CDF of a Gaussian
 Plot of the standard normal 𝒩(0,1): PDF and CDF. Run the MatLab function gaussPlotDemo from Kevin Murphy's PMTK.

[Figure: PDF 𝒩(x; 0,1) and CDF Φ(x; 0,1) of the standard normal, plotted for x ∈ [−3, 3]]

$$F(x; \mu, \sigma^2) = \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\,dz$$

$$F(x; \mu, \sigma^2) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(z/\sqrt{2}\right)\right], \quad z = (x-\mu)/\sigma, \qquad \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt$$
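The erf expression for the Gaussian CDF can be checked against a library implementation; gaussPlotDemo is the MATLAB/PMTK demo, so the following is only an analogous Python sketch (𝜇, 𝜎 are arbitrary illustrative values):

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

mu, sigma = 2.0, 1.5                                       # arbitrary illustrative parameters
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

z = (x - mu) / sigma
F_erf = 0.5 * (1.0 + erf(z / np.sqrt(2.0)))                # F(x; mu, sigma^2) = (1/2)[1 + erf(z/sqrt(2))]
F_lib = norm.cdf(x, loc=mu, scale=sigma)                   # library CDF for comparison

print(np.max(np.abs(F_erf - F_lib)))                       # ~1e-16: the two expressions agree
```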


CDF and the Standard Normal
 Assume 𝑋 ~ 𝒩(0,1). Define Φ(𝑥) = 𝑃(𝑋 < 𝑥), the cumulative distribution function of 𝑋:

$$\Phi(x) = \int_{-\infty}^{x} p(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{z^2}{2}\right) dz$$

 Assume 𝑋 ~ 𝒩(𝜇, 𝜎²). Then

$$F(x) = P(X < x \mid \mu, \sigma^2) = \Phi\!\left(\frac{x-\mu}{\sigma}\right).$$



Univariate Gaussian

$$\mathbb{E}[X] = \mu, \qquad \mathrm{var}[X] = \sigma^2$$

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

 Shorthand: We say 𝑋 ~ 𝒩(𝜇, 𝜎²) to mean "𝑋 is distributed as a Gaussian with parameters 𝜇 and 𝜎²".



Univariate Gaussian
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

 Representation of symmetric phenomena without long tails.
 Inappropriate for skewness, fat tails, multimodality, etc.
 The popularity of the Gaussian is related to facts such as:
 Completely defined in terms of the mean and the variance
 The Central Limit Theorem shows that the sum of i.i.d. random variables has approximately a Gaussian distribution, making it an appropriate choice for modeling noise (limit of additive small effects)
 The Gaussian distribution makes the least assumptions (maximum entropy) of all distributions with given mean and variance



The Normal Model
 The normal (or Gaussian) distribution 𝑋 ~ 𝒩(𝜇, 𝜎²),

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right),$$

is one of the most studied and one of the most used distributions.

 Sample 𝑥₁, . . . , 𝑥ₙ from a 𝒩(𝜇, 𝜎²) normal distribution. In a typical inference problem, we are interested in:
 Estimation of 𝜇, 𝜎
 Confidence regions on 𝜇, 𝜎
 Test on (𝜇, 𝜎) and comparison with other samples

[Figure: standard normal density on x ∈ [−4, 4]]



Datasets: Normal Data
 Normaldata: Relative changes in reported larcenies between 1991 and 1995 (relative to 1991) for the 90 most populous US counties (Source: FBI).

[Figure: histogram of the normal data with a fitted Gaussian density; Matlab implementation]

From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2 (available online)
Datasets: CMBData
 CMBdata: Spectral representation of the cosmological microwave background (CMB), i.e. electromagnetic radiation from photons dating back to 300,000 years after the Big Bang, expressed as the difference in apparent temperature from the mean temperature.

 MLE-based estimates of 𝜇 and 𝜎² are used in constructing the Gaussian shown with the solid line. Maximum Likelihood Estimators (MLE) will be discussed in a follow-up lecture.

[Figure: histogram of the CMBdata with the fitted Gaussian (normal estimation); Matlab implementation]

From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2 (available online)
Quantiles
 Recall that the probability density function 𝑝_𝑋(·) of a random variable 𝑋 is defined as the derivative of the cumulative distribution function, so that

$$F(x_0) = \int_{-\infty}^{x_0} p_X(x)\,dx$$

 Note that $P_X(\mathbb{R}) = \int_{-\infty}^{\infty} p_X(x)\,dx = 1$.

 The value 𝑦(𝛼) such that 𝐹(𝑦(𝛼)) = 𝛼 is called the 𝛼-quantile of the distribution with CDF 𝐹. The median is of course the value with 𝐹(𝑦(0.5)) = 0.5.

 One can define tail area probabilities. The shaded regions each contain 𝛼/2 of the probability mass. For 𝒩(0,1), the leftmost cutoff point is Φ⁻¹(𝛼/2), where Φ is the CDF of 𝒩(0,1).

 If 𝛼 = 0.05, the central interval is 95%, the left cutoff is −1.96 and the right is 1.96. Figure generated by quantileDemo from PMTK.

 For 𝒩(𝜇, 𝜎²), the 95% interval becomes (𝜇 − 1.96𝜎, 𝜇 + 1.96𝜎), often approximated as (𝜇 ± 2𝜎).

[Figure: standard normal density with the central 1−𝛼 region between Φ⁻¹(𝛼/2) and Φ⁻¹(1−𝛼/2); each shaded tail has mass 𝛼/2]
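The ±1.96 cutoffs and the central interval come directly from the inverse CDF; quantileDemo is the MATLAB/PMTK demo, and the following is only an analogous SciPy sketch with illustrative 𝜇, 𝜎:

```python
from scipy.stats import norm

alpha = 0.05
left = norm.ppf(alpha / 2)                   # Phi^{-1}(alpha/2)   ~ -1.96
right = norm.ppf(1 - alpha / 2)              # Phi^{-1}(1-alpha/2) ~ +1.96

mu, sigma = 10.0, 2.0                        # arbitrary illustrative parameters
interval = (mu + left * sigma, mu + right * sigma)   # central 95% interval for N(mu, sigma^2)
print(left, right, interval)
```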
Binary Variables
 Consider a coin flipping experiment with heads = 1 and tails = 0. With 𝜇 ∈ [0,1],

$$p(x=1 \mid \mu) = \mu, \qquad p(x=0 \mid \mu) = 1-\mu$$

 This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x \mid \mu) = \mu^{x} (1-\mu)^{1-x}$$

 Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x \mid \mu) = \mu^{\mathbb{I}(x=1)} (1-\mu)^{\mathbb{I}(x=0)}$$



Bernoulli Distribution
 Recall that in general

$$\mathbb{E}[f] = \sum_x p(x) f(x), \qquad \mathbb{E}[f] = \int p(x) f(x)\,dx, \qquad \mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$

 For the Bernoulli distribution $\mathrm{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1-\mu)$$

$$\mathrm{H}[x] = -\sum_{x \in \{0,1\}} p(x \mid \mu) \ln p(x \mid \mu) = -\mu \ln \mu - (1-\mu) \ln(1-\mu)$$

 Here H[𝑥] is the "entropy of the distribution".



Likelihood Function for Bernoulli Distribution
 Consider the i.i.d. data set 𝒟 = {𝑥₁, 𝑥₂, ..., 𝑥_𝑁} in which we have 𝑚 heads (𝑥 = 1) and 𝑁 − 𝑚 tails (𝑥 = 0).

 The likelihood function takes the form:

$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n} = \mu^{m} (1-\mu)^{N-m}$$



Binomial Distribution
 Consider the discrete random variable 𝑋 ∈ {0,1,2, … , 𝑁}.

 We define the Binomial distribution as follows:

$$\mathrm{Bin}(X = m \mid N, \mu) = \binom{N}{m} \mu^{m} (1-\mu)^{N-m}$$

 In our coin flipping experiment, it gives the probability of getting 𝑚 heads in 𝑁 flips, with 𝜇 the probability of getting heads in a single flip.

[Figure: binomial distribution Bin(𝑁 = 10, 𝜇 = 0.25); Matlab Code]

 It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as 𝑁 → ∞ with 𝑁𝜇 → 𝜆 is the Poisson(𝜆) distribution.
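The Poisson limit quoted above is easy to see numerically by holding 𝜆 = 𝑁𝜇 fixed while 𝑁 grows; a minimal SciPy sketch:

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0                                    # lambda, kept fixed as N*mu
m = np.arange(0, 15)

for N in (10, 100, 1000, 10000):
    mu = lam / N                             # so that N*mu = lambda
    gap = np.max(np.abs(binom.pmf(m, N, mu) - poisson.pmf(m, lam)))
    print(N, gap)                            # the gap shrinks as N grows
```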



Binomial Distribution
 The Binomial distribution for 𝑁 = 10 and 𝜇 ∈ {0.25, 0.9} is shown below, using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: Bin(𝑁 = 10, 𝜇) histograms for 𝜇 = 0.25 and 𝜇 = 0.9]



Mean, Variance of the Binomial Distribution
 For independent random variables, the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.

 Because 𝑚 = 𝑥₁ + . . . + 𝑥_𝑁, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\, \mathrm{Bin}(m \mid N, \mu) = \mathbb{E}[x_1 + ... + x_N] = N\mu$$

$$\mathrm{var}[m] = \sum_{m=0}^{N} \left(m - \mathbb{E}[m]\right)^2 \mathrm{Bin}(m \mid N, \mu) = \mathrm{var}[x_1 + ... + x_N] = N\mu(1-\mu)$$

 One can also compute $\mathbb{E}[m]$ and $\mathbb{E}[m^2]$ by differentiating (once and twice) the identity

$$\sum_{m=0}^{N} \binom{N}{m} \mu^{m} (1-\mu)^{N-m} = 1$$

with respect to 𝜇. Try it!



Binomial Distribution: Normalization
To show that the Binomial is correctly normalized, we use the following:

 With direct substitution one can show (Pascal's rule):

$$\binom{N}{n} + \binom{N}{n-1} = \binom{N+1}{n} \qquad (*)$$

 Binomial theorem:

$$(1+x)^N = \sum_{m=0}^{N} \binom{N}{m} x^m \qquad (**)$$

 This theorem is proved by induction using (*) and noticing:

$$(1+x)^{N+1} = (1+x) \sum_{m=0}^{N} \binom{N}{m} x^m = \sum_{m=0}^{N} \binom{N}{m} x^m + \sum_{m=0}^{N} \binom{N}{m} x^{m+1} = 1 + \sum_{m=1}^{N} \binom{N}{m} x^m + \sum_{m=1}^{N+1} \binom{N}{m-1} x^m$$

$$\overset{(*)}{=} 1 + \sum_{m=1}^{N} \left[\binom{N}{m} + \binom{N}{m-1}\right] x^m + x^{N+1} = \sum_{m=0}^{N+1} \binom{N+1}{m} x^m$$

 To finally show normalization using (**):

$$\sum_{m=0}^{N} \binom{N}{m} \mu^m (1-\mu)^{N-m} = (1-\mu)^N \sum_{m=0}^{N} \binom{N}{m} \left(\frac{\mu}{1-\mu}\right)^m = (1-\mu)^N \left(1 + \frac{\mu}{1-\mu}\right)^N = 1$$
Generalization of the Bernoulli Distribution
We are now looking at discrete variables that can take on one of 𝐾 possible mutually exclusive states.

 The variable is represented by a 𝐾-dimensional vector 𝒙 in which one of the elements 𝑥_𝑘 equals 1, and all remaining elements equal 0: 𝒙 = (0,0, … , 1,0, … , 0)ᵀ.

 These vectors satisfy:

$$\sum_{k=1}^{K} x_k = 1$$

 Let the probability of 𝑥_𝑘 = 1 be denoted as 𝜇_𝑘. Then

$$p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k} = \prod_{k=1}^{K} \mu_k^{\mathbb{I}(x_k = 1)}, \qquad \sum_{k=1}^{K} \mu_k = 1, \quad \mu_k \ge 0,$$

where 𝝁 = (𝜇₁, … , 𝜇_𝐾)ᵀ.



Multinoulli/Categorical Distribution
The distribution is already normalized:

$$\sum_{\boldsymbol{x}} p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1$$

 The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x} \mid \boldsymbol{\mu}] = \sum_{\boldsymbol{x}} \boldsymbol{x}\, p(\boldsymbol{x} \mid \boldsymbol{\mu}) = (\mu_1, ..., \mu_K)^T$$

This is similar to the result for the Bernoulli distribution.

 The Multinoulli, also known as the Categorical distribution, is often denoted as (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x} \mid \boldsymbol{\mu}) \equiv \mathrm{Multinoulli}(\boldsymbol{x} \mid \boldsymbol{\mu}) \equiv \mathrm{Mu}(\boldsymbol{x} \mid 1, \boldsymbol{\mu})$$

 The parameter 1 emphasizes that we roll a 𝐾-sided die once (𝑁 = 1) – see next for the multinomial distribution.
Likelihood: Multinoulli Distribution
 Let us consider an i.i.d. data set 𝒟 = {𝒙₁, ..., 𝒙_𝑁}. The likelihood becomes:

$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_{n=1}^{N} x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}, \qquad \text{where } m_k = \sum_{n=1}^{N} x_{nk}$$

is the number of observations with 𝑥_𝑘 = 1.

 The 𝑚_𝑘 are the "sufficient statistics" of the distribution.



MLE Estimate: Multinoulli Distribution
To compute the maximum likelihood (MLE) estimate of 𝝁, we maximize an augmented (Lagrangian) log-likelihood:

$$\ln p(\mathcal{D} \mid \boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K} \mu_k - 1\right) = \sum_{k=1}^{K} m_k \ln \mu_k + \lambda\left(\sum_{k=1}^{K} \mu_k - 1\right)$$

 Setting the derivative with respect to 𝜇_𝑘 equal to zero:

$$\mu_k = -\frac{m_k}{\lambda}$$

 Substitution into the constraint gives

$$\sum_{k=1}^{K} \mu_k = -\frac{\sum_{k=1}^{K} m_k}{\lambda} = 1 \;\Rightarrow\; \lambda = -\sum_{k=1}^{K} m_k = -N \;\Rightarrow\; \mu_k^{ML} = \frac{m_k}{N}$$

As expected, this is the fraction of the 𝑁 observations with 𝑥_𝑘 = 1.
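A minimal NumPy sketch of this MLE on made-up one-of-K data: the estimate is just the normalized count vector 𝑚_𝑘/𝑁:

```python
import numpy as np

# Hypothetical one-of-K (one-hot) observations: K = 3 states, N = 6 samples
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])

m = X.sum(axis=0)                 # sufficient statistics m_k = # of observations with x_k = 1
mu_mle = m / X.shape[0]           # mu_k^ML = m_k / N
print(m, mu_mle)                  # [2 3 1]  [0.333... 0.5 0.166...]
```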
Multinomial Distribution
 We can also consider the joint distribution of 𝑚₁, … , 𝑚_𝐾 in 𝑁 observations, conditioned on the parameters 𝝁 = (𝜇₁, … , 𝜇_𝐾).

 From the expression for the likelihood given earlier,

$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{m_k},$$

the multinomial distribution $\mathrm{Mu}(m_1, ..., m_K \mid N, \boldsymbol{\mu})$ with parameters 𝑁 and 𝝁 takes the form:

$$p(m_1, m_2, ..., m_K \mid N, \mu_1, \mu_2, ..., \mu_K) = \frac{N!}{m_1!\, m_2! \cdots m_K!}\, \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$$
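The multinomial pmf is available in SciPy; a small sketch evaluating it and checking it against the explicit formula above (counts and probabilities are made up for illustration):

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

mu = np.array([0.2, 0.5, 0.3])               # mu_1, ..., mu_K (sums to 1); illustrative values
m = np.array([1, 3, 2])                      # counts m_1, ..., m_K with sum N = 6
N = int(m.sum())

p_scipy = multinomial.pmf(m, n=N, p=mu)
p_formula = factorial(N) / np.prod([factorial(int(k)) for k in m]) * np.prod(mu**m)
print(p_scipy, p_formula)                    # identical values (~0.135)
```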



Example: Biosequence Analysis
 Consider a set of DNA sequences where there are 10 rows (sequences, 𝑁 = 1:10) and 15 columns (locations along the genome).

[Figure: table of the 10 aligned DNA sequences over the alphabet {a, c, g, t}, 15 locations along the genome]

 Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure; e.g., column 7 is all 𝑔's.

 To visualize the data (sequence logo), we plot the letters 𝐴, 𝐶, 𝐺 and 𝑇 with a font size proportional to their empirical probability, and with the most probable letter on top.
Example: Biosequence Analysis
 The empirical probability distribution at location 𝑡 is obtained by normalizing the vector of counts (see MLE estimate):

$$\hat{\boldsymbol{\theta}}_t = \frac{1}{N}\left(\sum_{i=1}^{N} \mathbb{I}(X_{it} = 1), \; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 2), \; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 3), \; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 4)\right)$$

 This distribution is known as a motif.

 We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: sequence logo (bits vs. sequence position 1–15)]

Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.
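seqlogoDemo is a MATLAB/PMTK demo; the counting step itself is straightforward. A minimal NumPy sketch computing the per-location empirical distribution (motif) and the consensus sequence for a few made-up short sequences:

```python
import numpy as np

# Hypothetical aligned DNA sequences (rows); all have the same length (columns = locations)
seqs = ["cgatacg",
        "caatccg",
        "caatccg",
        "cgatacg",
        "tgatacg"]
alphabet = "acgt"

counts = np.zeros((len(seqs[0]), len(alphabet)))
for s in seqs:
    for t, letter in enumerate(s):
        counts[t, alphabet.index(letter)] += 1

theta_hat = counts / len(seqs)                                    # motif: empirical distribution at each location t
consensus = "".join(alphabet[k] for k in counts.argmax(axis=1))   # most probable letter per location
print(theta_hat)
print(consensus)                                                  # "cgatacg" for this toy alignment
```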


Summary of Discrete Distributions
 The multinomial and related discrete distributions are summarized in the table below, from Kevin Murphy's textbook:

Name          𝐾    𝑛
Multinomial   –    –
Multinoulli   –    1
Binomial      1    –
Bernoulli     1    1

 𝑛 = 1 means one roll of the die; 𝑛 = – means 𝑁 rolls of the die.

 𝐾 = 1 means binary variables; 𝐾 = – means a 1-of-𝐾 encoding.



The Poisson Distribution
 We say that 𝑋 ∈ {0,1,2,3, …} has a Poisson distribution with parameter 𝜆 > 0, written 𝑋 ~ Poi(𝜆), if its pmf is

$$\mathrm{Poi}(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$$

 This is a model for counts of rare events.

[Figure: Poisson pmfs for 𝜆 = 1 and 𝜆 = 10, 𝑥 = 0, …, 30]

Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.


The Empirical Distribution
 Given data 𝒟 = {𝑥₁, … , 𝑥_𝑁}, 𝑥ᵢ ~ 𝑝(𝑥), define the empirical distribution as:

$$p_{emp}(A) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(A), \qquad \text{Dirac measure: } \delta_{x_i}(A) = \begin{cases} 1 & \text{if } x_i \in A \\ 0 & \text{if } x_i \notin A \end{cases}$$

 We can also associate weights with each sample (generalize):

$$p_{emp}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(x) \;\longrightarrow\; p_{emp}(x) = \sum_{i=1}^{N} w_i\, \delta_{x_i}(x), \qquad 0 \le w_i \le 1, \quad \sum_{i=1}^{N} w_i = 1$$

 This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

 Note that the "sample mean of 𝑓(𝑥)" is the expectation of 𝑓(𝑥) under the empirical distribution:

$$\mathbb{E}_{p_{emp}(x)}[f(x)] = \int f(x)\, \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(x)\,dx = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
