
Introduction to

Probability and Statistics


(Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 27, 2018

Contents
• The binomial and Bernoulli distributions
• The multinomial and multinoulli distributions
• The Poisson distribution
• Student's T
• Laplace distribution
• Gamma distribution
• Beta distribution
• Pareto distribution
• Introduction to Covariance and Correlation

References
• Following closely Chris Bishop's PRML book, Chapter 2.
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2.
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

Binary Variables
• Consider a coin-flipping experiment with heads = 1 and tails = 0. With $\mu \in [0,1]$,

$$p(x=1\,|\,\mu) = \mu, \qquad p(x=0\,|\,\mu) = 1-\mu$$

• This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x\,|\,\mu) = \mu^{x}(1-\mu)^{1-x}$$

• Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x\,|\,\mu) = \mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$$

Bernoulli Distribution
• Recall that in general

$$\mathbb{E}[f] = \sum_{x} p(x)\,f(x), \qquad \mathbb{E}[f] = \int p(x)\,f(x)\,dx$$

$$\mathrm{var}[f] = \mathbb{E}\big[f(x)^{2}\big] - \mathbb{E}\big[f(x)\big]^{2}$$

• For the Bernoulli distribution $\mathrm{Bern}(x\,|\,\mu) = \mu^{x}(1-\mu)^{1-x}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1-\mu)$$

$$\mathbb{H}[x] = -\sum_{x\in\{0,1\}} p(x\,|\,\mu)\ln p(x\,|\,\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$

• Here $\mathbb{H}[x]$ is the "entropy of the distribution".


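• As a quick numerical sanity check, a minimal MATLAB sketch (added for illustration; $\mu$ and the sample size are arbitrary choices) compares these closed-form moments against Monte Carlo estimates:

% Minimal sketch: check E[x], var[x] and H[x] for Bern(x|mu) by simulation.
mu = 0.3;
x  = double(rand(1, 1e6) < mu);            % 10^6 Bernoulli(mu) draws
fprintf('mean:    %.4f (exact %.4f)\n', mean(x), mu);
fprintf('var:     %.4f (exact %.4f)\n', var(x), mu*(1-mu));
H = -mu*log(mu) - (1-mu)*log(1-mu);        % entropy in nats
fprintf('entropy: %.4f nats\n', H);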
Likelihood Function for Bernoulli Distribution
• Consider the data set

$$\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$$

in which we have $m$ heads ($x=1$) and $N-m$ tails ($x=0$).

• The likelihood function takes the form:

$$p(\mathcal{D}\,|\,\mu) = \prod_{n=1}^{N} p(x_n\,|\,\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$

Binomial Distribution
• Consider the discrete random variable $X \in \{0, 1, 2, \ldots, N\}$.

• We define the Binomial distribution as follows:

$$\mathrm{Bin}(X=m\,|\,N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$$

• In our coin-flipping experiment, it gives the probability of getting $m$ heads in $N$ flips, with $\mu$ being the probability of getting heads in one flip.

[Figure: the pmf of $\mathrm{Bin}(N=10, \mu=0.25)$ over $m = 0, \ldots, 10$; see the accompanying Matlab code.]

• It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N \to \infty$, $N\mu = \lambda$, is the $\mathrm{Poi}(\lambda)$ distribution.
Binomial Distribution
• The Binomial distribution $\mathrm{Bin}(N, \mu)$ for $N = 10$ and $\mu = 0.25, 0.9$ is shown below, using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: two pmf plots over $m = 0, \ldots, 10$; left panel $\mu = 0.25$, right panel $\mu = 0.9$.]

Mean, Variance of the Binomial Distribution
• For independent random variables, the mean of a sum is the sum of the means, and the variance of a sum is the sum of the variances.

• Because $m = x_1 + \ldots + x_N$, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m\,|\,N,\mu) = \mathbb{E}[x_1 + \ldots + x_N] = N\mu$$

$$\mathrm{var}[m] = \sum_{m=0}^{N} \big(m - \mathbb{E}[m]\big)^{2}\,\mathrm{Bin}(m\,|\,N,\mu) = \mathrm{var}[x_1 + \ldots + x_N] = N\mu(1-\mu)$$

• One can also compute $\mathbb{E}[m]$, $\mathbb{E}[m^2]$ by differentiating (twice) the identity $\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = 1$ with respect to $\mu$. Try it!

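• As a sketch of the suggested exercise, a single differentiation already yields the mean:

$$0 = \frac{d}{d\mu}\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = \sum_{m=0}^{N}\binom{N}{m}\mu^{m-1}(1-\mu)^{N-m-1}\big[m(1-\mu) - (N-m)\mu\big]$$

Multiplying through by $\mu(1-\mu)$ and noting that $m(1-\mu) - (N-m)\mu = m - N\mu$:

$$0 = \sum_{m=0}^{N}(m - N\mu)\,\mathrm{Bin}(m\,|\,N,\mu) = \mathbb{E}[m] - N\mu \quad\Longrightarrow\quad \mathbb{E}[m] = N\mu$$

A second differentiation gives $\mathbb{E}[m^2]$, and hence $\mathrm{var}[m] = N\mu(1-\mu)$, in the same way.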
Binomial Distribution: Normalization
To show that the Binomial is correctly normalized, we use the following identities:

• Can be shown with direct substitution:

$$\binom{N}{n} + \binom{N}{n-1} = \binom{N+1}{n} \qquad (*)$$

• Binomial theorem:

$$(1+x)^{N} = \sum_{m=0}^{N}\binom{N}{m}x^{m} \qquad (**)$$

This theorem is proved by induction using $(*)$ and noticing:

$$(1+x)^{N+1} = (1+x)\sum_{m=0}^{N}\binom{N}{m}x^{m} = \sum_{m=0}^{N}\binom{N}{m}x^{m} + \sum_{m=0}^{N}\binom{N}{m}x^{m+1} = 1 + \sum_{m=1}^{N}\binom{N}{m}x^{m} + \sum_{m=1}^{N+1}\binom{N}{m-1}x^{m}$$

$$\overset{(*)}{=} 1 + \sum_{m=1}^{N}\left[\binom{N}{m} + \binom{N}{m-1}\right]x^{m} + x^{N+1} = \sum_{m=0}^{N+1}\binom{N+1}{m}x^{m}$$

• To finally show normalization using $(**)$:

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = (1-\mu)^{N}\sum_{m=0}^{N}\binom{N}{m}\left(\frac{\mu}{1-\mu}\right)^{m} = (1-\mu)^{N}\left(1 + \frac{\mu}{1-\mu}\right)^{N} = 1$$
Generalization of the Bernoulli Distribution
• We are now looking at discrete variables that can take on one of $K$ possible mutually exclusive states.

• The variable is represented by a $K$-dimensional vector $\boldsymbol{x}$ in which one of the elements $x_k$ equals 1 and all remaining elements equal 0: $\boldsymbol{x} = (0, 0, \ldots, 1, 0, \ldots, 0)^{T}$.

• These vectors satisfy:

$$\sum_{k=1}^{K} x_k = 1$$

• Let the probability of $x_k = 1$ be denoted as $\mu_k$. Then

$$p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k} = \prod_{k=1}^{K}\mu_k^{\mathbb{I}(x_k=1)}, \qquad \sum_{k=1}^{K}\mu_k = 1, \quad \mu_k \geq 0$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{T}$.

Multinoulli/Categorical Distribution
• The distribution is already normalized:

$$\sum_{\boldsymbol{x}} p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \sum_{\boldsymbol{x}}\prod_{k=1}^{K}\mu_k^{x_k} = \sum_{k=1}^{K}\mu_k = 1$$

• The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x}\,|\,\boldsymbol{\mu}] = \sum_{\boldsymbol{x}} \boldsymbol{x}\,p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = (\mu_1, \ldots, \mu_K)^{T}$$

similar to the result for the Bernoulli distribution.

• The Multinoulli, also known as the Categorical distribution, is often denoted as (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \mathrm{Multinoulli}(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \mathrm{Mu}(\boldsymbol{x}\,|\,1, \boldsymbol{\mu})$$

• The parameter 1 emphasizes that we roll a $K$-sided die once ($N = 1$); see next for the multinomial distribution.
Likelihood: Multinoulli Distribution
• Let us consider a data set $\mathcal{D} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$. The likelihood becomes:

$$p(\mathcal{D}\,|\,\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_{n=1}^{N} x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k}, \qquad \text{where } m_k = \sum_{n=1}^{N} x_{nk}$$

is the number of observations of $x_k = 1$.

• $m_k$ is the "sufficient statistic" of the distribution.

MLE Estimate: Multinoulli Distribution
• To compute the maximum likelihood (MLE) estimate of $\boldsymbol{\mu}$, we maximize an augmented log-likelihood:

$$\ln p(\mathcal{D}\,|\,\boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right) = \sum_{k=1}^{K} m_k \ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$$

• Setting the derivative with respect to $\mu_k$ equal to zero: $\mu_k = -\dfrac{m_k}{\lambda}$.

• Substitution into the constraint $\sum_{k=1}^{K}\mu_k = 1$ gives $\lambda = -\sum_{k=1}^{K} m_k$, and thus:

$$\mu_k^{ML} = \frac{m_k}{\sum_{k=1}^{K} m_k} = \frac{m_k}{N}$$

As expected, this is the fraction of the $N$ observations with $x_k = 1$.
Multinomial Distribution
• We can also consider the joint distribution of $m_1, \ldots, m_K$ in $N$ observations, conditioned on the parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)$.

• From the expression for the likelihood given earlier,

$$p(\mathcal{D}\,|\,\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{m_k}$$

the multinomial distribution $\mathrm{Mu}(m_1, \ldots, m_K\,|\,N, \boldsymbol{\mu})$ with parameters $N$ and $\boldsymbol{\mu}$ takes the form:

$$p(m_1, m_2, \ldots, m_K\,|\,N, \mu_1, \mu_2, \ldots, \mu_K) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$$

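• A small MATLAB sketch (illustrative values, not from the slides) evaluates this pmf stably through log-factorials, and cross-checks it against the binomial for $K = 2$:

% Sketch: multinomial pmf via gammaln (numerically stable log-factorials).
mu = [0.2 0.3 0.5]; m = [3 2 5]; N = sum(m);
logp = gammaln(N+1) - sum(gammaln(m+1)) + sum(m .* log(mu));
p = exp(logp)                               % Mu(m | N, mu)
% For K = 2 the multinomial reduces to Bin(m1 | N, mu1):
mu2 = [0.25 0.75]; m2 = [3 7];
p2   = exp(gammaln(11) - sum(gammaln(m2+1)) + sum(m2.*log(mu2)));
pBin = nchoosek(10, 3) * 0.25^3 * 0.75^7;   % identical value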
Example: Biosequence Analysis
• Consider a set of DNA sequences where there are 10 rows (sequences, $N = 1{:}10$) and 15 columns (locations along the genome).

[Figure: the 10 x 15 table of sequences, e.g. "cgatacggggtcgaa", "caatccgagatcgca", ...; rows indexed by sequence, columns by location along the genome.]

• Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure, e.g., column 7 is all $g$'s.

• To visualize the data (sequence logo), we plot the letters $A$, $C$, $G$ and $T$ with a font size proportional to their empirical probability, and with the most probable letter on the top.
Example: Biosequence Analysis
• The empirical probability distribution at location $t$ is obtained by normalizing the vector of counts (see the MLE estimate):

$$\hat{\boldsymbol{\theta}}_t = \frac{1}{N}\left(\sum_{i=1}^{N}\mathbb{I}(X_{it}=1),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=2),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=3),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=4)\right)$$

• This distribution is known as a motif.

• We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: the sequence logo, bits vs. sequence position 1-15. Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.]

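• A minimal MATLAB sketch of the count-normalization step (the 10 x 15 matrix below is hypothetical random data standing in for the sequences in the figure, with $A, C, G, T$ coded as 1..4):

% Sketch: empirical per-location distribution (motif) from sequence data.
X = randi(4, 10, 15);                       % hypothetical data matrix
K = 4; N = size(X, 1); T = size(X, 2);
theta = zeros(K, T);                        % theta(k,t) = empirical p(letter k at t)
for k = 1:K
    theta(k, :) = sum(X == k, 1) / N;
end
[~, consensus] = max(theta, [], 1);         % most probable letter per location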
Summary of Discrete Distributions
• A summary of the multinomial and related discrete distributions is given below, following a table from Kevin Murphy's textbook:

  Distribution    n (rolls of the die)    K (number of states)
  Bernoulli       1                       1 (binary variable)
  Binomial        N                       1 (binary variable)
  Multinoulli     1                       K (1-of-K encoding)
  Multinomial     N                       K (1-of-K encoding)

• $n = 1$ means one roll of the die, $n = N$ means $N$ rolls; $K = 1$ denotes binary variables, general $K$ uses the 1-of-$K$ encoding.

The Poisson Distribution
• We say that $X \in \{0, 1, 2, 3, \ldots\}$ has a Poisson distribution with parameter $\lambda > 0$, if its pmf is

$$X \sim \mathrm{Poi}(\lambda): \qquad \mathrm{Poi}(x\,|\,\lambda) = e^{-\lambda}\frac{\lambda^{x}}{x!}$$

• This is a model for counts of rare events.

[Figure: pmfs of $\mathrm{Poi}(\lambda=1)$ and $\mathrm{Poi}(\lambda=10)$. Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.]

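• A MATLAB sketch (illustrative $\lambda$ and truncation point) evaluates the pmf in log space and confirms that the mean and variance both equal $\lambda$:

% Sketch: Poi(x|lambda) pmf; mean and variance both equal lambda.
lambda = 10; x = 0:100;
pmf = exp(-lambda + x.*log(lambda) - gammaln(x+1));
m = sum(x .* pmf)                           % ~ lambda
v = sum((x - m).^2 .* pmf)                  % ~ lambda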
The Empirical Distribution
• Given data $\mathcal{D} = \{x_1, \ldots, x_N\}$, we define the empirical distribution as:

$$p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A), \qquad \text{Dirac measure:}\ \ \delta_{x_i}(A) = \begin{cases} 1 & \text{if } x_i \in A \\ 0 & \text{if } x_i \notin A \end{cases}$$

• We can also generalize by associating weights with each sample:

$$p_{\mathrm{emp}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x) \ \longrightarrow\ p_{\mathrm{emp}}(x) = \sum_{i=1}^{N} w_i\,\delta_{x_i}(x), \qquad 0 \leq w_i \leq 1, \quad \sum_{i=1}^{N} w_i = 1$$

• This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

• Note that the "sample mean of $f(x)$" is the expectation of $f(x)$ under the empirical distribution:

$$\mathbb{E}[f(x)] = \int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x)\,dx = \frac{1}{N}\sum_{n=1}^{N} f(x_n)$$
Student’s T Distribution

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

• The parameter $\lambda$ is called the precision of the $\mathcal{T}$-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as $\nu \to \infty$).

• The parameter $\nu$ is called the degrees of freedom.

• For the particular case of $\nu = 1$, the $\mathcal{T}$-distribution reduces to the Cauchy distribution.

• In the limit $\nu \to \infty$, the $\mathcal{T}$-distribution $\mathcal{T}(x\,|\,\mu, \lambda, \nu)$ becomes a Gaussian $\mathcal{N}(x\,|\,\mu, \lambda^{-1})$ with mean $\mu$ and precision $\lambda$.

For 𝝊 → ∞, 𝓣(𝒙|𝝁, 𝝀, 𝝊) Becomes a Gaussian

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

• We first write the distribution as follows:

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) \propto \left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2} = \exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\ln\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]\right)$$

• For large $\nu$, we can approximate the log using $\ln(1+\epsilon) = \epsilon + O(\epsilon^{2})$:

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) \simeq \exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\left[\frac{\lambda(x-\mu)^{2}}{\nu} + O(\nu^{-2})\right]\right) \to \exp\left(-\frac{\lambda(x-\mu)^{2}}{2} + O(\nu^{-1})\right)$$

• In the limit $\nu \to \infty$, the $\mathcal{T}$-distribution $\mathcal{T}(x\,|\,\mu, \lambda, \nu)$ is indeed a Gaussian $\mathcal{N}(x\,|\,\mu, \lambda^{-1})$ with mean $\mu$ and precision $\lambda$. The normalization of the $\mathcal{T}$ is valid in this limit as well (so the Gaussian obtained is normalized).

Student’s T Distribution
 1 2  /2 1/2
(  )  lx   
 l 
1/2

p ( x |  , l , )  2 2 1  
     
( )  
2
student distribution
0.4
v=10
0.35 v=1.0
v=0.1

Mean :  ,   1 0.3

  0, l  1
Mode :  0.25
For   , we
 0.2
obtain N (  , l 1 )
Var : ,  2
l   2  0.15

0.1

0.05

0
-5 -4 -3 -2 -1 0 1 2 3 4 5

MatLab Code
Student’s T Vs the Gaussian
• We plot: $\mathcal{N}(x\,|\,0,1)$, $\mathcal{T}(x\,|\,0,1,1)$, and $\mathrm{Lap}(x\,|\,0,1/\sqrt{2})$.

• The mean and variance of the Student's $\mathcal{T}$ are undefined for $\nu = 1$.

• In the log-density plots, the Student's $\mathcal{T}$ is NOT log-concave.

• When $\nu = 1$, the distribution is known as the Cauchy or Lorentz distribution. Due to its heavy tails, the mean does not converge.

• It is recommended to use $\nu = 4$.

[Figure: probability density functions and log probability density functions of the Gaussian, Student, and Laplace distributions. Run the MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK.]

Student’s T Distribution

$$p(x\,|\,\mu, a, b) = \int_0^{\infty}\mathcal{N}(x\,|\,\mu, \tau^{-1})\,\mathrm{Gamma}(\tau\,|\,a, b)\,d\tau = \int_0^{\infty}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left(-\frac{\tau}{2}(x-\mu)^{2}\right)\frac{b^{a}}{\Gamma(a)}\tau^{a-1}e^{-b\tau}\,d\tau$$

• The Student's $\mathcal{T}$ distribution can be seen from the equation above (see the following two slides for the proof) as an infinite mixture of Gaussians, each of them with a different precision (governed by a Gamma distribution).

• The result is a distribution that in general has longer "tails" than a Gaussian.

• This gives the $\mathcal{T}$-distribution robustness, i.e., the $\mathcal{T}$-distribution is much less sensitive than the Gaussian to the presence of outliers.
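• The mixture representation can be verified numerically; a MATLAB sketch (arbitrary test values for $\mu$, $a$, $b$, $x$) integrates the mixture and compares it with the closed form derived in the appendix slides:

% Sketch: Gaussian-Gamma mixture integral vs. the closed-form density.
mu = 0; a = 2; b = 3; x = 1.7;
integrand = @(tau) sqrt(tau/(2*pi)) .* exp(-0.5*tau*(x - mu)^2) ...
                   .* b^a .* tau.^(a-1) .* exp(-b*tau) / gamma(a);
pMix = integral(integrand, 0, Inf);
pT   = gamma(a + 0.5)/gamma(a) * b^a/sqrt(2*pi) ...
       * (b + 0.5*(x - mu)^2)^(-(a + 0.5));
fprintf('mixture %.10f   closed form %.10f\n', pMix, pT);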
Appendix: Student’s T as a Mixture of Gaussians
• If we have a univariate Gaussian $\mathcal{N}(x\,|\,\mu, \tau^{-1})$ together with a prior $\mathrm{Gamma}(\tau\,|\,a, b)$, and we integrate out the precision, we obtain the marginal distribution of $x$:

$$p(x\,|\,\mu, a, b) = \int_0^{\infty}\mathcal{N}(x\,|\,\mu, \tau^{-1})\,\mathrm{Gamma}(\tau\,|\,a, b)\,d\tau = \frac{b^{a}}{\Gamma(a)}\int_0^{\infty}\left(\frac{\tau}{2\pi}\right)^{1/2}\tau^{a-1}\exp\Big(-\tau\underbrace{\Big[b + \tfrac{1}{2}(x-\mu)^{2}\Big]}_{A}\Big)\,d\tau$$

• Introduce the transformation $z = \tau A$, with $A = b + \frac{1}{2}(x-\mu)^{2}$, to simplify as:

$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^{\infty}\tau^{a-1/2}e^{-\tau A}\,d\tau = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}A^{-a-1/2}\int_0^{\infty}z^{a-1/2}e^{-z}\,dz$$

Appendix: Student’s T as a Mixture of Gaussians
$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}A^{-a-1/2}\int_0^{\infty}z^{a-1/2}e^{-z}\,dz = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^{2}}{2}\right]^{-a-1/2}\int_0^{\infty}e^{-z}z^{a-1/2}\,dz$$

• Recalling the definition of the Gamma function, $\Gamma(a) = \int_0^{\infty}e^{-z}z^{a-1}\,dz$:

$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^{2}}{2}\right]^{-a-1/2}\Gamma\!\left(a + \frac{1}{2}\right)$$

• It is common to redefine the parameters of this distribution as $\nu = 2a$, $\lambda = \dfrac{a}{b}$:

$$p(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$
Robustness of Student’s T Distribution
• The robustness of the $\mathcal{T}$-distribution is illustrated here by comparing the maximum likelihood solutions for a Gaussian and a $\mathcal{T}$-distribution (30 data points from the Gaussian are used).

• The effect of a small number of outliers (figure on the right) is less significant for the $\mathcal{T}$-distribution than for the Gaussian.

[Figure: left panel, no outliers - the fitted $\mathcal{T}$-distribution and Gaussian nearly overlap; right panel, with outliers - the Gaussian is pulled toward the outliers while the $\mathcal{T}$-distribution is barely affected. See the accompanying MatLab code.]
Robustness of Student’s T Distribution
• The earlier simulation is repeated here with the PMTK toolbox.

[Figure: fitted Gaussian, Student T, and Laplace densities, without (top) and with (bottom) outliers. Run the MatLab function robustDemo from Kevin Murphy's PMTK.]

The Laplace Distribution
• Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\mathrm{Lap}(x\,|\,\mu, b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

• $\mu$ is a location parameter and $b > 0$ is a scale parameter.

$$\text{Mean} = \mu, \qquad \text{Mode} = \mu, \qquad \text{Var} = 2b^{2}$$

• It is robust to outliers (see the earlier demonstration).

• It puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model.

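• A minimal MATLAB sketch (illustrative $\mu$, $b$, and sample size) draws Laplace samples by the inverse-CDF transform and checks the moments above:

% Sketch: sample Lap(mu, b) by inverse CDF; check mean and variance.
mu = 0; b = 1;
u = rand(1, 1e6) - 0.5;                     % uniform on (-1/2, 1/2)
x = mu - b * sign(u) .* log(1 - 2*abs(u));  % inverse CDF of the Laplace
fprintf('mean %.3f (exact %.3f)   var %.3f (exact %.3f)\n', ...
        mean(x), mu, var(x), 2*b^2);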
Beta Distribution
• The $\mathrm{Beta}(\alpha, \beta)$ distribution, with $x \in [0,1]$ and $\alpha, \beta > 0$, is defined as follows:

$$\mathrm{Beta}(x\,|\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

where $B(\alpha,\beta)$ is the normalizing factor.

• The expected value, mode and variance of a Beta random variable $x$ with (hyper-)parameters $\alpha$ and $\beta$:

$$\mathbb{E}[x] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[x] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[x] = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$$

[Figure: Beta densities for $(\alpha,\beta) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)$.]

• For more information visit this link.

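• A MATLAB sketch (illustrative parameters) evaluates the Beta pdf via gammaln and numerically checks the normalization and mean:

% Sketch: Beta(x|alpha,beta) pdf; check normalization and mean.
alpha = 2; beta_ = 3;            % beta_ avoids shadowing MATLAB's beta()
logB  = gammaln(alpha) + gammaln(beta_) - gammaln(alpha + beta_);
pdf   = @(x) exp((alpha-1)*log(x) + (beta_-1)*log(1-x) - logB);
Z  = integral(pdf, 0, 1)                    % ~ 1
mu = integral(@(x) x .* pdf(x), 0, 1)       % ~ alpha/(alpha+beta) = 0.4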
Beta Distribution
• If $\alpha = \beta = 1$, we obtain a uniform distribution.

• If $\alpha$ and $\beta$ are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

• If $\alpha$ and $\beta$ are both greater than 1, the distribution is unimodal.

[Figure: Beta densities for $(\alpha,\beta) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)$, annotated by the three regimes above. Run betaPlotDemo from PMTK.]

Beta Distribution
[Figure: four panels showing the pdfs of Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3) and Beta(8, 4) on $x \in [0,1]$. See the Matlab implementation.]

Gamma Function
(   )  1
( x)  x (1  x)  1
( )(  )

 The gamma function extends the factorial to real numbers:



( x)   u x 1e u du
0

 With integration by parts:


( x  1)  x( x)
 For integer 𝑛:
(n)  (n  1)!
 For more information visit this link.

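• A short MATLAB sketch (arbitrary test values) verifies the three identities above numerically:

% Sketch: gamma-function identities checked numerically.
n = 6;
gamma(n) - factorial(n - 1)                         % ~ 0: Gamma(n) = (n-1)!
x = 2.37;
gamma(x + 1) - x*gamma(x)                           % ~ 0: recurrence
integral(@(u) u.^(x-1).*exp(-u), 0, Inf) - gamma(x) % ~ 0: the definition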
Beta Distribution: Normalization
• Showing that the $\mathrm{Beta}(\alpha, \beta)$ distribution is normalized correctly is a bit tricky. We need to prove that:

$$\Gamma(\alpha)\Gamma(\beta) = \Gamma(\alpha+\beta)\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$

Follow the steps:
• (a) change the variable $y$ below via $y = t - x$;
• (b) change the order of integration over the triangular region $\{(x,t): 0 \leq x \leq t\}$;
• and (c) change $x$ to $\mu$ via $x = t\mu$:

$$\Gamma(\alpha)\Gamma(\beta) = \left(\int_0^{\infty}x^{\alpha-1}e^{-x}\,dx\right)\left(\int_0^{\infty}y^{\beta-1}e^{-y}\,dy\right) \overset{(a)}{=} \int_0^{\infty}x^{\alpha-1}\left(\int_x^{\infty}e^{-t}(t-x)^{\beta-1}\,dt\right)dx$$

$$\overset{(b)}{=} \int_0^{\infty}\left(\int_0^{t}x^{\alpha-1}(t-x)^{\beta-1}e^{-t}\,dx\right)dt \overset{(c)}{=} \int_0^{\infty}t^{\alpha-1}t^{\beta-1}e^{-t}\,t\,dt\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$

$$= \Gamma(\alpha+\beta)\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
Gamma Distribution: Rate Parametrization
• The Gamma distribution is frequently a model for waiting times. For important properties see here.

• It is more often parameterized in terms of a shape parameter $a$ and an inverse scale parameter $b = 1/\theta$, called a rate parameter:

$$p(x\,|\,a, b) = \frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad x \in (0, \infty), \qquad \Gamma(a) = \int_0^{\infty}u^{a-1}e^{-u}\,du$$

• The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \begin{cases}\dfrac{a-1}{b} & \text{for } a \geq 1 \\ 0 & \text{otherwise}\end{cases}, \qquad \mathrm{var}[x] = \frac{a}{b^{2}}$$

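• A MATLAB sketch (illustrative $a$, $b$) checks the mean and variance of the rate-parametrized Gamma by numerical integration:

% Sketch: moments of the rate-parametrized Gamma via numerical integration.
a = 2.5; b = 1.3;
pdf = @(x) b^a / gamma(a) * x.^(a-1) .* exp(-b*x);
m = integral(@(x) x .* pdf(x), 0, Inf)              % ~ a/b
v = integral(@(x) (x - m).^2 .* pdf(x), 0, Inf)     % ~ a/b^2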
Gamma Distribution
b a a 1
 Plots of Gamma ( X | a, b)  x exp( xb), b  1
(a )

 As we increase the rate 𝑏, the distribution squeezes


leftwards and upwards. For 𝑎 < 1, the mode is at zero.
Gamma distributions
0.9
a=1.0,b=1.0
0.8
a=1.5,b=1.0
0.7 a=2.0,b=1.0

0.6

0.5

0.4

0.3

0.2

0.1

1 2 3 4 5 6 7

Run gammaPlotDemo
from PMTK

Gamma Distribution
• An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: two panels showing the fit by the method of moments (MoM) and by MLE. Run the MatLab function gammaRainfallDemo from PMTK.]

Exponential Distribution
• This is defined as

$$\mathrm{Expon}(x\,|\,\lambda) = \mathrm{Gamma}(x\,|\,1, \lambda) = \lambda e^{-\lambda x}, \qquad x \in [0, \infty)$$

• Here $\lambda$ is the rate parameter.

• This distribution describes the times between events in a Poisson process, i.e., a process in which events occur continuously and independently at a constant average rate $\lambda$.

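• A MATLAB sketch (illustrative rate and sample size) simulates such a process: exponential inter-event times are drawn by inverse CDF, and the resulting counts in unit windows behave like $\mathrm{Poi}(\lambda)$ draws:

% Sketch: a rate-lambda Poisson process from Expon(lambda) gaps.
lambda = 2; n = 1e6;
gaps = -log(rand(1, n)) / lambda;          % inverse-CDF exponential draws
fprintf('mean gap %.4f (exact %.4f)\n', mean(gaps), 1/lambda);
t = cumsum(gaps);                          % event times
counts = histcounts(t, 0:floor(t(end)));   % counts in unit windows
fprintf('count mean %.3f, var %.3f (both ~ lambda)\n', mean(counts), var(counts));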
Chi-Squared Distribution
• This is defined as

$$\chi^{2}_{\nu}(x) = \mathrm{Gamma}\!\left(x\,\Big|\,\frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\nu/2-1}e^{-x/2}, \qquad x \in [0, \infty)$$

• This is the distribution of the sum of squared Gaussian random variables.

• More precisely,

$$\text{Let } Z_i \sim \mathcal{N}(0, 1) \text{ and } S = \sum_{i=1}^{\nu} Z_i^{2}, \text{ then } S \sim \chi^{2}_{\nu}$$

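• A one-line MATLAB sketch (illustrative $\nu$) confirms this by simulation, using the facts that a $\chi^{2}_{\nu}$ variable has mean $\nu$ and variance $2\nu$:

% Sketch: sum of nu squared standard normals is chi-squared with nu dof.
nu = 5;
S = sum(randn(nu, 1e6).^2, 1);
fprintf('mean %.3f (exact %d)   var %.3f (exact %d)\n', mean(S), nu, var(S), 2*nu);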
Inverse Gamma Distribution
• This is defined as follows:

$$\text{If } X \sim \mathrm{Gamma}(X\,|\,a, b)\ \Rightarrow\ X^{-1} \sim \mathrm{InvGamma}(X\,|\,a, b)$$

where:

$$\mathrm{InvGamma}(x\,|\,a, b) = \frac{b^{a}}{\Gamma(a)}\,x^{-(a+1)}\exp(-b/x), \qquad x \in (0, \infty)$$

• $a$ is the shape and $b$ the scale parameter.

• Note that $b$ is a scale parameter since:

$$\mathrm{InvGamma}(x\,|\,a, b) = \frac{1}{b}\,\mathrm{InvGamma}\!\left(\frac{x}{b}\,\Big|\,a, 1\right)$$

• It can be shown that:

$$\text{Mean} = \frac{b}{a-1}\ \ (\text{exists for } a > 1), \qquad \text{Mode} = \frac{b}{a+1}, \qquad \text{var} = \frac{b^{2}}{(a-1)^{2}(a-2)}\ \ (\text{exists for } a > 2)$$

The Pareto Distribution
• Used to model the distribution of quantities that exhibit long (heavy) tails:

$$\mathrm{Pareto}(x\,|\,k, m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x \geq m)$$

• This density asserts that $x$ must be greater than some constant $m$, but not too much greater; $k$ controls what is "too much".

• Examples: modeling the frequency of words vs. their rank (e.g., "the", "of", etc.), or the wealth of people.*

• As $k \to \infty$, the distribution approaches $\delta(x - m)$.

• On a log-log scale, the pdf forms a straight line of the form $\log p(x) = a\log x + c$ for some constants $a$ and $c$ (power law, Zipf's law).

* Basis of the distribution: a high proportion of a population has low income and only a few have very high incomes.

The Pareto Distribution
• Applications: modeling the frequency of words vs. their rank, the distribution of wealth ($k$ = Pareto index), etc.

$$\mathrm{Pareto}(x\,|\,k, m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x \geq m)$$

$$\text{Mean} = \frac{km}{k-1}\ \ (\text{if } k > 1), \qquad \text{Mode} = m, \qquad \text{var} = \frac{m^{2}k}{(k-1)^{2}(k-2)}\ \ (\text{if } k > 2)$$

[Figure: Pareto densities for $(m, k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00)$. Run ParetoPlot from PMTK.]

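• A MATLAB sketch (illustrative $k$, $m$) draws Pareto samples by the inverse-CDF transform $x = m\,u^{-1/k}$ and checks the mean, which exists only for $k > 1$:

% Sketch: Pareto(k, m) sampling by inverse CDF.
k = 3; m = 1;
u = rand(1, 1e6);
x = m * u.^(-1/k);                          % inverse-CDF transform
fprintf('mean %.3f (exact %.3f)\n', mean(x), k*m/(k - 1));
% On a log-log scale the pdf is linear: log p(x) = log(k*m^k) - (k+1)*log(x).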
Covariance
• Consider two random variables $X, Y: \Omega \to \mathbb{R}$.

• The joint probability distribution is defined as:

$$P(X \in A, Y \in B) = P\big(X^{-1}(A)\cap Y^{-1}(B)\big) = \int_A\int_B p(x, y)\,dx\,dy$$

• Two random variables are independent if

$$p(x, y) = p(x)\,p(y)$$

• The covariance of $X$ and $Y$ is defined as:

$$\mathrm{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))\big]$$

• It is straightforward to verify that: $\mathrm{cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$.

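• A MATLAB sketch (the linear model below is an arbitrary illustration with $\mathrm{cov}(X,Y) = 0.6$) verifies the identity on correlated samples:

% Sketch: verify cov(X,Y) = E[XY] - E[X]E[Y].
n = 1e6; X = randn(1, n); Y = 0.6*X + 0.8*randn(1, n);
lhs = mean((X - mean(X)) .* (Y - mean(Y)));   % definition
rhs = mean(X.*Y) - mean(X)*mean(Y);           % identity
fprintf('definition %.4f   identity %.4f   exact 0.6\n', lhs, rhs);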
Correlation, Center Normalized Random Variables
• Consider two random variables $X, Y: \Omega \to \mathbb{R}$.

• The correlation coefficient of $X$ and $Y$ is defined as:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X\,\sigma_Y}$$

where the standard deviations of $X$ and $Y$ are

$$\sigma_X = \sqrt{\mathrm{cov}(X, X)}, \qquad \sigma_Y = \sqrt{\mathrm{cov}(Y, Y)}$$

• The center-normalized random variables are defined as:

$$\tilde{X} = \frac{X - \mathbb{E}[X]}{\sigma_X}, \qquad \tilde{Y} = \frac{Y - \mathbb{E}[Y]}{\sigma_Y}$$

• It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \mathrm{var}[\tilde{X}] = \mathrm{var}[\tilde{Y}] = 1$$

