
Introduction to

Probability and Statistics


(Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 27, 2018

Contents
• The binomial and Bernoulli distributions
• The multinomial and multinoulli distributions
• The Poisson distribution
• Student's T
• Laplace distribution
• Gamma distribution
• Beta distribution
• Pareto distribution
• Introduction to Covariance and Correlation

References
• Following closely Chris Bishop's PRML book, Chapter 2.
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2.
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

Binary Variables
• Consider a coin-flipping experiment with heads = 1 and tails = 0. With $\mu \in [0,1]$,

$$p(x=1\,|\,\mu) = \mu, \qquad p(x=0\,|\,\mu) = 1-\mu$$

• This defines the Bernoulli distribution as follows:

$$\mathrm{Bern}(x\,|\,\mu) = \mu^{x}(1-\mu)^{1-x}$$

• Using the indicator function, we can also write this as:

$$\mathrm{Bern}(x\,|\,\mu) = \mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$$

Bernoulli Distribution
• Recall that in general

$$\mathbb{E}[f] = \sum_{x} p(x)\,f(x), \qquad \mathbb{E}[f] = \int p(x)\,f(x)\,dx$$

$$\mathrm{var}[f] = \mathbb{E}\big[f(x)^{2}\big] - \mathbb{E}\big[f(x)\big]^{2}$$

• For the Bernoulli distribution $\mathrm{Bern}(x\,|\,\mu) = \mu^{x}(1-\mu)^{1-x}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1-\mu)$$

$$\mathbb{H}[x] = -\sum_{x\in\{0,1\}} p(x\,|\,\mu)\ln p(x\,|\,\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$

• Here $\mathbb{H}[x]$ is the "entropy of the distribution".


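• As a quick numerical sanity check, a minimal MATLAB sketch (added for illustration; $\mu$ and the sample size are arbitrary choices) compares these closed-form moments against Monte Carlo estimates:

% Minimal sketch: check E[x], var[x] and H[x] for Bern(x|mu) by simulation.
mu = 0.3;
x  = double(rand(1, 1e6) < mu);            % 10^6 Bernoulli(mu) draws
fprintf('mean:    %.4f (exact %.4f)\n', mean(x), mu);
fprintf('var:     %.4f (exact %.4f)\n', var(x), mu*(1-mu));
H = -mu*log(mu) - (1-mu)*log(1-mu);        % entropy in nats
fprintf('entropy: %.4f nats\n', H);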
Likelihood Function for Bernoulli Distribution
• Consider the data set

$$\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$$

in which we have $m$ heads ($x=1$) and $N-m$ tails ($x=0$).

• The likelihood function takes the form:

$$p(\mathcal{D}\,|\,\mu) = \prod_{n=1}^{N} p(x_n\,|\,\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$

Binomial Distribution
• Consider the discrete random variable $X \in \{0, 1, 2, \ldots, N\}$.

• We define the Binomial distribution as follows:

$$\mathrm{Bin}(X=m\,|\,N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$$

• In our coin-flipping experiment, it gives the probability of getting $m$ heads in $N$ flips, with $\mu$ being the probability of getting heads in one flip.

[Figure: the pmf of $\mathrm{Bin}(N=10, \mu=0.25)$ over $m = 0, \ldots, 10$; see the accompanying Matlab code.]

• It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N \to \infty$, $N\mu = \lambda$, is the $\mathrm{Poi}(\lambda)$ distribution.
Binomial Distribution
• The Binomial distribution $\mathrm{Bin}(N, \mu)$ for $N = 10$ and $\mu = 0.25, 0.9$ is shown below, using the MatLab function binomDistPlot from Kevin Murphy's PMTK.

[Figure: two pmf plots over $m = 0, \ldots, 10$; left panel $\mu = 0.25$, right panel $\mu = 0.9$.]

Mean, Variance of the Binomial Distribution
• For independent random variables, the mean of a sum is the sum of the means, and the variance of a sum is the sum of the variances.

• Because $m = x_1 + \ldots + x_N$, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m\,|\,N,\mu) = \mathbb{E}[x_1 + \ldots + x_N] = N\mu$$

$$\mathrm{var}[m] = \sum_{m=0}^{N} \big(m - \mathbb{E}[m]\big)^{2}\,\mathrm{Bin}(m\,|\,N,\mu) = \mathrm{var}[x_1 + \ldots + x_N] = N\mu(1-\mu)$$

• One can also compute $\mathbb{E}[m]$, $\mathbb{E}[m^2]$ by differentiating (twice) the identity $\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = 1$ with respect to $\mu$. Try it!

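• As a sketch of the suggested exercise, a single differentiation already yields the mean:

$$0 = \frac{d}{d\mu}\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = \sum_{m=0}^{N}\binom{N}{m}\mu^{m-1}(1-\mu)^{N-m-1}\big[m(1-\mu) - (N-m)\mu\big]$$

Multiplying through by $\mu(1-\mu)$ and noting that $m(1-\mu) - (N-m)\mu = m - N\mu$:

$$0 = \sum_{m=0}^{N}(m - N\mu)\,\mathrm{Bin}(m\,|\,N,\mu) = \mathbb{E}[m] - N\mu \quad\Longrightarrow\quad \mathbb{E}[m] = N\mu$$

A second differentiation gives $\mathbb{E}[m^2]$, and hence $\mathrm{var}[m] = N\mu(1-\mu)$, in the same way.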
Binomial Distribution: Normalization
To show that the Binomial is correctly normalized, we use the following identities:

• Can be shown with direct substitution:

$$\binom{N}{n} + \binom{N}{n-1} = \binom{N+1}{n} \qquad (*)$$

• Binomial theorem:

$$(1+x)^{N} = \sum_{m=0}^{N}\binom{N}{m}x^{m} \qquad (**)$$

This theorem is proved by induction using $(*)$ and noticing:

$$(1+x)^{N+1} = (1+x)\sum_{m=0}^{N}\binom{N}{m}x^{m} = \sum_{m=0}^{N}\binom{N}{m}x^{m} + \sum_{m=0}^{N}\binom{N}{m}x^{m+1} = 1 + \sum_{m=1}^{N}\binom{N}{m}x^{m} + \sum_{m=1}^{N+1}\binom{N}{m-1}x^{m}$$

$$\overset{(*)}{=} 1 + \sum_{m=1}^{N}\left[\binom{N}{m} + \binom{N}{m-1}\right]x^{m} + x^{N+1} = \sum_{m=0}^{N+1}\binom{N+1}{m}x^{m}$$

• To finally show normalization using $(**)$:

$$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = (1-\mu)^{N}\sum_{m=0}^{N}\binom{N}{m}\left(\frac{\mu}{1-\mu}\right)^{m} = (1-\mu)^{N}\left(1 + \frac{\mu}{1-\mu}\right)^{N} = 1$$
Generalization of the Bernoulli Distribution
• We are now looking at discrete variables that can take on one of $K$ possible mutually exclusive states.

• The variable is represented by a $K$-dimensional vector $\boldsymbol{x}$ in which one of the elements $x_k$ equals 1 and all remaining elements equal 0: $\boldsymbol{x} = (0, 0, \ldots, 1, 0, \ldots, 0)^{T}$.

• These vectors satisfy:

$$\sum_{k=1}^{K} x_k = 1$$

• Let the probability of $x_k = 1$ be denoted as $\mu_k$. Then

$$p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k} = \prod_{k=1}^{K}\mu_k^{\mathbb{I}(x_k=1)}, \qquad \sum_{k=1}^{K}\mu_k = 1, \quad \mu_k \geq 0$$

where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^{T}$.

Multinoulli/Categorical Distribution
• The distribution is already normalized:

$$\sum_{\boldsymbol{x}} p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \sum_{\boldsymbol{x}}\prod_{k=1}^{K}\mu_k^{x_k} = \sum_{k=1}^{K}\mu_k = 1$$

• The mean of the distribution is computed as:

$$\mathbb{E}[\boldsymbol{x}\,|\,\boldsymbol{\mu}] = \sum_{\boldsymbol{x}} \boldsymbol{x}\,p(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = (\mu_1, \ldots, \mu_K)^{T}$$

similar to the result for the Bernoulli distribution.

• The Multinoulli, also known as the Categorical distribution, is often denoted as (Mu here is the multinomial distribution):

$$\mathrm{Cat}(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \mathrm{Multinoulli}(\boldsymbol{x}\,|\,\boldsymbol{\mu}) = \mathrm{Mu}(\boldsymbol{x}\,|\,1, \boldsymbol{\mu})$$

• The parameter 1 emphasizes that we roll a $K$-sided die once ($N = 1$); see next for the multinomial distribution.
Likelihood: Multinoulli Distribution
• Let us consider a data set $\mathcal{D} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$. The likelihood becomes:

$$p(\mathcal{D}\,|\,\boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_{n=1}^{N} x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k}, \qquad \text{where } m_k = \sum_{n=1}^{N} x_{nk}$$

is the number of observations of $x_k = 1$.

• $m_k$ is the "sufficient statistic" of the distribution.

MLE Estimate: Multinoulli Distribution
• To compute the maximum likelihood (MLE) estimate of $\boldsymbol{\mu}$, we maximize an augmented log-likelihood:

$$\ln p(\mathcal{D}\,|\,\boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right) = \sum_{k=1}^{K} m_k \ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$$

• Setting the derivative with respect to $\mu_k$ equal to zero: $\mu_k = -\dfrac{m_k}{\lambda}$.

• Substitution into the constraint $\sum_{k=1}^{K}\mu_k = 1$ gives $\lambda = -\sum_{k=1}^{K} m_k$, and thus:

$$\mu_k^{ML} = \frac{m_k}{\sum_{k=1}^{K} m_k} = \frac{m_k}{N}$$

As expected, this is the fraction of the $N$ observations with $x_k = 1$.
Multinomial Distribution
• We can also consider the joint distribution of $m_1, \ldots, m_K$ in $N$ observations, conditioned on the parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)$.

• From the expression for the likelihood given earlier,

$$p(\mathcal{D}\,|\,\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{m_k}$$

the multinomial distribution $\mathrm{Mu}(m_1, \ldots, m_K\,|\,N, \boldsymbol{\mu})$ with parameters $N$ and $\boldsymbol{\mu}$ takes the form:

$$p(m_1, m_2, \ldots, m_K\,|\,N, \mu_1, \mu_2, \ldots, \mu_K) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$$

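• A small MATLAB sketch (illustrative values, not from the slides) evaluates this pmf stably through log-factorials, and cross-checks it against the binomial for $K = 2$:

% Sketch: multinomial pmf via gammaln (numerically stable log-factorials).
mu = [0.2 0.3 0.5]; m = [3 2 5]; N = sum(m);
logp = gammaln(N+1) - sum(gammaln(m+1)) + sum(m .* log(mu));
p = exp(logp)                               % Mu(m | N, mu)
% For K = 2 the multinomial reduces to Bin(m1 | N, mu1):
mu2 = [0.25 0.75]; m2 = [3 7];
p2   = exp(gammaln(11) - sum(gammaln(m2+1)) + sum(m2.*log(mu2)));
pBin = nchoosek(10, 3) * 0.25^3 * 0.75^7;   % identical value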
Example: Biosequence Analysis
• Consider a set of DNA sequences where there are 10 rows (sequences, $N = 1{:}10$) and 15 columns (locations along the genome).

[Figure: the 10 x 15 table of sequences, e.g. "cgatacggggtcgaa", "caatccgagatcgca", ...; rows indexed by sequence, columns by location along the genome.]

• Several locations are conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding columns tend to be pure, e.g., column 7 is all $g$'s.

• To visualize the data (sequence logo), we plot the letters $A$, $C$, $G$ and $T$ with a font size proportional to their empirical probability, and with the most probable letter on the top.
Example: Biosequence Analysis
• The empirical probability distribution at location $t$ is obtained by normalizing the vector of counts (see the MLE estimate):

$$\hat{\boldsymbol{\theta}}_t = \frac{1}{N}\left(\sum_{i=1}^{N}\mathbb{I}(X_{it}=1),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=2),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=3),\ \sum_{i=1}^{N}\mathbb{I}(X_{it}=4)\right)$$

• This distribution is known as a motif.

• We can also compute the most probable letter in each location; this is the consensus sequence.

[Figure: the sequence logo, bits vs. sequence position 1-15. Use the MatLab function seqlogoDemo from Kevin Murphy's PMTK.]

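• A minimal MATLAB sketch of the count-normalization step (the 10 x 15 matrix below is hypothetical random data standing in for the sequences in the figure, with $A, C, G, T$ coded as 1..4):

% Sketch: empirical per-location distribution (motif) from sequence data.
X = randi(4, 10, 15);                       % hypothetical data matrix
K = 4; N = size(X, 1); T = size(X, 2);
theta = zeros(K, T);                        % theta(k,t) = empirical p(letter k at t)
for k = 1:K
    theta(k, :) = sum(X == k, 1) / N;
end
[~, consensus] = max(theta, [], 1);         % most probable letter per location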
Summary of Discrete Distributions
• A summary of the multinomial and related discrete distributions is given below, following a table from Kevin Murphy's textbook:

  Distribution    n (rolls of the die)    K (number of states)
  Bernoulli       1                       1 (binary variable)
  Binomial        N                       1 (binary variable)
  Multinoulli     1                       K (1-of-K encoding)
  Multinomial     N                       K (1-of-K encoding)

• $n = 1$ means one roll of the die, $n = N$ means $N$ rolls; $K = 1$ denotes binary variables, general $K$ uses the 1-of-$K$ encoding.

The Poisson Distribution
• We say that $X \in \{0, 1, 2, 3, \ldots\}$ has a Poisson distribution with parameter $\lambda > 0$, if its pmf is

$$X \sim \mathrm{Poi}(\lambda): \qquad \mathrm{Poi}(x\,|\,\lambda) = e^{-\lambda}\frac{\lambda^{x}}{x!}$$

• This is a model for counts of rare events.

[Figure: pmfs of $\mathrm{Poi}(\lambda=1)$ and $\mathrm{Poi}(\lambda=10)$. Use the MatLab function poissonPlotDemo from Kevin Murphy's PMTK.]

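• A MATLAB sketch (illustrative $\lambda$ and truncation point) evaluates the pmf in log space and confirms that the mean and variance both equal $\lambda$:

% Sketch: Poi(x|lambda) pmf; mean and variance both equal lambda.
lambda = 10; x = 0:100;
pmf = exp(-lambda + x.*log(lambda) - gammaln(x+1));
m = sum(x .* pmf)                           % ~ lambda
v = sum((x - m).^2 .* pmf)                  % ~ lambda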
The Empirical Distribution
• Given data $\mathcal{D} = \{x_1, \ldots, x_N\}$, we define the empirical distribution as:

$$p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A), \qquad \text{Dirac measure:}\ \ \delta_{x_i}(A) = \begin{cases} 1 & \text{if } x_i \in A \\ 0 & \text{if } x_i \notin A \end{cases}$$

• We can also generalize by associating weights with each sample:

$$p_{\mathrm{emp}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x) \ \longrightarrow\ p_{\mathrm{emp}}(x) = \sum_{i=1}^{N} w_i\,\delta_{x_i}(x), \qquad 0 \leq w_i \leq 1, \quad \sum_{i=1}^{N} w_i = 1$$

• This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset.

• Note that the "sample mean of $f(x)$" is the expectation of $f(x)$ under the empirical distribution:

$$\mathbb{E}[f(x)] = \int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x)\,dx = \frac{1}{N}\sum_{n=1}^{N} f(x_n)$$
Student’s T Distribution

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

• The parameter $\lambda$ is called the precision of the $\mathcal{T}$-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as $\nu \to \infty$).

• The parameter $\nu$ is called the degrees of freedom.

• For the particular case of $\nu = 1$, the $\mathcal{T}$-distribution reduces to the Cauchy distribution.

• In the limit $\nu \to \infty$, the $\mathcal{T}$-distribution $\mathcal{T}(x\,|\,\mu, \lambda, \nu)$ becomes a Gaussian $\mathcal{N}(x\,|\,\mu, \lambda^{-1})$ with mean $\mu$ and precision $\lambda$.

For 𝝊 → ∞, 𝓣(𝒙|𝝁, 𝝀, 𝝊) Becomes a Gaussian

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$

• We first write the distribution as follows:

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) \propto \left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2} = \exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\ln\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]\right)$$

• For large $\nu$, we can approximate the log using $\ln(1+\epsilon) = \epsilon + O(\epsilon^{2})$:

$$\mathcal{T}(x\,|\,\mu, \lambda, \nu) \simeq \exp\left(-\left(\frac{\nu}{2}+\frac{1}{2}\right)\left[\frac{\lambda(x-\mu)^{2}}{\nu} + O(\nu^{-2})\right]\right) \to \exp\left(-\frac{\lambda(x-\mu)^{2}}{2} + O(\nu^{-1})\right)$$

• In the limit $\nu \to \infty$, the $\mathcal{T}$-distribution $\mathcal{T}(x\,|\,\mu, \lambda, \nu)$ is indeed a Gaussian $\mathcal{N}(x\,|\,\mu, \lambda^{-1})$ with mean $\mu$ and precision $\lambda$. The normalization of the $\mathcal{T}$ is valid in this limit as well (so the Gaussian obtained is normalized).

Student’s T Distribution
 1 2  /2 1/2
(  )  lx   
 l 
1/2

p ( x |  , l , )  2 2 1  
     
( )  
2
student distribution
0.4
v=10
0.35 v=1.0
v=0.1

Mean :  ,   1 0.3

  0, l  1
Mode :  0.25
For   , we
 0.2
obtain N (  , l 1 )
Var : ,  2
l   2  0.15

0.1

0.05

0
-5 -4 -3 -2 -1 0 1 2 3 4 5

MatLab Code
Student’s T Vs the Gaussian
• We plot: $\mathcal{N}(x\,|\,0,1)$, $\mathcal{T}(x\,|\,0,1,1)$, and $\mathrm{Lap}(x\,|\,0,1/\sqrt{2})$.

• The mean and variance of the Student's $\mathcal{T}$ are undefined for $\nu = 1$.

• In the log-density plots, the Student's $\mathcal{T}$ is NOT log-concave.

• When $\nu = 1$, the distribution is known as the Cauchy or Lorentz distribution. Due to its heavy tails, the mean does not converge.

• It is recommended to use $\nu = 4$.

[Figure: probability density functions and log probability density functions of the Gaussian, Student, and Laplace distributions. Run the MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK.]

Student’s T Distribution

$$p(x\,|\,\mu, a, b) = \int_0^{\infty}\mathcal{N}(x\,|\,\mu, \tau^{-1})\,\mathrm{Gamma}(\tau\,|\,a, b)\,d\tau = \int_0^{\infty}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left(-\frac{\tau}{2}(x-\mu)^{2}\right)\frac{b^{a}}{\Gamma(a)}\tau^{a-1}e^{-b\tau}\,d\tau$$

• The Student's $\mathcal{T}$ distribution can be seen from the equation above (see the following two slides for the proof) as an infinite mixture of Gaussians, each of them with a different precision (governed by a Gamma distribution).

• The result is a distribution that in general has longer "tails" than a Gaussian.

• This gives the $\mathcal{T}$-distribution robustness, i.e., the $\mathcal{T}$-distribution is much less sensitive than the Gaussian to the presence of outliers.
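• The mixture representation can be verified numerically; a MATLAB sketch (arbitrary test values for $\mu$, $a$, $b$, $x$) integrates the mixture and compares it with the closed form derived in the appendix slides:

% Sketch: Gaussian-Gamma mixture integral vs. the closed-form density.
mu = 0; a = 2; b = 3; x = 1.7;
integrand = @(tau) sqrt(tau/(2*pi)) .* exp(-0.5*tau*(x - mu)^2) ...
                   .* b^a .* tau.^(a-1) .* exp(-b*tau) / gamma(a);
pMix = integral(integrand, 0, Inf);
pT   = gamma(a + 0.5)/gamma(a) * b^a/sqrt(2*pi) ...
       * (b + 0.5*(x - mu)^2)^(-(a + 0.5));
fprintf('mixture %.10f   closed form %.10f\n', pMix, pT);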
Appendix: Student’s T as a Mixture of Gaussians
• If we have a univariate Gaussian $\mathcal{N}(x\,|\,\mu, \tau^{-1})$ together with a prior $\mathrm{Gamma}(\tau\,|\,a, b)$, and we integrate out the precision, we obtain the marginal distribution of $x$:

$$p(x\,|\,\mu, a, b) = \int_0^{\infty}\mathcal{N}(x\,|\,\mu, \tau^{-1})\,\mathrm{Gamma}(\tau\,|\,a, b)\,d\tau = \frac{b^{a}}{\Gamma(a)}\int_0^{\infty}\left(\frac{\tau}{2\pi}\right)^{1/2}\tau^{a-1}\exp\Big(-\tau\underbrace{\Big[b + \tfrac{1}{2}(x-\mu)^{2}\Big]}_{A}\Big)\,d\tau$$

• Introduce the transformation $z = \tau A$, with $A = b + \frac{1}{2}(x-\mu)^{2}$, to simplify as:

$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^{\infty}\tau^{a-1/2}e^{-\tau A}\,d\tau = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}A^{-a-1/2}\int_0^{\infty}z^{a-1/2}e^{-z}\,dz$$

Appendix: Student’s T as a Mixture of Gaussians
$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}A^{-a-1/2}\int_0^{\infty}z^{a-1/2}e^{-z}\,dz = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^{2}}{2}\right]^{-a-1/2}\int_0^{\infty}e^{-z}z^{a-1/2}\,dz$$

• Recalling the definition of the Gamma function, $\Gamma(a) = \int_0^{\infty}e^{-z}z^{a-1}\,dz$:

$$p(x\,|\,\mu, a, b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^{2}}{2}\right]^{-a-1/2}\Gamma\!\left(a + \frac{1}{2}\right)$$

• It is common to redefine the parameters of this distribution as $\nu = 2a$, $\lambda = \dfrac{a}{b}$:

$$p(x\,|\,\mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2}+\frac{1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^{2}}{\nu}\right]^{-\nu/2-1/2}$$
Robustness of Student’s T Distribution
• The robustness of the $\mathcal{T}$-distribution is illustrated here by comparing the maximum likelihood solutions for a Gaussian and a $\mathcal{T}$-distribution (30 data points from the Gaussian are used).

• The effect of a small number of outliers (figure on the right) is less significant for the $\mathcal{T}$-distribution than for the Gaussian.

[Figure: left panel, no outliers - the fitted $\mathcal{T}$-distribution and Gaussian nearly overlap; right panel, with outliers - the Gaussian is pulled toward the outliers while the $\mathcal{T}$-distribution is barely affected. See the accompanying MatLab code.]
Robustness of Student’s T Distribution
• The earlier simulation is repeated here with the PMTK toolbox.

[Figure: fitted Gaussian, Student T, and Laplace densities, without (top) and with (bottom) outliers. Run the MatLab function robustDemo from Kevin Murphy's PMTK.]

The Laplace Distribution
• Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\mathrm{Lap}(x\,|\,\mu, b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

• $\mu$ is a location parameter and $b > 0$ is a scale parameter.

$$\text{Mean} = \mu, \qquad \text{Mode} = \mu, \qquad \text{Var} = 2b^{2}$$

• It is robust to outliers (see the earlier demonstration).

• It puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model.

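• A minimal MATLAB sketch (illustrative $\mu$, $b$, and sample size) draws Laplace samples by the inverse-CDF transform and checks the moments above:

% Sketch: sample Lap(mu, b) by inverse CDF; check mean and variance.
mu = 0; b = 1;
u = rand(1, 1e6) - 0.5;                     % uniform on (-1/2, 1/2)
x = mu - b * sign(u) .* log(1 - 2*abs(u));  % inverse CDF of the Laplace
fprintf('mean %.3f (exact %.3f)   var %.3f (exact %.3f)\n', ...
        mean(x), mu, var(x), 2*b^2);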
Beta Distribution
• The $\mathrm{Beta}(\alpha, \beta)$ distribution, with $x \in [0,1]$ and $\alpha, \beta > 0$, is defined as follows:

$$\mathrm{Beta}(x\,|\,\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

where $B(\alpha,\beta)$ is the normalizing factor.

• The expected value, mode and variance of a Beta random variable $x$ with (hyper-)parameters $\alpha$ and $\beta$:

$$\mathbb{E}[x] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[x] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[x] = \frac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$$

[Figure: Beta densities for $(\alpha,\beta) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)$.]

• For more information visit this link.

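• A MATLAB sketch (illustrative parameters) evaluates the Beta pdf via gammaln and numerically checks the normalization and mean:

% Sketch: Beta(x|alpha,beta) pdf; check normalization and mean.
alpha = 2; beta_ = 3;            % beta_ avoids shadowing MATLAB's beta()
logB  = gammaln(alpha) + gammaln(beta_) - gammaln(alpha + beta_);
pdf   = @(x) exp((alpha-1)*log(x) + (beta_-1)*log(1-x) - logB);
Z  = integral(pdf, 0, 1)                    % ~ 1
mu = integral(@(x) x .* pdf(x), 0, 1)       % ~ alpha/(alpha+beta) = 0.4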
Beta Distribution
• If $\alpha = \beta = 1$, we obtain a uniform distribution.

• If $\alpha$ and $\beta$ are both less than 1, we get a bimodal distribution with spikes at 0 and 1.

• If $\alpha$ and $\beta$ are both greater than 1, the distribution is unimodal.

[Figure: Beta densities for $(\alpha,\beta) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)$, annotated by the three regimes above. Run betaPlotDemo from PMTK.]

Beta Distribution
[Figure: four panels showing the pdfs of Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3) and Beta(8, 4) on $x \in [0,1]$. See the Matlab implementation.]

Gamma Function
(   )  1
( x)  x (1  x)  1
( )(  )

 The gamma function extends the factorial to real numbers:



( x)   u x 1e u du
0

 With integration by parts:


( x  1)  x( x)
 For integer 𝑛:
(n)  (n  1)!
 For more information visit this link.

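• A short MATLAB sketch (arbitrary test values) verifies the three identities above numerically:

% Sketch: gamma-function identities checked numerically.
n = 6;
gamma(n) - factorial(n - 1)                         % ~ 0: Gamma(n) = (n-1)!
x = 2.37;
gamma(x + 1) - x*gamma(x)                           % ~ 0: recurrence
integral(@(u) u.^(x-1).*exp(-u), 0, Inf) - gamma(x) % ~ 0: the definition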
Beta Distribution: Normalization
• Showing that the $\mathrm{Beta}(\alpha, \beta)$ distribution is normalized correctly is a bit tricky. We need to prove that:

$$\Gamma(\alpha)\Gamma(\beta) = \Gamma(\alpha+\beta)\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$

Follow the steps:
• (a) change the variable $y$ below via $y = t - x$;
• (b) change the order of integration over the triangular region $\{(x,t): 0 \leq x \leq t\}$;
• and (c) change $x$ to $\mu$ via $x = t\mu$:

$$\Gamma(\alpha)\Gamma(\beta) = \left(\int_0^{\infty}x^{\alpha-1}e^{-x}\,dx\right)\left(\int_0^{\infty}y^{\beta-1}e^{-y}\,dy\right) \overset{(a)}{=} \int_0^{\infty}x^{\alpha-1}\left(\int_x^{\infty}e^{-t}(t-x)^{\beta-1}\,dt\right)dx$$

$$\overset{(b)}{=} \int_0^{\infty}\left(\int_0^{t}x^{\alpha-1}(t-x)^{\beta-1}e^{-t}\,dx\right)dt \overset{(c)}{=} \int_0^{\infty}t^{\alpha-1}t^{\beta-1}e^{-t}\,t\,dt\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$

$$= \Gamma(\alpha+\beta)\int_0^{1}\mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$$
Gamma Distribution: Rate Parametrization
• The Gamma distribution is frequently a model for waiting times. For important properties see here.

• It is more often parameterized in terms of a shape parameter $a$ and an inverse scale parameter $b = 1/\theta$, called a rate parameter:

$$p(x\,|\,a, b) = \frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad x \in (0, \infty), \qquad \Gamma(a) = \int_0^{\infty}u^{a-1}e^{-u}\,du$$

• The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \begin{cases}\dfrac{a-1}{b} & \text{for } a \geq 1 \\ 0 & \text{otherwise}\end{cases}, \qquad \mathrm{var}[x] = \frac{a}{b^{2}}$$

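• A MATLAB sketch (illustrative $a$, $b$) checks the mean and variance of the rate-parametrized Gamma by numerical integration:

% Sketch: moments of the rate-parametrized Gamma via numerical integration.
a = 2.5; b = 1.3;
pdf = @(x) b^a / gamma(a) * x.^(a-1) .* exp(-b*x);
m = integral(@(x) x .* pdf(x), 0, Inf)              % ~ a/b
v = integral(@(x) (x - m).^2 .* pdf(x), 0, Inf)     % ~ a/b^2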
Gamma Distribution
b a a 1
 Plots of Gamma ( X | a, b)  x exp( xb), b  1
(a )

 As we increase the rate 𝑏, the distribution squeezes


leftwards and upwards. For 𝑎 < 1, the mode is at zero.
Gamma distributions
0.9
a=1.0,b=1.0
0.8
a=1.5,b=1.0
0.7 a=2.0,b=1.0

0.6

0.5

0.4

0.3

0.2

0.1

1 2 3 4 5 6 7

Run gammaPlotDemo
from PMTK

Gamma Distribution
• An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: two panels showing the fit by the method of moments (MoM) and by MLE. Run the MatLab function gammaRainfallDemo from PMTK.]

Exponential Distribution
• This is defined as

$$\mathrm{Expon}(x\,|\,\lambda) = \mathrm{Gamma}(x\,|\,1, \lambda) = \lambda e^{-\lambda x}, \qquad x \in [0, \infty)$$

• Here $\lambda$ is the rate parameter.

• This distribution describes the times between events in a Poisson process, i.e., a process in which events occur continuously and independently at a constant average rate $\lambda$.

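• A MATLAB sketch (illustrative rate and sample size) simulates such a process: exponential inter-event times are drawn by inverse CDF, and the resulting counts in unit windows behave like $\mathrm{Poi}(\lambda)$ draws:

% Sketch: a rate-lambda Poisson process from Expon(lambda) gaps.
lambda = 2; n = 1e6;
gaps = -log(rand(1, n)) / lambda;          % inverse-CDF exponential draws
fprintf('mean gap %.4f (exact %.4f)\n', mean(gaps), 1/lambda);
t = cumsum(gaps);                          % event times
counts = histcounts(t, 0:floor(t(end)));   % counts in unit windows
fprintf('count mean %.3f, var %.3f (both ~ lambda)\n', mean(counts), var(counts));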
Chi-Squared Distribution
• This is defined as

$$\chi^{2}_{\nu}(x) = \mathrm{Gamma}\!\left(x\,\Big|\,\frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\nu/2-1}e^{-x/2}, \qquad x \in [0, \infty)$$

• This is the distribution of the sum of squared Gaussian random variables.

• More precisely,

$$\text{Let } Z_i \sim \mathcal{N}(0, 1) \text{ and } S = \sum_{i=1}^{\nu} Z_i^{2}, \text{ then } S \sim \chi^{2}_{\nu}$$

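• A one-line MATLAB sketch (illustrative $\nu$) confirms this by simulation, using the facts that a $\chi^{2}_{\nu}$ variable has mean $\nu$ and variance $2\nu$:

% Sketch: sum of nu squared standard normals is chi-squared with nu dof.
nu = 5;
S = sum(randn(nu, 1e6).^2, 1);
fprintf('mean %.3f (exact %d)   var %.3f (exact %d)\n', mean(S), nu, var(S), 2*nu);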
Inverse Gamma Distribution
• This is defined as follows:

$$\text{If } X \sim \mathrm{Gamma}(X\,|\,a, b)\ \Rightarrow\ X^{-1} \sim \mathrm{InvGamma}(X\,|\,a, b)$$

where:

$$\mathrm{InvGamma}(x\,|\,a, b) = \frac{b^{a}}{\Gamma(a)}\,x^{-(a+1)}\exp(-b/x), \qquad x \in (0, \infty)$$

• $a$ is the shape and $b$ the scale parameter.

• Note that $b$ is a scale parameter since:

$$\mathrm{InvGamma}(x\,|\,a, b) = \frac{1}{b}\,\mathrm{InvGamma}\!\left(\frac{x}{b}\,\Big|\,a, 1\right)$$

• It can be shown that:

$$\text{Mean} = \frac{b}{a-1}\ \ (\text{exists for } a > 1), \qquad \text{Mode} = \frac{b}{a+1}, \qquad \text{var} = \frac{b^{2}}{(a-1)^{2}(a-2)}\ \ (\text{exists for } a > 2)$$

The Pareto Distribution
• Used to model the distribution of quantities that exhibit long (heavy) tails:

$$\mathrm{Pareto}(x\,|\,k, m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x \geq m)$$

• This density asserts that $x$ must be greater than some constant $m$, but not too much greater; $k$ controls what is "too much".

• Examples: modeling the frequency of words vs. their rank (e.g., "the", "of", etc.), or the wealth of people.*

• As $k \to \infty$, the distribution approaches $\delta(x - m)$.

• On a log-log scale, the pdf forms a straight line of the form $\log p(x) = a\log x + c$ for some constants $a$ and $c$ (power law, Zipf's law).

* Basis of the distribution: a high proportion of a population has low income and only a few have very high incomes.

The Pareto Distribution
• Applications: modeling the frequency of words vs. their rank, the distribution of wealth ($k$ = Pareto index), etc.

$$\mathrm{Pareto}(x\,|\,k, m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x \geq m)$$

$$\text{Mean} = \frac{km}{k-1}\ \ (\text{if } k > 1), \qquad \text{Mode} = m, \qquad \text{var} = \frac{m^{2}k}{(k-1)^{2}(k-2)}\ \ (\text{if } k > 2)$$

[Figure: Pareto densities for $(m, k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00)$. Run ParetoPlot from PMTK.]

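• A MATLAB sketch (illustrative $k$, $m$) draws Pareto samples by the inverse-CDF transform $x = m\,u^{-1/k}$ and checks the mean, which exists only for $k > 1$:

% Sketch: Pareto(k, m) sampling by inverse CDF.
k = 3; m = 1;
u = rand(1, 1e6);
x = m * u.^(-1/k);                          % inverse-CDF transform
fprintf('mean %.3f (exact %.3f)\n', mean(x), k*m/(k - 1));
% On a log-log scale the pdf is linear: log p(x) = log(k*m^k) - (k+1)*log(x).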
Covariance
• Consider two random variables $X, Y: \Omega \to \mathbb{R}$.

• The joint probability distribution is defined as:

$$P(X \in A, Y \in B) = P\big(X^{-1}(A)\cap Y^{-1}(B)\big) = \int_A\int_B p(x, y)\,dx\,dy$$

• Two random variables are independent if

$$p(x, y) = p(x)\,p(y)$$

• The covariance of $X$ and $Y$ is defined as:

$$\mathrm{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))\big]$$

• It is straightforward to verify that: $\mathrm{cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$.

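• A MATLAB sketch (the linear model below is an arbitrary illustration with $\mathrm{cov}(X,Y) = 0.6$) verifies the identity on correlated samples:

% Sketch: verify cov(X,Y) = E[XY] - E[X]E[Y].
n = 1e6; X = randn(1, n); Y = 0.6*X + 0.8*randn(1, n);
lhs = mean((X - mean(X)) .* (Y - mean(Y)));   % definition
rhs = mean(X.*Y) - mean(X)*mean(Y);           % identity
fprintf('definition %.4f   identity %.4f   exact 0.6\n', lhs, rhs);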
Correlation, Center Normalized Random Variables
• Consider two random variables $X, Y: \Omega \to \mathbb{R}$.

• The correlation coefficient of $X$ and $Y$ is defined as:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X\,\sigma_Y}$$

where the standard deviations of $X$ and $Y$ are

$$\sigma_X = \sqrt{\mathrm{cov}(X, X)}, \qquad \sigma_Y = \sqrt{\mathrm{cov}(Y, Y)}$$

• The center-normalized random variables are defined as:

$$\tilde{X} = \frac{X - \mathbb{E}[X]}{\sigma_X}, \qquad \tilde{Y} = \frac{Y - \mathbb{E}[Y]}{\sigma_Y}$$

• It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \mathrm{var}[\tilde{X}] = \mathrm{var}[\tilde{Y}] = 1$$

