
Probability Review: Lecture Notes: Part 1

Basic Distributions and Moments


Yuly Koshevnik

Introduction
The main purpose of the notes is to quickly summarize facts from Probability Theory (STAT 4351
or 5351) and Mathematical Statistics (STAT 4352 or 5352) courses. The notes also provide further
generalizations of the basic notions that will be extensively utilized throughout this course.
Detailed descriptions can be found in the first chapter of your textbook.

1 Commonly Used Discrete Distributions


Recall that a random variable (X) is discretely distributed (or is discrete) if the set of all its possible
values is finite or countably infinite. Typical situations include (but are not limited to) the following
examples.

1.1 Discrete Uniform Distribution


This distribution is supported by a finite set of integer values, $\mathcal{X} = \{1, 2, \ldots, n\}$, and has the probabilities
assigned as follows:
\[ P[X = k] = \frac{1}{n} \quad \text{for } k = 1, 2, \ldots, n. \tag{1} \]
For the sake of quick reference, this distribution is going to be denoted as DU [n] . The expected value of
X is
\[ \mu_X = E[X] = \frac{1}{n} \sum_{k=1}^{n} k = \frac{1}{n} \times \frac{n(n+1)}{2} = \frac{n+1}{2}. \tag{2} \]
The second moment of X is
\[ E[X^2] = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{1}{n} \times \frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6}. \]

It is relatively easy to evaluate the variance of X, which is defined as
\[ \sigma_X^2 = \operatorname{Var}[X] = E[(X - \mu)^2] = E[X^2] - \mu^2, \]

where $\mu = \mu_X = E[X]$. Indeed,
\[ \sigma_X^2 = \operatorname{Var}[X] = \frac{(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 = \frac{n+1}{12} \times [2(2n+1) - 3(n+1)] = \frac{(n+1)(n-1)}{12}, \]
which can be simplified to
\[ \sigma_X^2 = \operatorname{Var}[X] = \frac{n^2 - 1}{12}. \tag{3} \]
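
As a quick sanity check of formulas (2) and (3), the following Python sketch sums over the support of DU[n] directly, using exact rational arithmetic. The function name du_moments and the choice n = 6 (a fair die) are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction


def du_moments(n):
    """Return (mean, second moment, variance) of DU[n] by direct summation."""
    p = Fraction(1, n)  # P[X = k] = 1/n for every k in {1, ..., n}
    mean = sum(k * p for k in range(1, n + 1))
    second = sum(k * k * p for k in range(1, n + 1))
    return mean, second, second - mean ** 2


n = 6  # hypothetical example: a fair six-sided die
mean, second, var = du_moments(n)
print(mean, second, var)  # prints: 7/2 91/6 35/12
```

The printed variance, 35/12, agrees with (n^2 − 1)/12 for n = 6.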

1.2 Bernoulli Distribution


This is a special distribution with the support, X = {0, 1} , where the value of X = 1 is usually considered
as a success and the value of X = 0 is interpreted as a failure. Usually, p = P [X = 1] is referred to as
a success rate and 1 − p = P [X = 0] is then named a failure rate. The expected value of X and any
moment of X are easy to derive:

\[ \mu = \mu_X = E[X] = E[X^r] = p \quad \text{for any natural } r. \tag{4} \]

Hence the variance of X is
\[ \sigma^2 = \sigma_X^2 = \operatorname{Var}[X] = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p). \tag{5} \]

1.3 Binomial Distribution


A common extension of the Bernoulli distribution is known as the Binomial distribution, which is going to be
denoted as Bin (n, p) . If $\{X_j : 1 \le j \le n\}$ represent n independent and identically distributed random variables,
each following the same Bernoulli distribution, then their sum,
\[ Y = \sum_{j=1}^{n} X_j, \]

follows the distribution denoted as Bin (n, p) . It is known from the Probability Theory course that
\[ P[Y = k] = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n. \tag{6} \]

If n = 1, this distribution is the Bernoulli one, with the success rate equal to p. That is why Bernoulli can
be denoted as Bin (1, p) . It was shown in Probability Theory that

\[ \mu_Y = E[Y] = np \quad \text{and} \quad \sigma_Y^2 = \operatorname{Var}[Y] = np(1-p). \tag{7} \]

It is important to realize that a single Bernoulli variable is usually identified with the result of one game,
while Y (or the sum of n independent Bernoulli variables) measures the number of victories after a set of
n games.
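
To see formulas (6) and (7) at work, here is a minimal Python sketch that tabulates the binomial probabilities and recovers the mean and variance by summation. The helper name binom_pmf and the values n = 10, p = 0.3 are assumptions chosen only for illustration.

```python
from math import comb


def binom_pmf(n, p):
    """List of P[Y = k], k = 0..n, for Y ~ Bin(n, p), following formula (6)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n, p = 10, 0.3  # hypothetical: 10 games with success rate 0.3
pmf = binom_pmf(n, p)

total = sum(pmf)                                  # should be 1
mean = sum(k * q for k, q in enumerate(pmf))      # should be n * p
var = sum(k * k * q for k, q in enumerate(pmf)) - mean**2  # n * p * (1 - p)

print(total, mean, var)
```

With n = 1 the same function returns the two Bernoulli probabilities, [1 − p, p].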

1.4 Geometric and Negative Binomial Distributions
A random variable (X) has a Geometric distribution with parameter (also referred to as a success rate)
p if it takes an integer value from $\mathcal{X} = \{0, 1, \ldots\}$, with probabilities:

\[ P[X = k] = p\,(1-p)^k \quad \text{for } k = 0, 1, \ldots. \tag{8} \]

Under this definition, X is interpreted as the number of losses (or failures) prior to the first success. If you
consider a shifted variable, W = X + 1, then W indicates the trial at which the first success occurs. Equivalently,
W = m if and only if the first (m − 1) trials result in failures, and the mth trial is a success.
The expected value and variance of W can be found as follows:
\[ E[W] = \frac{1}{p} \quad \text{and} \quad \operatorname{Var}[W] = \frac{1-p}{p^2}. \tag{9} \]
Using the relationship, X = W − 1, we obtain:
\[ E[X] = \frac{1}{p} - 1 = \frac{1-p}{p} \quad \text{and} \quad \operatorname{Var}[X] = \frac{1-p}{p^2}. \tag{10} \]
Notice that Var [X] = Var [W ] .
The negative binomial distribution relates to the geometric one in the same way that the binomial
relates to the Bernoulli variable. If $\{W_j : j = 1, 2, \ldots, r\}$ are independent random variables, each having
the same geometric distribution, Geom [p] (in the shifted form of W above, counting trials up to and including
the first success), then their sum, $Y = \sum_{j=1}^{r} W_j$, is said to have the negative
binomial distribution denoted as NB [r, p] . The event, [Y = k] , states that the rth success occurs at
trial number k. The formula for negative binomial probabilities is as follows:
\[ P[Y = k] = \frac{(k-1)!}{(r-1)!\,(k-r)!}\, p^r (1-p)^{k-r} \quad \text{for } k = r, r+1, \ldots. \tag{11} \]

The variable T = Y − r, measuring the total number of failures before the rth success occurs, has a shifted
distribution,
\[ P[T = m] = \frac{(m+r-1)!}{(r-1)!\,m!}\, p^r (1-p)^m \quad \text{for } m = 0, 1, \ldots. \tag{12} \]
The expected value and variance for either version of the negative binomial distribution can be found
in the same way as you have seen in the Probability Theory course. Using the same convention, that is, Y is
the time of the rth success and T is the count of failures that preceded this success, you can obtain:
\[ E[Y] = \frac{r}{p} \quad \text{and} \quad E[T] = \frac{r(1-p)}{p}, \tag{13} \]
while
\[ \operatorname{Var}[Y] = \operatorname{Var}[T] = r \cdot \frac{1-p}{p^2}. \tag{14} \]

Remark: Both geometric and negative binomial distributions have a countably infinite supporting set, unlike
the previously considered examples, whose supports are finite.
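
The negative binomial formulas (11), (13) and (14) can be checked numerically by truncating the infinite sums. The sketch below is only an illustration: the name nb_pmf, the values r = 3 and p = 0.4, and the truncation point K = 400 are assumptions, and the neglected tail is tiny because (1 − p)^k decays geometrically.

```python
from math import comb


def nb_pmf(k, r, p):
    """P[Y = k] for Y ~ NB[r, p]: the r-th success occurs at trial k (formula (11))."""
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


r, p, K = 3, 0.4, 400  # hypothetical parameters; K truncates the infinite support
ks = range(r, K + 1)

total = sum(nb_pmf(k, r, p) for k in ks)                        # close to 1
mean_Y = sum(k * nb_pmf(k, r, p) for k in ks)                   # close to r / p
var_Y = sum(k * k * nb_pmf(k, r, p) for k in ks) - mean_Y**2    # close to r (1 - p) / p^2
mean_T = mean_Y - r                                             # T = Y - r counts failures

print(total, mean_Y, var_Y, mean_T)
```

Setting r = 1 reduces nb_pmf to the distribution of the shifted geometric variable W.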

1.5 Poisson Distributions
This is another type of discrete distribution, also with a countably infinite set of distinct values. A random
variable (Y ) follows a Poisson distribution with the intensity parameter λ > 0 if
\[ P[Y = k] = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, 2, \ldots. \tag{15} \]
The interpretation of Poisson distributions is simple: when $Y_n$ has the binomial distribution, Bin [n, p] ,
with $n \to \infty$ and $p = p_n \to 0$, in such a way that $n p_n \to \lambda > 0$, or simply $n p_n \approx \lambda > 0$, then the probabilities
converge:
\[ \lim_{n \to \infty} P[Y_n = k] = \frac{\lambda^k}{k!}\, e^{-\lambda} = P[Y = k] \quad \text{for } k = 0, 1, 2, \ldots. \]
The expectation and variance of the Poisson distribution were derived in Probability Theory course as

E [Y ] = λ and Var [Y ] = λ. (16)
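
The binomial-to-Poisson limit can be observed numerically. The following sketch compares Bin(n, λ/n) with the Poisson probabilities (15) for growing n. The value λ = 2, the range k = 0..10, and the function names are illustrative assumptions.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    """P[Y_n = k] for Y_n ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P[Y = k] for a Poisson variable with intensity lam, as in formula (15)."""
    return lam**k * exp(-lam) / factorial(k)


lam = 2.0  # hypothetical intensity


def max_gap(n):
    """Largest |Bin(n, lam/n) - Poisson(lam)| probability gap over k = 0..10."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))


gaps = [max_gap(n) for n in (10, 100, 1000)]
print(gaps)  # the gaps shrink as n grows
```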

2 Common Continuous Distributions


For a discrete random variable, the key notion that describes its distribution is the probability mass
function, or p.m.f,
f (x) = P [X = x] , (17)
where x belongs to a set $\mathcal{X}$ of distinct integer values. Usually, it is either the set of natural or (sometimes)
whole numbers (starting at x = 0). For continuous random variables, especially real-valued ones, we shall
use the same notation, f (x) , to refer to the probability density function, or p.d.f. Instead of the
probabilities in equation (17), the density determines the cumulative distribution function (CDF), say F (t) ,
which is defined as
\[ F(t) = \int_{-\infty}^{t} f(x)\,dx \quad \text{for all } t \in (-\infty, \infty). \tag{18} \]
More generally, the notion of CDF also makes sense for discrete cases:
\[ F(t) = \sum_{x \le t} P[X = x] = \sum_{x \le t} f(x). \]

Commonly used continuous distributions will be summarized in the same way as we did with their discrete
analogs. To help you refresh your memory of the Probability course, the density function is introduced
first. After that, the first and second moments will be outlined.

2.1 Uniform Distributions
A real-valued random variable (U ) follows the uniform distribution over the unit interval, (0, 1), if its density
is
\[ f(x) = 1 \ \text{for } 0 < x < 1 \quad \text{and} \quad f(x) = 0 \ \text{elsewhere}. \tag{19} \]
Often we need to extend the uniform distribution to an arbitrary interval, (a, b) , where
a < b. The general uniform distribution has the density function,
\[ f(x) = f(x \mid a, b) = \frac{1}{b-a} \ \text{for } a < x < b \quad \text{and} \quad f(x) = 0 \ \text{elsewhere}. \tag{20} \]
It is easy to see the equivalence of two statements:

• X follows Unif (a, b), where a < b, and

• the variable
\[ U = \frac{X - a}{b - a} \]
has the standard uniform distribution on the interval (0, 1).

Solving the previous equation for X, we obtain:

X = a + (b − a) U. (21)

Recall that the moments of a continuous random variable, say X, are defined as integrals. If T = T (X) is
a transformed random variable, such as a power of X or a polynomial in X, then
\[ E[T] = \int_{-\infty}^{\infty} T(x)\, f(x)\,dx, \tag{22} \]
provided this (generally, improper) integral converges to a finite value. Fortunately, the uniform distribution
does not cause such trouble as a divergent integral. The expected value and variance are:
\[ E[X] = \frac{a+b}{2} \quad \text{and} \quad \operatorname{Var}[X] = \frac{(b-a)^2}{12}. \tag{23} \]
In particular, if U is uniformly distributed over the unit interval, (0, 1) , we obtain:
\[ E[U] = \frac{1}{2} \quad \text{and} \quad \operatorname{Var}[U] = \frac{1}{12}. \]
It is also helpful to realize that
\[ E[U^r] = \frac{1}{r+1} \quad \text{for any } r > -1, \]
not only for moments of integer order. If r ≤ −1, the integral above diverges.
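
Formula (22) can be applied numerically as well. The sketch below approximates the integrals by a midpoint rule and compares them with (23) and with E[U^r] = 1/(r + 1). The helper midpoint_integral, the interval (a, b) = (2, 5), and the exponent r = 0.5 are assumptions made for illustration.

```python
def midpoint_integral(g, lo, hi, steps=100_000):
    """Approximate the integral of g over (lo, hi) with the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


a, b = 2.0, 5.0            # hypothetical interval for Unif(a, b)
density = 1.0 / (b - a)    # formula (20) inside the interval

mean = midpoint_integral(lambda x: x * density, a, b)        # (a + b) / 2
second = midpoint_integral(lambda x: x * x * density, a, b)
var = second - mean**2                                       # (b - a)^2 / 12

# A non-integer moment of the standard uniform: E[U^0.5] = 1 / 1.5
half_moment = midpoint_integral(lambda u: u**0.5, 0.0, 1.0)

print(mean, var, half_moment)
```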

2.2 Gamma Distributions
Recall the definition of the Γ function, introduced by Leonhard Euler, for argument a > 0:
\[ \Gamma(a) = \int_{0}^{\infty} x^{a-1} e^{-x}\,dx. \tag{24} \]

It can be shown for any a in the domain {a > 0} that

Γ (a + 1) = a · Γ (a) ,

and Γ (1) = 1. Using induction, for any natural n, we obtain:

Γ (n) = (n − 1)!,

so Γ (2) = 1, and Γ (3) = 2! = 2, and so on.


The standardized Gamma density is defined as
\[ f(x) = f(x \mid a, 1) = \frac{1}{\Gamma(a)}\, x^{a-1} e^{-x} \ \text{for } 0 < x < \infty \quad \text{and} \quad f(x) = 0 \ \text{otherwise}. \tag{25} \]

Like any continuous density, it satisfies the condition:
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]

Notice that integration is actually performed over the interval, (0, ∞) . The first two moments of a
standardized Gamma distributed random variable, say X, are:
\[ E[X] = \frac{\Gamma(a+1)}{\Gamma(a)} = a \quad \text{and} \quad E[X^2] = \frac{\Gamma(a+2)}{\Gamma(a)} = a(a+1). \tag{26} \]

Therefore, the variance of a standardized Gamma distributed random variable is:
\[ \operatorname{Var}[X] = E[X^2] - (E[X])^2 = a(a+1) - a^2 = a. \tag{27} \]

A general Gamma distribution has two parameters: a, viewed as the shape, and b, which indicates the
scale. Its density function is:
\[ f(x) = f(x \mid a, b) = \frac{1}{b^a\, \Gamma(a)}\, x^{a-1} e^{-x/b} \quad \text{for } 0 < x < \infty. \tag{28} \]

It is easy to notice that if Y has the density (28), then X = Y/b follows the distribution (25). Therefore,
the mean and variance of a general Gamma distributed random variable, Y, can be found as
\[ E[Y] = ab \quad \text{and} \quad \operatorname{Var}[Y] = ab^2. \tag{29} \]
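
The recursion for Γ and the moment formulas (26) to (29) are easy to verify with the gamma function from Python's standard math module. In the sketch below, the helper gamma_moment uses E[Y^r] = b^r Γ(a + r) / Γ(a), which follows from (26) and the scaling Y = bX; the parameter values a = 2.5 and b = 3 are assumptions for illustration.

```python
from math import factorial, gamma


def gamma_moment(a, b, r):
    """E[Y^r] for Y with shape a and scale b, via b^r * Gamma(a + r) / Gamma(a)."""
    return b**r * gamma(a + r) / gamma(a)


# Gamma(n) = (n - 1)! for natural n
checks = [(n, gamma(n), factorial(n - 1)) for n in range(1, 8)]

a, b = 2.5, 3.0  # hypothetical shape and scale
mean = gamma_moment(a, b, 1)                 # a * b
var = gamma_moment(a, b, 2) - mean**2        # a * b^2

print(mean, var)
```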

2.2.1 Special Case: Exponential Distributions
The exponential distribution is a particular case of Gamma, with a = 1 and b > 0. In this case, the density
function defined by (28) takes the form:
\[ f(x) = f(x \mid b) = \frac{1}{b} \exp\left(-\frac{x}{b}\right), \quad x > 0. \tag{30} \]
Frequently, the scale parameter b is replaced by its reciprocal, λ = 1/b, so equation (30) is replaced by
\[ f(x) = \lambda \cdot e^{-\lambda x}, \quad \text{where } x > 0. \tag{31} \]

Then the expected value and variance are:
\[ E[X] = \frac{1}{\lambda} \quad \text{and} \quad \operatorname{Var}[X] = \frac{1}{\lambda^2}. \]

2.3 Beta Distributions


Leonhard Euler also introduced another function, the so-called Beta function,
\[ B(a, b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1}\,dx, \tag{32} \]

defined for {a > 0, b > 0} . He also established the identity:
\[ B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}, \]
which explains the next definition. A random variable, X, has the Beta distribution with parameters (a, b)
if its density function is defined for 0 < x < 1 as
\[ f(x) = f(x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, \tag{33} \]
and f (x) = 0, otherwise. The two parameters are assumed to be strictly positive real numbers: a > 0 and
b > 0.
Owing to the properties of the Gamma function mentioned above, the first two moments of the Beta distribution
can be easily derived:
\[ E[X] = \frac{\Gamma(a+b) \cdot \Gamma(a+1)}{\Gamma(a)\,\Gamma(a+b+1)} = \frac{a}{a+b}. \tag{34} \]
Similarly, the second moment is
\[ E[X^2] = \frac{a(a+1)}{(a+b)(a+b+1)}, \]
which results in
\[ \operatorname{Var}[X] = \frac{ab}{(a+b)^2 \cdot (a+b+1)}. \tag{35} \]
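
The Beta moments can be checked through Euler's identity: since x^r times the density (33) is again a Beta kernel, E[X^r] = B(a + r, b) / B(a, b). The sketch below relies on that observation; the function names and the parameters a = 2, b = 3 are assumptions for illustration.

```python
from math import gamma


def beta_fn(a, b):
    """Euler's Beta function via the identity B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)


def beta_moment(a, b, r):
    """E[X^r] for X ~ Beta(a, b), computed as B(a + r, b) / B(a, b)."""
    return beta_fn(a + r, b) / beta_fn(a, b)


a, b = 2.0, 3.0  # hypothetical parameters
mean = beta_moment(a, b, 1)                  # a / (a + b)
var = beta_moment(a, b, 2) - mean**2         # a b / ((a + b)^2 (a + b + 1))

print(mean, var)
```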

3 Transformed Continuous Variables
A common setup includes a real-valued continuous variable, X, with density function, $f(x) = f_X(x)$, and a
transformed variable, Y = T (X). Several facts related to the CDF and density of Y, covered in STAT 4351,
are recalled here.

3.1 Shift Transformation


Assume that Y = X + a, where a is a real number.
1. CDF for Y is:
\[ F_Y(y) = P[Y \le y] = F_X(y - a) \]

2. Density function is:
\[ f_Y(y) = f_X(y - a) \]

3. Expected value and variance are:
\[ E[Y] = E[X] + a \quad \text{and} \quad \operatorname{Var}[Y] = \operatorname{Var}[X] \]

3.2 Reflection
Assume that Y = −X.
1. CDF for Y is:
\[ F_Y(y) = P[-X \le y] = 1 - F_X(-y) \]

2. Density function is:
\[ f_Y(y) = f_X(-y) \]

3. Expectation and variance are:
\[ E[Y] = -E[X] \quad \text{and} \quad \operatorname{Var}[Y] = \operatorname{Var}[X] \]

3.3 Scale Transformation


Assume that a scale parameter is b > 0 and Y = b · X.
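1. CDF for Y is:
\[ F_Y(y) = P[X \le y \cdot b^{-1}] = F_X\left(\frac{y}{b}\right) \]

2. Density for Y is:
\[ f_Y(y) = \frac{1}{b} \cdot f_X\left(\frac{y}{b}\right) \]

3. Expectation and variance for Y are:
\[ E[Y] = b \cdot E[X] \quad \text{and} \quad \operatorname{Var}[Y] = b^2 \cdot \operatorname{Var}[X] \]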

3.4 Scale and Shift Transformation
Assume that b > 0 and a is an arbitrary real number. Consider Y = a + b · X. Combining the two previously
considered situations, we obtain results for Y as follows.
1. CDF for Y is:
\[ F_Y(y) = F_X\left(\frac{y - a}{b}\right) \]

2. Density function for Y is:
\[ f_Y(y) = \frac{1}{b} \cdot f_X\left(\frac{y - a}{b}\right) \]

3. Expectation and variance for Y are:
\[ E[Y] = a + b \cdot E[X] \quad \text{and} \quad \operatorname{Var}[Y] = b^2 \cdot \operatorname{Var}[X] \]

If the scale parameter b < 0, you can derive all characteristics of Y = a + bX combining the above formulas
with reflection.
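
The scale-and-shift rules can be illustrated with a concrete density. The sketch below takes X to be exponential with λ = 1 (an assumption made only for this example), builds the CDF and density of Y = a + bX from the formulas of this subsection, and checks that the density matches a numerical derivative of the CDF. The values a = 1, b = 2 and the evaluation point y = 4 are also illustrative choices.

```python
from math import exp


def F_X(x):
    """CDF of the exponential distribution with lambda = 1."""
    return 1.0 - exp(-x) if x > 0 else 0.0


def f_X(x):
    """Density of the exponential distribution with lambda = 1."""
    return exp(-x) if x > 0 else 0.0


a, b = 1.0, 2.0  # hypothetical shift and (positive) scale


def F_Y(y):
    """CDF of Y = a + b X: F_X((y - a) / b)."""
    return F_X((y - a) / b)


def f_Y(y):
    """Density of Y = a + b X: (1 / b) * f_X((y - a) / b)."""
    return f_X((y - a) / b) / b


y, h = 4.0, 1e-5
slope = (F_Y(y + h) - F_Y(y - h)) / (2 * h)  # central difference of the CDF
print(slope, f_Y(y))  # the two numbers should nearly coincide
```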

3.5 Reciprocal
Assume for the sake of simplicity that X > 0 with probability one and consider $Y = X^{-1}$. Moments of Y
may fail to exist, so we shall restrict our attention to the CDF and density only, both stated for y > 0.
1. CDF for Y is:
\[ F_Y(y) = 1 - F_X\left(\frac{1}{y}\right) \]

2. Density of Y is:
\[ f_Y(y) = \frac{1}{y^2} \cdot f_X\left(\frac{1}{y}\right) \]
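
A small example shows both the reciprocal formulas and the warning about moments. Take X to be standard uniform on (0, 1), an assumption made only for this sketch. Then Y = 1/X has CDF 1 − 1/y and density 1/y^2 for y > 1, and its truncated mean over (1, M) equals ln M, which grows without bound. The function names and the midpoint-rule integration are illustrative choices.

```python
from math import log


def F_X(x):
    """CDF of the standard uniform distribution on (0, 1)."""
    return min(max(x, 0.0), 1.0)


def f_X(x):
    """Density of the standard uniform distribution on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def F_Y(y):
    """CDF of Y = 1 / X for y > 0: 1 - F_X(1 / y)."""
    return 1.0 - F_X(1.0 / y)


def f_Y(y):
    """Density of Y = 1 / X for y > 0: f_X(1 / y) / y^2."""
    return f_X(1.0 / y) / y**2


def truncated_mean(M, steps=200_000):
    """Midpoint-rule approximation of the integral of y * f_Y(y) over (1, M)."""
    h = (M - 1.0) / steps
    return h * sum((1.0 + (i + 0.5) * h) * f_Y(1.0 + (i + 0.5) * h) for i in range(steps))


print(F_Y(4.0))                                   # 0.75
print(truncated_mean(10.0), truncated_mean(100.0))  # roughly ln 10 and ln 100
```

This agrees with the earlier remark that E[U^r] diverges for r ≤ −1.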

4 Bivariate Distributions
Assume that the observations are pairs, (X, Y ), where X and Y are random variables, discrete or continuous.
Even a mixed case is covered here, as far as variances and covariance are concerned.
Suppose that all necessary moments are finite.

4.1 Means and Variances


1. Mean of a pair (X, Y ) is evaluated component-wise:
\[ E[(X, Y)] = (\mu_X, \mu_Y) \]

2. Variances are also subject to component-wise evaluation:
\[ (\sigma_X)^2 = \operatorname{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2 \]
and similarly, $(\sigma_Y)^2 = \operatorname{Var}[Y] = E[Y^2] - (E[Y])^2$

4.2 Covariance and Correlation
Covariance between X and Y is defined as follows:

Cov [X, Y ] = E [(X − E [X]) · (Y − E [Y ])] = E [X · Y ] − E [X] · E [Y ]

Using variances and their square roots (standard deviations), the correlation can be introduced as follows:
\[ \operatorname{Corr}[X, Y] = \frac{\operatorname{Cov}[X, Y]}{\sigma_X \cdot \sigma_Y}, \]
where $\sigma_X = \sqrt{(\sigma_X)^2}$ and $\sigma_Y = \sqrt{(\sigma_Y)^2}$

4.3 Variance of a Sum


The variance of a sum, (X + Y ) is

Var [X + Y ] = Var [X] + 2 · Cov [X, Y ] + Var [Y ]

Combining this formula with other properties of the variance, conclude that

Var [aX + bY ] = a2 · Var [X] + 2ab · Cov [X, Y ] + b2 · Var [Y ]

In particular, variance of a difference is:

Var [X − Y ] = Var [X] − 2 · Cov [X, Y ] + Var [Y ]
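
The formula for Var[aX + bY] can be confirmed on a small discrete example with exact arithmetic. The joint distribution below, the coefficients a = 2 and b = −3, and the helper names E and var are hypothetical choices made only for this sketch.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y) on three points
joint = {(0, 0): Fr(1, 4), (0, 1): Fr(1, 4), (1, 1): Fr(1, 2)}


def E(g):
    """Expected value of g(X, Y) under the joint p.m.f."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def var(g):
    """Variance of g(X, Y): E[g^2] - (E[g])^2."""
    return E(lambda x, y: g(x, y) ** 2) - E(g) ** 2


cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

a, b = 2, -3
lhs = var(lambda x, y: a * x + b * y)
rhs = a * a * var(lambda x, y: x) + 2 * a * b * cov + b * b * var(lambda x, y: y)

print(cov, lhs, rhs)  # prints: 1/8 19/16 19/16
```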

4.4 Independence and Correlation


Recall that (X, Y ) are said to be independent if for any two real numbers, (u, v), the events A = [X ≤ u] and
B = [Y ≤ v] are independent, that is:
\[ P[A \cap B] = P[A] \cdot P[B] \]

If (X, Y ) are independent, then their covariance is zero, and therefore their correlation also vanishes. The
converse is generally not true. For uncorrelated variables, (X, Y ), the variances of their sum and difference
coincide:
Var [X ± Y ] = Var [X] + Var [Y ]
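
A standard counterexample shows why zero covariance does not imply independence. In the sketch below, X is assumed to be uniform on {−1, 0, 1} and Y = X^2. These choices, and the test point (u, v) = (0, 0), are illustrative and not taken from the course material.

```python
from fractions import Fraction as Fr

support = [-1, 0, 1]   # hypothetical: X uniform on {-1, 0, 1}, Y = X^2
p = Fr(1, 3)

E_X = sum(x * p for x in support)
E_Y = sum(x**2 * p for x in support)
E_XY = sum(x * x**2 * p for x in support)
cov = E_XY - E_X * E_Y            # equals 0, so X and Y are uncorrelated

# Independence check at (u, v) = (0, 0): A = [X <= 0], B = [Y <= 0]
u, v = 0, 0
p_joint = sum(p for x in support if x <= u and x**2 <= v)
p_A = sum(p for x in support if x <= u)
p_B = sum(p for x in support if x**2 <= v)

print(cov, p_joint, p_A * p_B)  # prints: 0 1/3 2/9
```

Since 1/3 differs from 2/9, the events A and B are not independent, even though the covariance is zero.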
