Lecture 5: Random Variables and Probability Functions
Matt Golder & Sona Golder
Pennsylvania State University
Random Variables
Whenever an experiment is conducted, we are typically interested in the
realization of some variable, X.
The realizations of X are governed by the rules of probability regarding
potential outcomes in the sample space.
X is referred to as a random variable – it is a numerical description of the
outcome of an experiment.
Random Variables
To define a random variable, we have to assign a real number to each point in
the sample space of the experiment that denotes the value of the variable X.
This function is called a random variable (or stochastic variable) or more
precisely a random function (stochastic function).
Formally, a random variable is a real-valued function for which the domain is
the sample space.
Random Variables
Example: Suppose that a coin is tossed twice so that the sample space is
S = {HH, HT, TH, TT}.
Let X represent the number of heads that can come up.
With each sample point we can associate a number for X.
Thus, X is a random variable.
Table: Random Variable: Tossing a Coin Twice
Sample Point HH HT TH TT
X 2 1 1 0
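The mapping from sample points to values of X can be sketched in a few lines of Python. This is only an illustration: the names `sample_space`, `X` and `pmf` are hypothetical, and the assumption of a fair coin (each sample point having probability 1/4) is not stated on the slide.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a coin: HH, HT, TH, TT.
sample_space = ["".join(p) for p in product("HT", repeat=2)]

# The random variable X assigns to each sample point its number of heads.
X = {point: point.count("H") for point in sample_space}

# ASSUMPTION: a fair coin, so every sample point has probability 1/4.
pmf = {}
for point, x in X.items():
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 4)

print(X)
print(pmf)
```

The dictionary `X` plays the role of the function from the sample space to the real numbers; `pmf` collects the probability attached to each value of X.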
Random Variables
A discrete random variable X is one that can assume only a finite or
countably infinite number of distinct values.
A continuous random variable is one that can assume an uncountably infinite
number of values, such as any value in an interval of the real line.
Random Variables
Statistical inference involves making an inference about a population. The
event of interest to us will often correspond to the value of a random variable.
The collection of probabilities associated with the different values of a random
variable is called the probability distribution of the random variable.
The intuition of probability as it relates to discrete and continuous random
variables is identical; the only thing that is different is the math.
Let’s start by looking at discrete probability distributions.
Probability Mass Function
Definition: The probability that X takes on the value x, Pr(X = x), is defined
as the sum of the probabilities of all sample points in S that are assigned the
value x. Pr(X = x) is sometimes denoted by Pr(x) or p(x).
Probability distributions are functions that assign probabilities to each value x
of the random variable X. Thus, the probability distribution of X is also the
probability function of X i.e. f(x).
For a discrete random variable, the probability distribution is called the
probability mass function (pmf).
For a continuous random variable, it is called the probability density function
(pdf).
Probability Mass Function
The pmf, f(x), of a discrete random variable X satisfies the following
properties:
$0 \le f(x) \le 1$
$\sum_x f(x) = 1$, where the summation is over all values of x with nonzero
probability.
Example: A couple plan to have 3 children, and are interested in the number of
girls they might have.
We can redefine “the number of girls” as a random variable X that takes on
the values 0, 1, 2, 3, i.e. X ∈ {0, 1, 2, 3}.
Associated with each value is a probability, derived from the original sample
space.
Probability Mass Function
Figure: Probability Mass Function for Number of Girls in a Three-Child Family
I (if the probability of a boy on each birth is 0.52)
Outcome e   Pr(e)   X
BBB         .14     0
BBG         .13     1
BGB         .13     1
BGG         .12     2
GBB         .13     1
GBG         .12     2
GGB         .12     2
GGG         .11     3

X   Pr(x)
0   .14
1   .39
2   .36
3   .11
What is the probability that the couple will have more than one girl?
Pr(X > 1) = p(2) + p(3) = 0.36 + 0.11 = 0.47
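The step from outcome probabilities to the pmf of X can be sketched as follows. The outcome probabilities are the rounded values shown in the figure; the variable names are illustrative only.

```python
from collections import defaultdict

# Outcome probabilities as rounded in the figure (Pr(boy) = 0.52 per birth).
outcome_prob = {"BBB": 0.14, "BBG": 0.13, "BGB": 0.13, "BGG": 0.12,
                "GBB": 0.13, "GBG": 0.12, "GGB": 0.12, "GGG": 0.11}

# Pr(X = x) is the sum of the probabilities of all outcomes with x girls.
totals = defaultdict(float)
for outcome, p in outcome_prob.items():
    totals[outcome.count("G")] += p

# Round to two decimals to remove floating-point noise.
pmf = {x: round(p, 2) for x, p in sorted(totals.items())}
prob_more_than_one_girl = round(pmf[2] + pmf[3], 2)

print(pmf)
print(prob_more_than_one_girl)
```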
Probability Mass Function
Figure: Probability Mass Function for Number of Girls in a Three-Child Family
II (if the probability of a boy on each birth is 0.52)
[Bar chart: horizontal axis X = 0, 1, 2, 3; vertical axis Pr(X), scaled from 0 to 0.4.]
Probability Mass Function
Example: Suppose we have a sample of two U.S. voters who could support
Obama or McCain. The sample space is S = {OO, OM, MO, MM}. Let’s
assume that the probability that a voter supports Obama is given by
Pr(Vi = O) = o, so that Pr(Vi = M) = 1 − o, and that the two voters
decide independently.
Table: Obama vs McCain
Sample Point   Obama Voters   Probability of Sample Point
OO             2              $o^2$
OM             1              $o(1 - o)$
MO             1              $(1 - o)o$
MM             0              $(1 - o)^2$
Probability Mass Function
Let f(x) be the probability that x respondents prefer Obama:
Table: PMF: Obama vs McCain
Obama Voters (x)   f(x)
0                  $(1 - o)^2$
1                  $2o(1 - o)$
2                  $o^2$
The probability that either one or both of the voters prefer Obama is
$$\Pr(X = 1 \cup X = 2) = f(1) + f(2) = 2o(1 - o) + o^2.$$
Cumulative Probability Function
If X is at least ordinal, we may be interested in the probability that X assumes
a value less than or equal to a certain value, k.
Definition (Cumulative Probability Function): If $x_1, x_2, \ldots, x_m$ are values of X
such that $x_1 < x_2 < \ldots < x_m$, then the cumulative probability function of $x_k$
is given by:
$$\Pr(X \le x_k) = \sum_{x \le x_k} \Pr(X = x) = 1 - \sum_{x > x_k} \Pr(X = x)$$
or, equivalently,
$$\Pr(X \le x_k) = F(x_k) = f(x_1) + f(x_2) + \ldots + f(x_k) = \sum_{i=1}^{k} f(x_i)$$
Cumulative Probability Function
There are some important facts regarding cumulative probability functions to
note:
1 If $x_m$ is the largest value of X, then $F(x_m) = 1$.
2 $F(-\infty) = 0$, $F(\infty) = 1$.
3 $F(x_i) - F(x_{i-1}) = f(x_i)$, where $x_i > x_{i-1} > x_{i-2} > \ldots$
Cumulative Probability Function
If X takes on only a finite number of values, $x_1, x_2, \ldots, x_n$, then the
cumulative distribution function is given by:
$$F(x) = \begin{cases}
0 & -\infty < x < x_1 \\
f(x_1) & x_1 \le x < x_2 \\
f(x_1) + f(x_2) & x_2 \le x < x_3 \\
\vdots & \vdots \\
f(x_1) + \ldots + f(x_n) & x_n \le x < \infty
\end{cases}$$
Cumulative Probability Function
The cumulative probability for the number of girls in a three-child family, X, is
$$F(x) = \begin{cases}
0 & -\infty < x < 0 \\
0.14 & 0 \le x < 1 \\
0.14 + 0.39 = 0.53 & 1 \le x < 2 \\
0.14 + 0.39 + 0.36 = 0.89 & 2 \le x < 3 \\
0.14 + 0.39 + 0.36 + 0.11 = 1 & 3 \le x < \infty
\end{cases}$$
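As a rough sketch, the cumulative probabilities can be built from the pmf by running sums. The function name `F` mirrors the notation of the slides; everything else is illustrative.

```python
from itertools import accumulate

values = [0, 1, 2, 3]              # number of girls
pmf = [0.14, 0.39, 0.36, 0.11]     # f(x) from the three-child example

# Running sums of the pmf give F at each value of X.
cumulative = [round(c, 2) for c in accumulate(pmf)]

def F(x):
    """Step-function CDF: total probability of all values <= x."""
    total = 0.0
    for v, p in zip(values, pmf):
        if v <= x:
            total += p
    return round(total, 2)

print(cumulative)
print(F(1.5))
```

Because F is a step function, `F(1.5)` returns the same value as `F(1)`: no probability is added between the possible values of X.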
Cumulative Probability Function
Figure: Cumulative Probability Function for Number of Girls in a Three-Child
Family
[Step plot: F(X) against X, Number of Girls. F jumps by 0.14, 0.39, 0.36 and 0.11 at X = 0, 1, 2 and 3, reaching 0.14, 0.53, 0.89 and 1.00.]
Continuous Probability Functions
A continuous random variable is one that can assume an uncountably infinite
number of values.
Because there are infinitely many possible values, the probability of any
particular value is zero, so adding up the probabilities of individual values
tells us nothing about how X is distributed.
Thus, we can’t work with the probabilities of individual values; instead we talk
about densities and use the term probability density function (pdf).
Although the probability that a continuous random variable takes on any
particular value is zero, the probability that the variable takes on a value within
a particular interval is nonzero and can be found.
Probability and Cumulative Density Functions
The probabilities that the value of X will fall within any interval are given by
the corresponding area under the curve, f(x)
Figure: Probability Density Function for Clock-Dial Experiment
[Plot: f(x) against x, with the x-axis marked from 1 to 12 and f(x) constant at 1/12.]
The probability that the clock hand stops between 8 and 9 is $\frac{1}{12}$.
Probability and Cumulative Density Functions
The general idea is that the probability density function for a continuous
random variable X can be represented by a curve.
And that the probability that X assumes a value in the interval from a to b
(a < b) is given by the area under this curve bounded by a and b.
We can develop the idea of a pdf by contrasting it with a pmf.
Discrete Case
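Figure: Probability Mass Function
[Plot: spikes f(x) at the values x1 to x11 on the x-axis, with x4 and x10 marked.]
$$\Pr(x_4 < X \le x_{10}) = f(x_5) + f(x_6) + \ldots + f(x_{10}) = \sum_{i=5}^{10} f(x_i) = F(x_{10}) - F(x_4)$$
where F(x) represents the cumulative probability function.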
Continuous Case
Figure: Probability Density Function
[Plot: density curve f(x) against x, with the area between x4 and x10 marked.]
We now need to use the continuous version of summation: integration.
$$\Pr(x_4 < X < x_{10}) = \int_{x_4}^{x_{10}} f(x)\,dx$$
Since the probability that X = x10 is zero, Pr(x4 < X < x10) and
Pr(x4 < X ≤ x10) are equivalent.
Probability and Cumulative Density Functions
In general, if X is a continuous random variable, then the probability that it
assumes a value in the interval from a to b is determined by
$$\Pr(a < X < b) = \int_a^b f(x)\,dx$$
where f(x) is the relevant probability density function.
Some important facts:
Since X must assume some value, this implies:
$$\Pr(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1$$
Probability and Cumulative Density Functions
The probability that X will assume any value less than or equal to some
specific x is
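$$\Pr(X \le x) = F(x) = \int_{-\infty}^{x} f(t)\,dt$$
where F(x) is the cumulative probability of x and t is a dummy variable of
integration.
Figure: Probability Density Function and its Relationship to F(x)
[Plot: density curve f(x) against x; the area under the curve to the left of x0 equals F(x0).]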
Probability and Cumulative Density Functions
Figure: Cumulative Density Function
[Plot: F(x) against x, rising from 0 to 1, with F(x1) and F(x2) marked on the vertical axis.]
The probability density function, f(x), is the derivative of the cumulative
density function, F(x):
$$f(x) = \frac{\partial F(x)}{\partial x} = F'(x)$$
Probability and Cumulative Density Functions
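$F(-\infty) = 0$ and $F(\infty) = 1$
For a < b,
$$F(b) - F(a) = \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx = \int_a^b f(x)\,dx = \Pr(a < X < b).$$
Figure: Cumulative Density Function
[Plot: density curve f(x) against x; the area to the left of a is F(a) and the area between a and b is F(b) − F(a).]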
Some Integration Rules
$$\int x^n\,dx = \frac{x^{n+1}}{n+1} + c \quad (n \ne -1)$$
$$\int x^{-1}\,dx = \ln|x| + c$$
$$\int k\,dx = kx + c$$
$$\int e^x\,dx = e^x + c$$
Uniform Distribution
Figure: Uniform Density Function
[Plot: f(x) constant at height 1/(b − a) between a and b on the x-axis.]
A uniform density function on the interval [a, b] is a constant, i.e., f(x) = k.
$$1 = \int_a^b f(x)\,dx = \int_a^b k\,dx = kx\,\Big|_a^b = (b - a)k$$
so that
$$k = \frac{1}{b - a}$$
Uniform Distribution
Figure: Finding Probabilities Using a Uniform Density Function
[Plot: f(x) constant at 1/(b − a) on [a, b], with the rectangle between c and d shaded.]
$$\Pr(c \le X \le d) = \text{Area of Shaded Rectangle} = \frac{d - c}{b - a}$$
Uniform Distribution
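For a ≤ c ≤ d ≤ b,
$$\Pr(c \le X \le d) = \frac{d - c}{b - a}$$
Proof.
$$\begin{aligned}
\Pr(c \le X \le d) &= F(d) - F(c) = \int_a^d f(x)\,dx - \int_a^c f(x)\,dx \\
&= \frac{x}{b - a}\,\Big|_a^d - \frac{x}{b - a}\,\Big|_a^c \\
&= \left(\frac{d}{b - a} - \frac{a}{b - a}\right) - \left(\frac{c}{b - a} - \frac{a}{b - a}\right) \\
&= \left(\frac{d - a}{b - a}\right) - \left(\frac{c - a}{b - a}\right) = \frac{d - c}{b - a}
\end{aligned}$$
where we take advantage of the fact that $f(x) = \frac{1}{b - a}$.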
Uniform Distribution
Example: Suppose that X is uniformly distributed between 0 and 5.
Then the density of X is
$$f(x) = \frac{1}{5 - 0} = \frac{1}{5}$$
What is Pr(2 ≤ X ≤ 4.5)?
$$\Pr(2 \le X \le 4.5) = \int_2^{4.5} f(x)\,dx = \frac{4.5 - 2}{5 - 0} = \frac{1}{2}$$
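The rectangle-area formula can be wrapped in a small helper. This is a sketch only: the function name `uniform_prob` is hypothetical, and clipping [c, d] to the support [a, b] is an added convenience that the slides do not discuss.

```python
def uniform_prob(c, d, a, b):
    """Pr(c <= X <= d) for X uniform on [a, b].

    ASSUMPTION: [c, d] is clipped to [a, b], since the density is zero
    outside the support.
    """
    if not a < b:
        raise ValueError("require a < b")
    lo, hi = max(c, a), min(d, b)
    return max(hi - lo, 0.0) / (b - a)

print(uniform_prob(2, 4.5, 0, 5))   # the example above
print(uniform_prob(0, 0.5, 0, 1))   # standard uniform
```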
Uniform Distribution
Example: A random variable X that has a standard uniform density function
takes on values between 0 and 1.
$$f(x) = \frac{1}{1 - 0} = 1$$
What is Pr(0 ≤ X ≤ 0.5)?
$$\Pr(0 \le X \le 0.5) = \int_0^{0.5} f(x)\,dx = \frac{0.5 - 0}{1 - 0} = \frac{1}{2}$$
Figure: Finding Probabilities Using a Standard Uniform Density Function
[Plot: f(x) = 1 on [0, 1]; the region between 0 and 1/2 has area 1/2.]
Exponential Distribution
Example: What is Pr(0 ≤ X ≤ 2) if X has a standard exponential density
function? The standard exponential density function is $f(x) = e^{-x}$ for x ≥ 0.
Answer:
$$\begin{aligned}
\Pr(0 \le X \le 2) &= \int_0^2 e^{-x}\,dx = -e^{-x}\,\Big|_0^2 = -e^{-2} - \left(-e^{-0}\right) \\
&= -e^{-2} - (-1) = 1 - e^{-2} = 1 - \frac{1}{e^2} \approx 1 - \frac{1}{7.39} \approx 0.86
\end{aligned}$$
where we take advantage of the fact that $e^0 = 1$.
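The closed-form answer can be checked numerically. The sketch below uses a simple midpoint rule written for this purpose (`midpoint_integral` is a hypothetical helper, not a library function) and compares it with 1 − e^{−2}.

```python
import math

def midpoint_integral(f, lo, hi, n=100_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

exact = 1 - math.exp(-2)                                       # closed form
approx = midpoint_integral(lambda x: math.exp(-x), 0.0, 2.0)   # area under the density

print(round(exact, 4), round(approx, 4))
```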
Example
Given $f(x) = cx^2$, 0 ≤ x ≤ 2, and f(x) = 0 elsewhere, find the value of c for
which f(x) is a valid density function.
We require a value for c such that
$$1 = \int_{-\infty}^{\infty} f(x)\,dx = \int_0^2 cx^2\,dx = \frac{cx^3}{3}\,\Big|_0^2 = \left(\frac{8}{3}\right)c$$
Thus, $\left(\frac{8}{3}\right)c = 1$, and we find that $c = \frac{3}{8}$. Thus, $f(x) = \frac{3}{8}x^2$.
Example
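Given $f(x) = \frac{3}{8}x^2$ for 0 ≤ x ≤ 2, find P(1 ≤ X ≤ 2).
$$P(1 \le X \le 2) = \frac{3}{8}\int_1^2 x^2\,dx = \left(\frac{3}{8}\right)\frac{x^3}{3}\,\Big|_1^2 = \frac{3}{8}\left(\frac{8}{3} - \frac{1}{3}\right) = \frac{7}{8}$$
Now find P(1 < X < 2).
Because X has a continuous distribution, it follows that
P(X = 1) = P(X = 2) = 0 and, therefore, that
$$P(1 \le X \le 2) = P(1 < X < 2) = \frac{7}{8}$$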
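Both steps of this example (finding c, then integrating the density) can be reproduced exactly with rational arithmetic. The helper `integral_of_x_squared` is a hypothetical name and simply applies the antiderivative x³/3.

```python
from fractions import Fraction

def integral_of_x_squared(lo, hi):
    """Exact integral of x^2 over [lo, hi], using the antiderivative x^3 / 3."""
    return (Fraction(hi) ** 3 - Fraction(lo) ** 3) / 3

# Normalising constant: c * (integral of x^2 from 0 to 2) must equal 1.
c = 1 / integral_of_x_squared(0, 2)

# P(1 <= X <= 2) under the density f(x) = c * x^2.
p_1_to_2 = c * integral_of_x_squared(1, 2)

print(c, p_1_to_2)
```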
Two Random Variables
The distribution of a single random variable is known as a univariate
distribution.
But we might be interested in the intersection of two events, in which case we
need to look at joint distributions.
The joint distributions of two or more random variables are termed bivariate or
multivariate distributions.
Example: We might be interested in the possible outcomes of tossing a coin
and rolling a die. The 12 sample points associated with this experiment are
equiprobable and correspond to the 12 numerical events (x, y). Because all
pairs (x, y) occur with the same relative frequency, we assign probability
$\frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$ to each sample point.
Jointly Discrete Random Variables
Figure: Bivariate Probability Function where x = Rolling a Die and
y = Tossing a Coin
[Plot: f(x, y) = 1/12 at each of the 12 points with x = 1, ..., 6 and y = 0, 1.]
We can write the joint (or bivariate) probability (mass) function for X and Y
as:
$$p(x, y) = f(x, y) \equiv \Pr(X = x \cap Y = y), \quad x = 1, 2, \ldots, 6, \quad y = 0, 1.$$
Jointly Discrete Random Variables
If X and Y are discrete random variables with joint probability function
f(x, y) = f(X = x, Y = y), then
1 $f(x, y) \ge 0 \;\; \forall x, y$
2 $\sum_x \sum_y f(x, y) = 1$, where the sum is over all values (x, y) that are
assigned nonzero probabilities.
Joint and Marginal Distributions
Table: Joint Probability Table (Crosstab): Joint and Marginal Distributions
              Value of X
Value of Y    1      2      3      4      5      6      fY(y)
0             1/12   1/12   1/12   1/12   1/12   1/12   1/2
1             1/12   1/12   1/12   1/12   1/12   1/12   1/2
fX(x)         1/6    1/6    1/6    1/6    1/6    1/6    1
The probability that X assumes a certain value and Y assumes a certain value
is called the joint probability of x and y and is written f(x, y).
The probability that X will assume a certain value alone (ignoring the value of
Y) is called the marginal probability of x and is written fX(x).
$$f_X(x_i) = \sum_j f(x_i, y_j) = f(x_i, y_1) + f(x_i, y_2) + \cdots$$
Conditional Distributions
The multiplicative law gave us the probability of the intersection A ∩ B as
P(A ∩ B) = P(A)P(B|A)
It follows directly from the multiplicative law of probability that the bivariate
probability for the intersection of (x, y) is
$$f(x, y) = f_X(x)\,f_{Y|X}(y|x) = f_Y(y)\,f_{X|Y}(x|y)$$
fX(x) and fY(y) are the univariate or marginal probability distributions for X
and Y individually.
$f_{Y|X}(y|x)$ is the conditional probability that the random variable Y equals y
given that X takes on the value x, and $f_{X|Y}(x|y)$ is the conditional probability
that the random variable X equals x given that Y takes on the value y.
Conditional Distributions
Definition: If X and Y are jointly discrete random variables with joint
probability function f(x, y) and marginal probability functions fX(x) and
fY (y), respectively, then the conditional discrete probability function of X
given Y is
$$p(x|y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}$$
provided that fY(y) > 0.
Example: The probability of rolling a six on the die given that you have tossed
a head with the coin is
$$f_{X|Y}(6|1) = \frac{f(6, 1)}{f_Y(1)} = \frac{1/12}{1/2} = \frac{1}{6}.$$
Conditional Distributions
Figure: X = Number of Girls and Y = Number of Runs Defined on the
Sample Space
Outcome e   Pr(e)   X value   Y value
BBB         .14     0         1
BBG         .13     1         2
BGB         .13     1         3
BGG         .12     2         2
GBB         .13     1         2
GBG         .12     2         3
GGB         .12     2         2
GGG         .11     3         1
Conditional Distributions
Figure: Bivariate Probability Function where X = Number of Girls and
Y = Number of Runs
[Three-dimensional plot: p(x, y) over x = 0, 1, 2, 3 and y = 1, 2, 3.]
Conditional Distributions
Figure: Joint Probability Table for x = Number of Girls and y = Number of
Runs
x \ y   1     2     3     p(x)
0       .14   0     0     .14
1       0     .26   .13   .39
2       0     .24   .12   .36
3       .11   0     0     .11
p(y)    .25   .50   .25   1.00

The joint probability of (1, 2), i.e. one girl and two runs, is f(1, 2) = 0.26.
The marginal probability of one girl is fX(1) = 0 + 0.26 + 0.13 = 0.39.
The conditional probability of 1 girl given two runs is:
$$f_{X|Y}(1|2) = \frac{f(1, 2)}{f_Y(2)} = \frac{0.26}{0.50} = 0.52$$
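The marginal and conditional calculations above can be sketched directly from the joint table. The dictionary layout and function names are illustrative choices, not part of the lecture.

```python
# Joint probabilities f(x, y): x = number of girls, y = number of runs.
joint = {
    (0, 1): 0.14, (0, 2): 0.00, (0, 3): 0.00,
    (1, 1): 0.00, (1, 2): 0.26, (1, 3): 0.13,
    (2, 1): 0.00, (2, 2): 0.24, (2, 3): 0.12,
    (3, 1): 0.11, (3, 2): 0.00, (3, 3): 0.00,
}

def marginal_x(x):
    """f_X(x): sum the joint probabilities across all y (one table row)."""
    return round(sum(p for (xi, _), p in joint.items() if xi == x), 2)

def marginal_y(y):
    """f_Y(y): sum the joint probabilities across all x (one table column)."""
    return round(sum(p for (_, yi), p in joint.items() if yi == y), 2)

def conditional_x_given_y(x, y):
    """f_{X|Y}(x|y) = f(x, y) / f_Y(y), defined only when f_Y(y) > 0."""
    return round(joint[(x, y)] / marginal_y(y), 2)

print(marginal_x(1), marginal_y(2), conditional_x_given_y(1, 2))
```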
Independence
Two events A and B are said to be independent if P(A∩ B) = P(A) ×P(B).
If X and Y are discrete random variables with joint probability function f(x, y)
and marginal probability functions fX(x) and fY (y), respectively, then X and
Y are independent if and only if
P(X = x, Y = y) = P(X = x)P(Y = y)
or equivalently
f(x, y) = fX(x)fY (y)
for all pairs of real numbers (x, y).
Independence
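Are rolling a die (X) and tossing a coin (Y) independent?
Does f(1, 0) = fX(1)fY(0)?
Yes: $f(1, 0) = \frac{1}{12}$ and $f_X(1)f_Y(0) = \frac{1}{6} \times \frac{1}{2} = \frac{1}{12}$,
and the same equality holds for every other cell of the joint probability table.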
X and Y are independent random variables.
Whenever X and Y are independent, then the rows (and columns) of the joint
probability table f(x, y) will be proportional.
Independence
Return to the three-child family, with X = Number of Girls and Y = Number of
Runs defined on the sample space as before.
Are X and Y independent?
Independence
From the joint probability table for x = Number of Girls and y = Number of
Runs, f(0, 1) = 0.14, fX(0) = 0.14 and fY(1) = 0.25.
If X and Y were independent, then f(0, 1) would equal fX(0)fY(1). However,
this is not the case: 0.14 ≠ 0.14 × 0.25 = 0.035. X and Y are dependent.
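Independence requires the factorisation to hold in every cell, not just one. A minimal sketch of that check, applied to both examples from the lecture, is given below; `is_independent` is a hypothetical helper and the tolerance is an arbitrary choice to absorb floating-point error.

```python
import math
from itertools import product

def is_independent(joint, tol=1e-9):
    """Check f(x, y) == f_X(x) * f_Y(y) for every pair (x, y).

    `joint` maps (x, y) to f(x, y); missing pairs count as probability 0.
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    fx = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    fy = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        math.isclose(joint.get((x, y), 0.0), fx[x] * fy[y], abs_tol=tol)
        for x, y in product(xs, ys)
    )

# Die (x = 1..6) and coin (y = 0, 1): every cell is 1/12.
die_coin = {(x, y): 1 / 12 for x in range(1, 7) for y in (0, 1)}

# Girls (x) and runs (y) in the three-child family; omitted cells are zero.
girls_runs = {(0, 1): 0.14, (1, 2): 0.26, (1, 3): 0.13,
              (2, 2): 0.24, (2, 3): 0.12, (3, 1): 0.11}

print(is_independent(die_coin), is_independent(girls_runs))
```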
Jointly Continuous Random Variables
Bivariate probability density functions give the joint probability of certain
events, f(x, y).
If X and Y are jointly continuous random variables with a joint density
function given by f(x, y) = f(X = x, Y = y), then
1 $f(x, y) \ge 0 \;\; \forall x, y$
2 $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.
The probability that X lies between a1 and a2 while Y lies between b1 and b2
is:
$$P(a_1 \le X \le a_2,\; b_1 \le Y \le b_2) = \int_{b_1}^{b_2}\int_{a_1}^{a_2} f(x, y)\,dx\,dy$$
Marginal Distributions
Definition: Let X and Y be jointly continuous random variables with a joint
density function given by f(x, y) = f(X = x, Y = y). Then the marginal
density functions of X and Y , respectively are given by
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$$
$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$
Conditional Distributions
Definition: Let X and Y be jointly continuous random variables with joint
density f(x, y) and marginal densities fX(x) and fY (y).
For any y such that fY(y) > 0, the conditional density of X given Y = y is
given by:
$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}$$
and for any x such that fX(x) > 0, the conditional density of Y given X = x
is given by:
$$f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}$$
Independence
Definition: If X and Y are continuous random variables with joint density
function f(x, y) and marginal densities fX(x) and fY (y), respectively, then X
and Y are independent if and only if
f(x, y) = fX(x)fY (y)
for all pairs of real numbers (x, y).
Mathematical Expectations
The probability distribution for a random variable is a theoretical model for the
empirical distribution of data associated with a real population.
Some of the most important characteristics of probability distributions are
termed expected values or mathematical expectations.
We are often interested in the mean and variance of probability distributions.
These numerical descriptive measures provide the parameters for the probability
distribution, f(x).
Recall that while we use $\bar{X}$ and $s^2$ to denote the mean and variance in our
sample, we use the Greek letters $\mu$ and $\sigma^2$ for our
probability (population) distribution.
Expected Value of X (Discrete Random Variable)
The sample mean is $\bar{x} = \sum_{i=1}^{m} \frac{f_i}{n} x_i$, where $f_i$ is the number of
times the value $x_i$ occurs in a sample of size n. This varies from sample to
sample even when the sample size n is large.
x     f(x)     Relative Frequency in Sample
x1    f(x1)    f1/n
x2    f(x2)    f2/n
...   ...      ...
xm    f(xm)    fm/n
Expected Value of X (Discrete Random Variable)
The mean (and variance) are derived in a similar way in that they depend on
the values of x and the associated probabilities.
However, instead of using experimentally derived frequencies, we use the
mathematically derived probability distribution.
The expected value, or mean, of a theoretical probability function is
$$E(X) = \sum_{i=1}^{m} x_i f(x_i).$$
Expected Value of X (Discrete Random Variable)
Definition: Let X be a discrete random variable with probability function f(x).
Then the expected value of X, E(X), is defined as:
$$E(X) = \sum_x x f(x)$$
In other words,
E(X) is equal to the probability-weighted mean of the values of X.
E(X) can be thought of as $\sum (\text{Value} \times \text{Probability})$ for all possible values.
If f(x) is an accurate description of the population frequency distribution, then
E(X) = µ, the population mean.
Expected Value of X (Discrete Random Variable)
The term “expected value” is used to emphasize the relation between the
population mean and one’s anticipation about the outcome of an experiment.
In effect, the expected value of X is the value of the sample mean
$\sum_i \frac{f_i}{n} x_i$ that we would expect to get if we could repeat whatever
experiment we are looking at an infinite number of times.
It follows from this that E(X) need not be a value taken on by X.
Expected Value of X (Discrete Random Variable)
Example: Suppose that a game is to be played with a single die. Let X be the
random variable giving the amount of money to be won on any toss.
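Table: Die Rolling Wager
Die face   1     2     3     4     5     6
xj         0     20    0     40    0     -30
f(xj)      1/6   1/6   1/6   1/6   1/6   1/6
$$E(X) = 0\left(\tfrac{1}{6}\right) + 20\left(\tfrac{1}{6}\right) + 0\left(\tfrac{1}{6}\right) + 40\left(\tfrac{1}{6}\right) + 0\left(\tfrac{1}{6}\right) + (-30)\left(\tfrac{1}{6}\right) = 5$$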
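The wager calculation is an instance of the general rule E(X) = Σ x f(x). A short sketch, with a hypothetical `expected_value` helper and exact fractions to avoid rounding:

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X): sum of x * f(x) over a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Winnings by die face, taken from the table; each face has probability 1/6.
winnings = {1: 0, 2: 20, 3: 0, 4: 40, 5: 0, 6: -30}

# Collapse the six faces onto the distinct values of X (three faces pay 0).
pmf_x = {}
for face, x in winnings.items():
    pmf_x[x] = pmf_x.get(x, Fraction(0)) + Fraction(1, 6)

mean_winnings = expected_value(pmf_x)
print(pmf_x, mean_winnings)
```

Note that the expected winnings, 5, are not an amount that can be won on any single toss.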
Expected Value of X (Continuous Random Variable)
The intuition for a continuous random variable is the same as for the discrete
case: E(X) is a “typical” value, gained by “summing” across values of X . . .
Definition: Let X be a continuous random variable with probability density
function f(x). Then the expected value of X, E(X), is defined as:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
Expected Value of X (Continuous Random Variable)
Example: The density function of a random variable X is given by $f(x) = \frac{1}{2}x$
for 0 < x < 2, 0 otherwise. Find E(X).
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^2 x\left(\frac{1}{2}x\right)dx = \int_0^2 \frac{x^2}{2}\,dx = \frac{x^3}{6}\,\Big|_0^2 = \frac{4}{3}$$
Expected Value of X (Continuous Random Variable)
Example: Consider $f(x) = \frac{1}{2}(x + 1)$ for −1 < x < 1 (with f(x) = 0
otherwise).
What is the expected value (mean) of X?
Expected Value of X (Continuous Random Variable)
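$$\begin{aligned}
E(X) &= \int_{-1}^{1} x\left(\frac{x + 1}{2}\right)dx \\
&= \int_{-1}^{1} \frac{1}{2}\left(x^2 + x\right)dx \\
&= \frac{1}{2}\int_{-1}^{1} x^2\,dx + \frac{1}{2}\int_{-1}^{1} x\,dx \\
&= \frac{1}{2}\left[\frac{x^3}{3}\right]_{-1}^{1} + \frac{1}{2}\left[\frac{x^2}{2}\right]_{-1}^{1} \\
&= \frac{1}{2}\left[\frac{x^3}{3} + \frac{x^2}{2}\right]_{-1}^{1} \\
&= \frac{1}{2}\left[\left(\frac{1^3}{3} + \frac{1^2}{2}\right) - \left(\frac{(-1)^3}{3} + \frac{(-1)^2}{2}\right)\right] = \frac{1}{3}
\end{aligned}$$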
Some Theorems on Expectation
1 If c is any constant, then
E(c) = c
and
E(cX) = cE(X)
2 If X and Y are any random variables, then
E(X + Y) = E(X) + E(Y)
3 If X and Y are independent random variables, then
E(XY) = E(X)E(Y)
4 If X is a random variable and a and b are constants, then
E(aX + b) = aE(X) + b
Some Theorems on Expectation
Proof.
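$$\begin{aligned}
E(X + Y) &= \sum_i \sum_j f(x_i, y_j)(x_i + y_j) \\
&= \sum_i \sum_j f(x_i, y_j)\,x_i + \sum_i \sum_j f(x_i, y_j)\,y_j \\
&= \sum_i x_i \sum_j f(x_i, y_j) + \sum_j y_j \sum_i f(x_i, y_j) \\
&= \sum_i x_i f_X(x_i) + \sum_j y_j f_Y(y_j) \\
&= E(X) + E(Y)
\end{aligned}$$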
Some Theorems on Expectation
Proof.
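$$\begin{aligned}
E(aX + b) &= \sum_{i=1}^{n} (a x_i + b) f(x_i) \\
&= \sum_{i=1}^{n} a x_i f(x_i) + \sum_{i=1}^{n} b f(x_i) \\
&= a \sum_{i=1}^{n} x_i f(x_i) + b \sum_{i=1}^{n} f(x_i) \\
&= a \sum_{i=1}^{n} x_i f(x_i) + b = aE(X) + b
\end{aligned}$$
where we took advantage of the fact that $\sum_{i=1}^{n} f(x_i) = 1$.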
Expectations
Example: What is the expected value of g(X) = 2 + 3X, where X is a random
variable obtained by rolling a die?
From the theorems above, we know that if X is a random variable and a and b
are constants, then
E(aX +b) = aE(X) +b
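$$E(X) = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = \frac{21}{6} = 3.5$$
$$E(g(X)) = 2 + 3E(X) = 2 + 3(3.5) = 12.5.$$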
Expectations
This method generalizes quite easily:
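$$E(g(X)) = \begin{cases}
\sum_i g(x_i) f(x_i) & (X \text{ discrete}) \\
\int_{-\infty}^{\infty} g(x) f(x)\,dx & (X \text{ continuous})
\end{cases}$$
In terms of our example, therefore, we have:
$$\begin{aligned}
E(g(X)) &= [2 + 3(1)]\tfrac{1}{6} + [2 + 3(2)]\tfrac{1}{6} + [2 + 3(3)]\tfrac{1}{6} + [2 + 3(4)]\tfrac{1}{6} + [2 + 3(5)]\tfrac{1}{6} + [2 + 3(6)]\tfrac{1}{6} \\
&= [2 + 3]\tfrac{1}{6} + [2 + 6]\tfrac{1}{6} + [2 + 9]\tfrac{1}{6} + [2 + 12]\tfrac{1}{6} + [2 + 15]\tfrac{1}{6} + [2 + 18]\tfrac{1}{6} \\
&= \frac{5}{6} + \frac{8}{6} + \frac{11}{6} + \frac{14}{6} + \frac{17}{6} + \frac{20}{6} = \frac{75}{6} = 12.5
\end{aligned}$$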
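The two routes to E(g(X)), summing g(x)f(x) directly and applying E(aX + b) = aE(X) + b, can be compared in a few lines. The helper `expect` is a hypothetical name; a fair die is assumed, as in the example.

```python
from fractions import Fraction

# pmf of a fair die: each face 1..6 has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] for a discrete X: sum of g(x) * f(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())

e_x = expect(lambda x: x, die_pmf)             # E(X)
direct = expect(lambda x: 2 + 3 * x, die_pmf)  # sum of g(x) f(x)
via_linearity = 2 + 3 * e_x                    # a E(X) + b with a = 3, b = 2

print(e_x, direct, via_linearity)
```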
Expectations
Example: What is the expected value of X when X has a standard uniform
distribution, i.e. f(x) = 1 for 0 ≤ x ≤ 1?
$$E(X) = \int_0^1 x f(x)\,dx = \int_0^1 x \cdot 1\,dx = \frac{x^2}{2}\,\Big|_0^1 = \frac{1}{2} - 0 = \frac{1}{2}$$
Expectations
Example: What is the expected value of X, when X is distributed according to
a standard exponential, i.e. $f(x) = e^{-x}$ for x ≥ 0?
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^{\infty} x e^{-x}\,dx$$
This last part follows from the fact that the exponential distribution only goes
from 0 to ∞.
Expectations
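$$E(X) = \int_0^{\infty} x e^{-x}\,dx$$
To solve this, we need to do integration by parts, $\int u\,dv = uv - \int v\,du$,
with $u = x$ and $dv = e^{-x}dx$, so that $du = dx$ and $v = -e^{-x}$.
$$\begin{aligned}
E(X) &= \int_0^{\infty} x e^{-x}\,dx \\
&= -x e^{-x}\,\Big|_0^{\infty} - \int_0^{\infty} \left(-e^{-x}\right)dx \\
&= \left[-x e^{-x} - e^{-x}\right]_0^{\infty} \\
&= \lim_{x \to \infty}\left(-x e^{-x} - e^{-x}\right) - \left(-0 e^{-0} - e^{-0}\right) \\
&= \lim_{x \to \infty}\left(\frac{-x}{e^x}\right) - 0 + 0 + 1 = 1 + \lim_{x \to \infty}\left(\frac{-x}{e^x}\right)
\end{aligned}$$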
Expectations
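It turns out that $e^x$ grows faster than x, so that (by L'Hôpital's rule)
$$\lim_{x \to \infty} \frac{x}{e^x} = \lim_{x \to \infty} \frac{1}{e^x} = 0$$
Substituting in, we, therefore, have:
$$E(X) = 1 - \lim_{x \to \infty} \frac{x}{e^x} = 1 - 0 = 1$$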
Expectations
Example: A woman leaves for work between 8 a.m. and 8.30 a.m., and takes
between 40 and 60 minutes to get there.
Let the random variable X denote her time of departure, and the random
variable Y the travel time.
Assuming that these variables are independent and uniformly distributed, find
the following:
1 Her expected arrival time.
2 The probability that she arrives before 9 a.m.
Expectations
The expected arrival time is E(X +Y ) = E(X) +E(Y ).
Measure X in minutes past 8 a.m., so that X is uniform on [0, 30] with
density $f(x) = \frac{1}{30}$:
$$E(X) = \int_0^{30} x f(x)\,dx = \int_0^{30} x\,\frac{1}{30}\,dx = \frac{x^2}{60}\,\Big|_0^{30} = \frac{900}{60} - 0 = 15$$
A similar method shows that E(Y ) = 50. Thus, E(X) +E(Y ) = 65. If we
start at 8 a.m. and then add 65 minutes, we see that the expected arrival time
is 9.05 a.m.
Expectations
What’s the probability that she arrives before 9 a.m., i.e. P(X + Y ≤ 9 a.m.)?
Let’s convert this to minutes past 8 a.m. In other words, we want to know
P(X + Y ≤ 60).
Put differently, we want to know FZ(60) where Z = X + Y and FZ is the
cumulative density function of Z.
We can think about this graphically.
Expectations
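Figure: Travel Times
[Plot: X (departure time in minutes past 8 a.m., from 0 to 30) on the horizontal axis and Y (travel time in minutes, from 40 to 60) on the vertical axis, with a dotted rectangle and a grey triangle.]
The dotted rectangle indicates all the points where the joint density of X and
Y is nonzero.
X and Y are jointly uniformly distributed over this rectangle.
The grey triangle represents all the points that correspond to the event Z ≤ 60.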
Expectations
The area of the whole rectangle is 20 × 30 = 600 and the area of the triangle is
$\frac{1}{2}(20) \times 20 = 200$.
Thus, the probability that the woman arrives before 9 a.m. is $\frac{200}{600} = \frac{1}{3}$.
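The geometric answer can be sanity-checked by simulation. This is a swapped-in technique (Monte Carlo), not the method used in the lecture; the function name, sample size and seed are arbitrary choices for illustration.

```python
import random

def prob_arrive_by(deadline, n=200_000, seed=0):
    """Monte Carlo estimate of Pr(X + Y <= deadline), in minutes past 8 a.m.

    X ~ Uniform(0, 30) is the departure time and Y ~ Uniform(40, 60) is the
    travel time, drawn independently as in the example.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        depart = rng.uniform(0, 30)
        travel = rng.uniform(40, 60)
        if depart + travel <= deadline:
            hits += 1
    return hits / n

estimate = prob_arrive_by(60)
print(estimate)   # should be close to 200/600 = 1/3
```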
Expectations: Two Random Variables
Definition: Let g(X, Y) be a function of the discrete random variables, X and
Y, which have bivariate probability function f(x, y). Then the expected value
of g(X, Y) is
$$E[g(X, Y)] = \sum_i \sum_j g(x_i, y_j) f(x_i, y_j)$$
Definition: Let g(X, Y) be a function of the continuous random variables, X
and Y, which have bivariate density function f(x, y). Then the expected value
of g(X, Y) is
$$E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy$$
Expectations: Two Random Variables
Example: In planning a three-child family, suppose that annual clothing costs
are:
R = g(x, y) = 10 +x +y
Calculate the expected cost E(R).
$$E[g(X, Y)] = \sum_i \sum_j g(x_i, y_j) f(x_i, y_j)$$
where R = g(x, y) = 10 + x + y.
Expectations: Two Random Variables
This is the joint probability table for f(x, y).
X \ Y   1      2      3
0       0.14   0      0
1       0      0.26   0.13
2       0      0.24   0.12
3       0.11   0      0
Now we need to calculate g(x, y).
Expectations: Two Random Variables
This table gives g(x, y)f(x, y) for each cell of the joint probability table.
X \ Y   1               2               3
0       (10+0+1)0.14    0               0
1       0               (10+1+2)0.26    (10+1+3)0.13
2       0               (10+2+2)0.24    (10+2+3)0.12
3       (10+3+1)0.11    0               0
If we sum up all these values across rows and columns, we have E(R) = 13.44.
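The double sum can be written out as a short sketch. Only the nonzero cells of the joint table are listed; the dictionary layout is an illustrative choice.

```python
# Nonzero cells of the joint table f(x, y): x = girls, y = runs.
joint = {(0, 1): 0.14, (1, 2): 0.26, (1, 3): 0.13,
         (2, 2): 0.24, (2, 3): 0.12, (3, 1): 0.11}

def g(x, y):
    """Annual clothing cost R = g(x, y) = 10 + x + y, as in the example."""
    return 10 + x + y

# E[g(X, Y)] = sum over all cells of g(x, y) * f(x, y).
expected_cost = round(sum(g(x, y) * p for (x, y), p in joint.items()), 2)
print(expected_cost)
```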
Expectations: Two Random Variables
Example: Let X and Y have a joint density function given by f(x, y) = 2x for
0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, 0 otherwise. Find E(XY ).
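$$\begin{aligned}
E(XY) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy \\
&= \int_0^1\int_0^1 xy(2x)\,dx\,dy \\
&= \int_0^1 y\left(\frac{2x^3}{3}\,\Big|_0^1\right)dy = \int_0^1 \left(\frac{2}{3}\right)y\,dy \\
&= \frac{2}{3}\,\frac{y^2}{2}\,\Big|_0^1 = \frac{1}{3}
\end{aligned}$$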
Expectations: Two Random Variables
Example: Let X and Y have a joint density function given by f(x, y) = 2x for
0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, 0 otherwise. Find E(X).
It turns out that if X and Y are two continuous random variables with joint
density function f(x, y), then the means or expectations of X and Y are
$$E(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy$$
$$E(Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dx\,dy$$
Expectations: Two Random Variables
Thus, in our example, we have:
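$$\begin{aligned}
E(X) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy \\
&= \int_0^1\int_0^1 x(2x)\,dx\,dy \\
&= \int_0^1 \left(\frac{2x^3}{3}\,\Big|_0^1\right)dy = \int_0^1 \left(\frac{2}{3}\right)dy \\
&= \frac{2}{3}\,y\,\Big|_0^1 = \frac{2}{3}
\end{aligned}$$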
Variance: Discrete Random Variable
Definition: If X is a discrete random variable with mean E(X) = µ, then the
variance of X is defined to be the expected value of $(X - \mu)^2$. That is:
$$\sigma^2 = E\{[X - E(X)]^2\} = E[(X - \mu)^2] = \sum_x (x - \mu)^2 f(x)$$
The standard deviation of X, σ, is the positive square root of $\sigma^2$.
There is an alternative way of calculating the variance:
$$\sigma^2 = E(X^2) - [E(X)]^2 = E(X^2) - \mu^2 = \sum_x x^2 f(x) - \mu^2$$
Variance: Discrete Random Variable
Table: Calculating the Mean and Variance of X = Number of Girls
Given probability distribution: x, f(x)
Calculation of µ: $\mu = \sum_x x f(x)$
Calculation of σ² (definition): $\sigma^2 = \sum_x (x - \mu)^2 f(x)$
Calculation of σ² (alternative): $\sigma^2 = \sum_x x^2 f(x) - \mu^2$
x     f(x)    xf(x)   (x − µ)   (x − µ)²   (x − µ)²f(x)   x²f(x)
0     0.14    0       -1.44     2.07       0.29           0
1     0.39    0.39    -0.44     0.19       0.08           0.39
2     0.36    0.72    0.56      0.31       0.11           1.44
3     0.11    0.33    1.56      2.43       0.27           0.99
Sum           µ = 1.44                     σ² = 0.75      2.82
$\sum_x x^2 p(x) = 2.82$
$\mu^2 = 2.07$
Difference = $\sigma^2$ = 0.75
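Both variance formulas can be checked against each other with a small sketch. The function name `mean_and_variance` is hypothetical; the pmf is the number-of-girls distribution from the table. Without intermediate rounding the variance comes out as 0.7464, which the table rounds to 0.75.

```python
def mean_and_variance(pmf):
    """Return (mu, variance by definition, variance by the shortcut formula)."""
    mu = sum(x * p for x, p in pmf.items())
    var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())
    var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2
    return mu, var_definition, var_shortcut

girls_pmf = {0: 0.14, 1: 0.39, 2: 0.36, 3: 0.11}
mu, var_def, var_short = mean_and_variance(girls_pmf)
print(round(mu, 2), round(var_def, 4), round(var_short, 4))
```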
Variance: Discrete Random Variable
Bikes Sold Example: On the basis of past experience, the buyer for a large
sports store estimates that the number of 10-speed bicycles sold next year will
be somewhere between 40 and 90 with the distribution shown in the first two
columns of the table on the next slide.
Questions:
What is the mean number sold? What is the estimated standard
deviation?
If 60 are ordered, what is the chance they will all be sold? What is the
chance some will be left over?
To be almost sure (95%) of having enough bicycles, how many should be
ordered?
Variance: Discrete Random Variable
Table: Calculating the Mean and Variance of X = Number of Bikes Sold

Columns 1-2 give the probability distribution, column 3 calculates
$\mu = \sum_x x\,p(x)$, columns 4-6 calculate $\sigma^2 = \sum_x (x - \mu)^2 p(x)$,
and column 7 calculates $\sigma^2 = \sum_x x^2 p(x) - \mu^2$.

x     p(x)    x p(x)    (x - µ)    (x - µ)^2    (x - µ)^2 p(x)    x^2 p(x)
40    0.05     2        -22        484          24.2                80
50    0.15     7.5      -12        144          21.6               375
60    0.41    24.6       -2          4           1.64             1476
70    0.34    23.8        8         64          21.76             1666
80    0.04     3.2       18        324          12.96              256
90    0.01     0.9       28        784           7.84               81
Sum           62                                90                3934

From the definition, $\mu = 62$ and $\sigma^2 = 90$.
From the alternative formula, $\sum_x x^2 p(x) = 3934$ and $\mu^2 = 3844$, so the
difference is $\sigma^2 = 90$.
Variance: Discrete Random Variable
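Answers:
The mean is 62 and the standard deviation is $\sqrt{90} \approx 9.5$.
If 60 are ordered, they will all be sold whenever demand is at least 60:
Pr(X ≥ 60) = 0.41 + 0.34 + 0.04 + 0.01 = 0.80, i.e. an 80% chance.
Some will be left over whenever demand is below 60:
Pr(X < 60) = 1 − Pr(X ≥ 60) = 0.20, i.e. a 20% chance.
Accumulating p(x) from the smallest value upward gives
Pr(X ≤ 70) = 0.05 + 0.15 + 0.41 + 0.34 = 0.95. So, to be 95% sure
of having enough bicycles, we should order 70 bicycles.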
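The answers to the bicycle questions can be recomputed directly from the distribution. The sketch below re-enters the x and p(x) columns of the table; the helper `smallest_order` is a hypothetical name for the rule "order the smallest quantity whose cumulative probability reaches the target".

```python
import math

# Distribution of bikes sold, taken from the first two columns of the table.
pmf = {40: 0.05, 50: 0.15, 60: 0.41, 70: 0.34, 80: 0.04, 90: 0.01}

mean = sum(x * p for x, p in pmf.items())
variance = sum(x * x * p for x, p in pmf.items()) - mean ** 2
std_dev = math.sqrt(variance)

# If 60 are ordered, all are sold when demand is at least 60.
p_all_sold = sum(p for x, p in pmf.items() if x >= 60)
p_left_over = 1.0 - p_all_sold


def smallest_order(pmf, coverage):
    """Smallest x with Pr(X <= x) >= coverage (small tolerance for float rounding)."""
    cumulative = 0.0
    for x in sorted(pmf):
        cumulative += pmf[x]
        if cumulative >= coverage - 1e-12:
            return x
    return max(pmf)


order_95 = smallest_order(pmf, 0.95)
print(mean, round(std_dev, 2), round(p_all_sold, 2), round(p_left_over, 2), order_95)
```

The tolerance inside `smallest_order` is there only because sums such as 0.05 + 0.15 + 0.41 + 0.34 need not equal 0.95 exactly in floating point.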
Variance: Continuous Random Variable
Definition: If X is a continuous random variable with mean E(X) = µ, then
the variance of a random variable X is defined to be the expected value of
$(X - \mu)^2$. That is:
\[
\begin{aligned}
\text{Variance, } \sigma^2 &= E\{[X - E(X)]^2\} \\
                           &= E[(X - \mu)^2] \\
                           &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx
\end{aligned}
\]
The standard deviation of X, σ, is the positive square root of $\sigma^2$.
Variance: Continuous Random Variable
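Example: The density function of a random variable X is given by
$f(x) = \frac{1}{2}x$ for 0 < x < 2, 0 otherwise. Find the variance and standard
deviation of this variable. We already found that $\mu = E(X) = \frac{4}{3}$ in one
of our earlier examples.
\[
\begin{aligned}
\sigma^2 &= E\left[\left(X - \frac{4}{3}\right)^{2}\right]
          = \int_{-\infty}^{\infty} \left(x - \frac{4}{3}\right)^{2} f(x)\,dx \\
         &= \int_{0}^{2} \left(x - \frac{4}{3}\right)^{2} \left(\frac{1}{2}x\right) dx
          = \frac{2}{9}
\end{aligned}
\]
and so the standard deviation is
\[
\sigma = \sqrt{\frac{2}{9}} = \frac{\sqrt{2}}{3}
\]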
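The integral in this example can be approximated numerically as a sanity check. The sketch below uses a simple midpoint rule; the helper `integrate` and the number of subintervals `n` are assumptions made for illustration, not something the lecture prescribes.

```python
import math


def density(x):
    """Density from the example: f(x) = x / 2 on (0, 2), 0 otherwise."""
    return 0.5 * x if 0.0 < x < 2.0 else 0.0


def integrate(func, lower, upper, n=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (k + 0.5) * h) for k in range(n))


mean = integrate(lambda x: x * density(x), 0.0, 2.0)                  # close to 4/3
variance = integrate(lambda x: (x - mean) ** 2 * density(x), 0.0, 2.0)  # close to 2/9
std_dev = math.sqrt(variance)                                         # close to sqrt(2)/3
print(mean, variance, std_dev)
```

The printed values should match 4/3, 2/9 and sqrt(2)/3 to several decimal places, with the remaining gap coming from the numerical approximation.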
Some Theorems on Variance
1. If c is any constant, then
   \[ \operatorname{Var}(c) = 0 \]
   and
   \[ \operatorname{Var}(cX) = c^{2}\operatorname{Var}(X) \]
2. The quantity $E[(X - a)^2]$ is a minimum when a = µ = E(X).
3. If X and Y are independent random variables, then
   (a) $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
   (b) $\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
   Theorem (a) indicates that the variance of a sum of independent
   variables equals the sum of their variances.
4. If X is a random variable and a and b are constants, then
   \[ \operatorname{Var}(aX + b) = a^{2}\operatorname{Var}(X) \]
Some Theorems on Variance
Proof.
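Claim: $\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ for independent X and Y.
Define a = −1. Then, since X and aY are also independent,
\[
\begin{aligned}
\operatorname{Var}(X + aY) &= \operatorname{Var}(X) + \operatorname{Var}(aY) \\
                           &= \operatorname{Var}(X) + a^{2}\operatorname{Var}(Y) \\
                           &= \operatorname{Var}(X) + \operatorname{Var}(Y)
\end{aligned}
\]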
Some Theorems on Variance
Proof.
\[
\operatorname{Var}(aX + b) = E\{[(aX + b) - E(aX + b)]^{2}\}
\]
and since E(aX + b) = aµ + b,
\[
\operatorname{Var}(aX + b) = E[(aX - a\mu)^{2}] = a^{2}E[(X - \mu)^{2}] = a^{2}\operatorname{Var}(X)
\]
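The variance theorems can be verified exactly on a small discrete example. In the sketch below, X reuses the "Number of Girls" distribution, while the second variable Y (a fair 0/1 coin assumed independent of X), the constants a and b, and the helper names are all hypothetical choices made for illustration.

```python
from fractions import Fraction

# pmf of X from the "Number of Girls" table; Fractions keep the arithmetic exact.
pmf_x = {0: Fraction("0.14"), 1: Fraction("0.39"), 2: Fraction("0.36"), 3: Fraction("0.11")}
# Hypothetical second variable, assumed independent of X.
pmf_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def variance(pmf, transform=lambda v: v):
    """Variance of transform(V) when V has the given pmf."""
    mean = sum(transform(v) * p for v, p in pmf.items())
    return sum((transform(v) - mean) ** 2 * p for v, p in pmf.items())


def variance_of_combination(pmf_x, pmf_y, combine):
    """Variance of combine(X, Y) for independent X and Y (joint pmf = product of marginals)."""
    joint = {(x, y): px * py for x, px in pmf_x.items() for y, py in pmf_y.items()}
    mean = sum(combine(x, y) * p for (x, y), p in joint.items())
    return sum((combine(x, y) - mean) ** 2 * p for (x, y), p in joint.items())


a, b = -3, 7  # arbitrary constants
var_x, var_y = variance(pmf_x), variance(pmf_y)

var_constant = variance(pmf_x, lambda v: b)        # Theorem 1: Var(c) = 0
var_linear = variance(pmf_x, lambda v: a * v + b)  # Theorem 4: Var(aX + b) = a^2 Var(X)
var_sum = variance_of_combination(pmf_x, pmf_y, lambda x, y: x + y)   # Theorem 3(a)
var_diff = variance_of_combination(pmf_x, pmf_y, lambda x, y: x - y)  # Theorem 3(b)

print(var_constant, var_linear == a ** 2 * var_x, var_sum == var_diff == var_x + var_y)
```

Using `Fraction` rather than floats means the equalities hold exactly, so the comparison is not blurred by rounding.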
Covariance (Variance of Joint Distributions)
If X and Y are two continuous random variables with joint density function
f(x, y), then we have already seen that:
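\[
E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f(x, y)\,dx\,dy
\]
\[
E(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,f(x, y)\,dx\,dy
\]
\[
E(Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\,f(x, y)\,dx\,dy
\]
The variances of X and Y are:
\[
\operatorname{Var}(X) = \sigma_X^{2} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_X)^{2} f(x, y)\,dx\,dy
\]
\[
\operatorname{Var}(Y) = \sigma_Y^{2} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (y - \mu_Y)^{2} f(x, y)\,dx\,dy
\]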
Covariance
Rather than talk about the joint variance of X and Y , we talk about the
covariance of X and Y .
Definition: If X and Y are random variables with means $\mu_X$ and $\mu_Y$,
respectively, then the covariance of X and Y, often denoted $\sigma_{XY}$, is
calculated as:
\[
\begin{aligned}
\operatorname{Cov}(X, Y) = \sigma_{XY} &= E[(X - E(X))(Y - E(Y))] \\
                                       &= E[(X - \mu_X)(Y - \mu_Y)]
\end{aligned}
\]
We can also write the covariance as
\[
\operatorname{Cov}(X, Y) = \sigma_{XY} = E(XY) - E(X)E(Y)
\]
Covariance
Proof.
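\[
\begin{aligned}
\operatorname{Cov}(X, Y) &= E[(X - E(X))(Y - E(Y))] \\
                         &= E[(X - \mu_X)(Y - \mu_Y)] \\
                         &= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
                         &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
                         &= E(XY) - E(X)E(Y) \\
                         &= E(XY) - \mu_X \mu_Y
\end{aligned}
\]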
Covariance (Intuition)
Figure: Dependent and Independent Observations. Panels (a) and (b) each plot Y
against X, with the means µX and µY marked on the axes.
Covariance is a measure of association between two variables – it indicates how
the values of one variable depend on the values of another.
Why? Well, it has to do with the fact that the average value of
(x −µX)(y −µY ) provides a measure of the linear dependence between X and
Y .
Covariance (Intuition)
Suppose we locate a plotted point (x, y) in Figure (a) and measure the
deviations (x − µX) and (y − µY).
Both deviations assume the same algebraic sign for any point (x, y), so their
product (x − µX)(y − µY) is positive.
Points to the right of µX yield pairs of positive deviations, while points to the
left yield pairs of negative deviations.
The average of the product of the deviations (x − µX)(y − µY) is therefore large
and positive.
If the linear relationship indicated in Figure (a) had sloped down and to the
right, all corresponding pairs of deviations would have had opposite signs, and
the average value of (x − µX)(y − µY) would have been a large negative number.
Covariance (Intuition)
The situation just described does not occur for Figure b, where little
dependence exists between X and Y .
Their corresponding deviations (x −µX) and (y −µY ) will assume the same
algebraic sign for some points and opposite signs for others.
Thus, the product (x −µX)(y −µY ) will be positive for some points, negative
for others, and will average to some value near 0.
From this we can see that the average value of (x −µX)(y −µY ) provides a
measure of the linear dependence between X and Y .
Covariance
The larger the absolute value of the covariance, the greater the linear
dependence between X and Y .
The covariance between two variables can be anywhere between −∞ and +∞.
Positive values indicate that X tends to increase as Y increases, and negative
values indicate that X tends to decrease as Y increases.
A zero value of the covariance indicates that the variables are uncorrelated, and
that there is no linear dependence between X and Y .
Covariance and Statistical Independence
If X and Y are independent random variables, then
Cov(X, Y ) = 0
Thus, independent variables must be uncorrelated.
But Cov(X, Y ) = 0 does not mean that X and Y are independent random
variables.
Covariance and Statistical Independence
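Example: Consider the following joint probability distribution of X and Y.

                       Values of Y
                  -1      0       1      f(x)
              -1  1/16    3/16    1/16   5/16
Values of X    0  3/16    0       3/16   6/16
               1  1/16    3/16    1/16   5/16
            f(y)  5/16    6/16    5/16   1

Independence would require $f_{XY}(x, y) = f_X(x) f_Y(y)$ for every pair (x, y), but
\[
f_{XY}(0, 0) \neq f_X(0)\,f_Y(0), \qquad 0 \neq \frac{6}{16} \times \frac{6}{16}
\]
As a result, we can see that X and Y are not independent.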
Covariance and Statistical Independence
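\[
\operatorname{Cov}(X, Y) = E(XY) - E(X)E(Y)
\]
\[
\begin{aligned}
E(XY) &= \sum_{i=1}^{3}\sum_{j=1}^{3} x_i y_j f(x_i, y_j) \\
      &= (-1)(-1)\tfrac{1}{16} + (-1)(0)\tfrac{3}{16} + (-1)(1)\tfrac{1}{16} \\
      &\quad + (0)(-1)\tfrac{3}{16} + (0)(0)(0) + (0)(1)\tfrac{3}{16} \\
      &\quad + (1)(-1)\tfrac{1}{16} + (1)(0)\tfrac{3}{16} + (1)(1)\tfrac{1}{16} \\
      &= \tfrac{1}{16} - \tfrac{1}{16} - \tfrac{1}{16} + \tfrac{1}{16} = 0
\end{aligned}
\]
\[
E(X) = -1 \times \frac{5}{16} + 0 \times \frac{6}{16} + 1 \times \frac{5}{16} = 0
\]
\[
E(Y) = -1 \times \frac{5}{16} + 0 \times \frac{6}{16} + 1 \times \frac{5}{16} = 0
\]
Thus,
\[
\operatorname{Cov}(X, Y) = 0 - 0(0) = 0
\]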
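The point of this example, zero covariance without independence, can be confirmed with exact arithmetic. The sketch below re-enters the joint table in sixteenths; the dictionary layout and variable names are illustrative choices.

```python
from fractions import Fraction

values = (-1, 0, 1)

# Joint probabilities from the table, in sixteenths, keyed by (x, y).
sixteenths = {
    (-1, -1): 1, (-1, 0): 3, (-1, 1): 1,
    (0, -1): 3,  (0, 0): 0,  (0, 1): 3,
    (1, -1): 1,  (1, 0): 3,  (1, 1): 1,
}
joint = {xy: Fraction(k, 16) for xy, k in sixteenths.items()}

# Marginal distributions: sum the joint table across rows and columns.
f_x = {x: sum(joint[(x, y)] for y in values) for x in values}
f_y = {y: sum(joint[(x, y)] for x in values) for y in values}

e_x = sum(x * p for x, p in f_x.items())
e_y = sum(y * p for y, p in f_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
covariance = e_xy - e_x * e_y

# Independence requires f(x, y) = f_X(x) f_Y(y) in every cell of the table.
independent = all(joint[(x, y)] == f_x[x] * f_y[y] for x in values for y in values)

print(covariance, independent)
```

The covariance comes out as exactly zero, while the independence check fails, for instance in the (0, 0) cell where the joint probability is 0 but the product of the marginals is 36/256.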
Variance Again
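When X and Y are independent random variables, we have
\[
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
\]
\[
\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
\]
More generally, though, if X and Y are random variables, then
\[
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)
\]
\[
\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y)
\]
Similarly, for a weighted sum, we have:
\[
\operatorname{Var}(aX + bY) = a^{2}\operatorname{Var}(X) + b^{2}\operatorname{Var}(Y) + 2ab\operatorname{Cov}(X, Y)
\]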
Variance Again
Proof.
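\[
\begin{aligned}
\operatorname{Var}(X + Y) &= E\{[(X + Y) - E(X + Y)]^{2}\} \\
                          &= E\{[X - E(X) + Y - E(Y)]^{2}\} \\
                          &= E\{[X - E(X)]^{2}\} + E\{[Y - E(Y)]^{2}\} + 2E\{[X - E(X)][Y - E(Y)]\} \\
                          &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)
\end{aligned}
\]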
Variance Again
                       Values of Y
                  -1      0       1      f(x)
              -1  1/16    3/16    1/16   5/16
Values of X    0  3/16    0       3/16   6/16
               1  1/16    3/16    1/16   5/16
            f(y)  5/16    6/16    5/16   1
What’s Var(X +Y )?
One way to answer this is to define a new variable Z = X +Y .
Variance Again
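z_i     f(z_i)
-2      1/16
-1      6/16
 0      2/16
 1      6/16
 2      1/16

\[
\begin{aligned}
\operatorname{Var}(Z) &= E(Z^{2}) - [E(Z)]^{2} \\
  &= \sum_i z_i^{2} f_Z(z_i) - \left[\sum_i z_i f_Z(z_i)\right]^{2} \\
  &= (-2)^{2}\left(\tfrac{1}{16}\right) + (-1)^{2}\left(\tfrac{6}{16}\right) + 0
     + 1^{2}\left(\tfrac{6}{16}\right) + 2^{2}\left(\tfrac{1}{16}\right) \\
  &\quad - \left[-2\left(\tfrac{1}{16}\right) + (-1)\left(\tfrac{6}{16}\right) + 0
     + 1\left(\tfrac{6}{16}\right) + 2\left(\tfrac{1}{16}\right)\right]^{2} \\
  &= \frac{20}{16} - 0 = \frac{20}{16} = 1.25
\end{aligned}
\]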
Variance Again
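Or we can start from the following result:
\[
\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)
\]
Since we know that Cov(X, Y) = 0, we need only calculate Var(X) and
Var(Y).
\[
\begin{aligned}
\operatorname{Var}(X) &= \sum_i x_i^{2} f_X(x_i) - \left[\sum_i x_i f_X(x_i)\right]^{2} \\
  &= (-1)^{2}\left(\tfrac{5}{16}\right) + 0 + (1)^{2}\left(\tfrac{5}{16}\right)
     - \left[-1\left(\tfrac{5}{16}\right) + 0 + 1\left(\tfrac{5}{16}\right)\right]^{2} \\
  &= \frac{10}{16} - (0)^{2} = \frac{10}{16}
\end{aligned}
\]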
Variance Again
\[
\operatorname{Var}(Y) = \frac{10}{16}
\]
Thus,
\[
\operatorname{Var}(X + Y) = \frac{10}{16} + \frac{10}{16} = \frac{20}{16} = 1.25
\]
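Both routes to Var(X + Y) can be compared in code. The sketch below rebuilds the distribution of Z = X + Y from the joint table and also applies the decomposition into variances and covariance; the helper `variance` and the variable names are illustrative, not part of the lecture.

```python
from fractions import Fraction

values = (-1, 0, 1)
sixteenths = {
    (-1, -1): 1, (-1, 0): 3, (-1, 1): 1,
    (0, -1): 3,  (0, 0): 0,  (0, 1): 3,
    (1, -1): 1,  (1, 0): 3,  (1, 1): 1,
}
joint = {xy: Fraction(k, 16) for xy, k in sixteenths.items()}


def variance(pmf):
    """Var = E(V^2) - [E(V)]^2 for a pmf given as {value: probability}."""
    mean = sum(v * p for v, p in pmf.items())
    return sum(v * v * p for v, p in pmf.items()) - mean ** 2


# Route 1: build the pmf of Z = X + Y by adding up the cells with the same sum.
pmf_z = {}
for (x, y), p in joint.items():
    pmf_z[x + y] = pmf_z.get(x + y, 0) + p
var_z_direct = variance(pmf_z)

# Route 2: Var(X) + Var(Y) + 2 Cov(X, Y) from the marginals and the joint table.
f_x = {x: sum(joint[(x, y)] for y in values) for x in values}
f_y = {y: sum(joint[(x, y)] for x in values) for y in values}
e_x = sum(x * p for x, p in f_x.items())
e_y = sum(y * p for y, p in f_y.items())
cov_xy = sum(x * y * p for (x, y), p in joint.items()) - e_x * e_y
var_z_formula = variance(f_x) + variance(f_y) + 2 * cov_xy

print(pmf_z, var_z_direct, var_z_formula)
```

Both routes give 5/4, which is the 20/16 = 1.25 obtained on the slides.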
Correlation Coefficient
It is difficult to know whether a covariance is large or small because its value
depends upon the scale of measurement.
\[
\begin{aligned}
\operatorname{Cov}(aX, bY) &= E\{[aX - E(aX)][bY - E(bY)]\} \\
                           &= E\{[a(X - E(X))][b(Y - E(Y))]\} \\
                           &= ab\,E\{[X - E(X)][Y - E(Y)]\} \\
                           &= ab\operatorname{Cov}(X, Y)
\end{aligned}
\]
Correlation Coefficient
This problem can be eliminated by standardizing its value and using the
correlation coefficient, ρ:
\[
\text{Correlation Coefficient, } \rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
\]
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y respectively.
\[
-1 \leq \rho \leq 1
\]
When ρ = 1, then we have perfect correlation, with all points falling on a
straight line with positive slope. ρ = 0 implies zero covariance and no
correlation.
ρ < 0 implies that Y decreases as X increases, with ρ = −1 implying perfect
correlation with all points falling on a straight line with negative slope.
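The contrast between covariance and correlation under a change of units can be seen in a small example. Everything specific in the sketch below is hypothetical: the joint distribution, the rescaling constants a and b (both positive), and the helper name `covariance_and_correlation`.

```python
import math


def covariance_and_correlation(joint):
    """Return (Cov(X, Y), rho) for a joint pmf given as {(x, y): probability}."""
    e_x = sum(x * p for (x, y), p in joint.items())
    e_y = sum(y * p for (x, y), p in joint.items())
    cov = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())
    var_x = sum((x - e_x) ** 2 * p for (x, y), p in joint.items())
    var_y = sum((y - e_y) ** 2 * p for (x, y), p in joint.items())
    return cov, cov / math.sqrt(var_x * var_y)


# Hypothetical joint pmf in which X and Y tend to move together.
joint = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}

# Change the units of measurement: X -> aX and Y -> bY.
a, b = 10.0, 2.0
rescaled = {(a * x, b * y): p for (x, y), p in joint.items()}

cov, rho = covariance_and_correlation(joint)
cov_scaled, rho_scaled = covariance_and_correlation(rescaled)
print(cov, cov_scaled, rho, rho_scaled)
```

The covariance is multiplied by ab when the units change, while the correlation coefficient stays the same, which is why ρ is easier to interpret. With a negative a or b the sign of ρ would flip, so the invariance shown here is for positive rescaling only.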
Correlation Coefficient: Things to Be Aware Of
The correlation coefficient ρ is a measure of linear dependence only. It
does not necessarily indicate if there is a systematic relationship between
two variables X and Y .
As with covariance, if X and Y are independent, then ρ = 0. But ρ = 0
does not mean that X and Y are independent.
ρ might be large but this does not mean that Y necessarily changes by a
lot when X changes by a lot, or vice versa.
Figure: Correlation vs Responsiveness. Panels (a) and (b) each plot Y against X.
Correlation Coefficient: Things to Be Aware Of
The strength of the correlation coefficient ρ depends on (i) the strength
of the linear relationship between X and Y AND (ii) the variance of X
and Y . This is obviously problematic if we want to measure the strength
of the linear relationship between X and Y .
Figure: Correlation Coefficient and Variation in X. Two panels, each with the
axis labels "member of congress", "foreign policy" and "civil rights".
Conditional Expectations
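If X and Y are any two random variables, then the conditional expectation of
Y given that X = x is defined to be:
\[
E(Y \mid X = x) =
\begin{cases}
\displaystyle\sum_{i} y_i\, f_{Y|X}(y_i \mid x) & \text{in the discrete case} \\[2ex]
\displaystyle\int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\,dy & \text{in the continuous case}
\end{cases}
\]
In terms of the example just given, the conditional expectation of Y given that
X is equal to -1 is
\[
E(Y \mid X = -1) = -1 \times \frac{1/16}{5/16} + 0 + 1 \times \frac{1/16}{5/16} = 0
\]
Also,
\[
E(Y \mid X = 0) = -1 \times \frac{3/16}{6/16} + 0 + 1 \times \frac{3/16}{6/16} = 0
\]
and
\[
E(Y \mid X = 1) = -1 \times \frac{1/16}{5/16} + 0 + 1 \times \frac{1/16}{5/16} = 0
\]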
Conditional Expectations
Conditional expectations have the following properties:
(1) $E(Y \mid X = x) = E(Y)$ when X and Y are independent
and
(2) $E(Y) = \int_{-\infty}^{\infty} E(Y \mid X = x)\, f_X(x)\,dx = E[E(Y \mid X)]$
where on the right-hand side the inside expectation is with respect to the
conditional distribution of Y given X and the outside expectation is with
respect to the distribution of X.
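For the discrete joint table used earlier, the conditional expectations and the second property can be checked exactly, with the integral replaced by a sum over the values of X. The sketch below re-enters that table; the function name `conditional_expectation_of_y` is an illustrative choice.

```python
from fractions import Fraction

values = (-1, 0, 1)
sixteenths = {
    (-1, -1): 1, (-1, 0): 3, (-1, 1): 1,
    (0, -1): 3,  (0, 0): 0,  (0, 1): 3,
    (1, -1): 1,  (1, 0): 3,  (1, 1): 1,
}
joint = {xy: Fraction(k, 16) for xy, k in sixteenths.items()}

# Marginal distribution of X.
f_x = {x: sum(joint[(x, y)] for y in values) for x in values}


def conditional_expectation_of_y(x):
    """E(Y | X = x) = sum_y y * f(x, y) / f_X(x)."""
    return sum(y * joint[(x, y)] / f_x[x] for y in values)


cond_means = {x: conditional_expectation_of_y(x) for x in values}

# Property (2) in discrete form: E(Y) = sum_x E(Y | X = x) f_X(x).
e_y_iterated = sum(cond_means[x] * f_x[x] for x in values)
e_y_direct = sum(y * p for (x, y), p in joint.items())

print(cond_means, e_y_iterated, e_y_direct)
```

All three conditional means are zero here, so the iterated expectation and the direct calculation of E(Y) agree at zero. Note that property (1) runs in one direction only: in this table E(Y | X = x) = E(Y) for every x even though X and Y are not independent.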
Conditional Expectations
Example: The average travel time to a city is c hours by car or b hours by bus.
A woman cannot decide whether to drive or take the bus, and so tosses a coin.
What is her expected travel time?
We are dealing with the joint distribution of the outcome of the toss, X, and
the travel time Y, where $Y = Y_{car}$ if X = 0 and $Y = Y_{bus}$ if X = 1. Let's
assume that the outcome of the toss X is independent of the travel times
$Y_{car}$ and $Y_{bus}$, so that by the first property above we have:
\[
E(Y \mid X = 0) = E(Y_{car} \mid X = 0) = E(Y_{car}) = c
\]
\[
E(Y \mid X = 1) = E(Y_{bus} \mid X = 1) = E(Y_{bus}) = b
\]
Then by the second property above, with P(X = 0) = P(X = 1) = 1/2 for a fair coin,
\[
E(Y) = E(Y \mid X = 0)P(X = 0) + E(Y \mid X = 1)P(X = 1) = \frac{c + b}{2}
\]
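A numerical version of the travel-time example may make the two properties more concrete. The travel-time distributions below are invented purely for illustration (the lecture only specifies the means c and b), and the coin is assumed to be fair.

```python
from fractions import Fraction

# Hypothetical travel-time distributions in hours; only their means matter here.
car_pmf = {2: Fraction(1, 2), 4: Fraction(1, 2)}                     # mean c = 3
bus_pmf = {4: Fraction(1, 4), 5: Fraction(1, 2), 6: Fraction(1, 4)}  # mean b = 5

p_car = Fraction(1, 2)  # P(X = 0): coin says "drive" (fair coin assumed)
p_bus = 1 - p_car       # P(X = 1): coin says "take the bus"


def mean(pmf):
    """Expected value of a pmf given as {value: probability}."""
    return sum(v * p for v, p in pmf.items())


c, b = mean(car_pmf), mean(bus_pmf)

# Second property: E(Y) = E(Y | X = 0) P(X = 0) + E(Y | X = 1) P(X = 1).
e_y_iterated = c * p_car + b * p_bus

# Direct route: build the unconditional pmf of Y as a mixture and take its mean.
mixture = {}
for pmf, weight in ((car_pmf, p_car), (bus_pmf, p_bus)):
    for hours, p in pmf.items():
        mixture[hours] = mixture.get(hours, 0) + weight * p
e_y_direct = mean(mixture)

print(c, b, e_y_iterated, e_y_direct)
```

With these made-up numbers both routes give (c + b) / 2 = 4 hours.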
Conditional Variances
If X and Y are any two random variables, then the conditional variance of Y
given that X = x is defined to be:
\[
\begin{aligned}
V(Y \mid X = x) &= E[(Y - E(Y \mid X = x))^{2} \mid X = x] \\
                &= E(Y^{2} \mid X = x) - [E(Y \mid X = x)]^{2}
\end{aligned}
\]
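Applied to the joint table from the covariance example, the two expressions for the conditional variance can be compared directly. This is a sketch with illustrative names; it re-enters the table and computes V(Y | X = x) both ways for each value of x.

```python
from fractions import Fraction

values = (-1, 0, 1)
sixteenths = {
    (-1, -1): 1, (-1, 0): 3, (-1, 1): 1,
    (0, -1): 3,  (0, 0): 0,  (0, 1): 3,
    (1, -1): 1,  (1, 0): 3,  (1, 1): 1,
}
joint = {xy: Fraction(k, 16) for xy, k in sixteenths.items()}
f_x = {x: sum(joint[(x, y)] for y in values) for x in values}


def conditional_pmf_of_y(x):
    """f(y | x) = f(x, y) / f_X(x)."""
    return {y: joint[(x, y)] / f_x[x] for y in values}


def conditional_variance_definition(x):
    """E[(Y - E(Y | X = x))^2 | X = x]."""
    cond = conditional_pmf_of_y(x)
    cond_mean = sum(y * p for y, p in cond.items())
    return sum((y - cond_mean) ** 2 * p for y, p in cond.items())


def conditional_variance_shortcut(x):
    """E(Y^2 | X = x) - [E(Y | X = x)]^2."""
    cond = conditional_pmf_of_y(x)
    cond_mean = sum(y * p for y, p in cond.items())
    return sum(y * y * p for y, p in cond.items()) - cond_mean ** 2


cond_vars = {x: conditional_variance_shortcut(x) for x in values}
print(cond_vars)
```

The conditional variance is 2/5 when X is -1 or 1 and 1 when X is 0. The spread of Y therefore changes with X even though the conditional mean does not, which is another way of seeing that X and Y are dependent despite having zero covariance.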
Summary
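Mathematical
Expectations   Population                                                    Sample

Mean           $\mu = E(X)$                                                  $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$

Variance       $\sigma^2 = E[(X - \mu)^2]$                                   $s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2$

Covariance     $\mathrm{Cov}_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$               $\mathrm{Cov}_{XY} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N-1}$

Correlation    $\rho_{XY} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$   $r_{XY} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{(N-1)\,s_X s_Y}$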
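The sample formulas in the summary table translate directly into code. The data below are made up for illustration, and the function names are hypothetical; the sketch follows the table's N − 1 divisors for the variance, covariance and correlation.

```python
import math

# Hypothetical paired observations (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sample_mean(data):
    """X-bar = (1 / N) * sum of the observations."""
    return sum(data) / len(data)


def sample_covariance(a, b):
    """Sum of cross-deviations divided by N - 1."""
    a_bar, b_bar = sample_mean(a), sample_mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def sample_variance(data):
    """s^2: the sample covariance of a variable with itself."""
    return sample_covariance(data, data)


def sample_correlation(a, b):
    """r = sample covariance divided by the product of sample standard deviations."""
    return sample_covariance(a, b) / math.sqrt(sample_variance(a) * sample_variance(b))


x_bar, y_bar = sample_mean(xs), sample_mean(ys)
s2_x, s2_y = sample_variance(xs), sample_variance(ys)
cov_xy = sample_covariance(xs, ys)
r_xy = sample_correlation(xs, ys)
print(x_bar, y_bar, s2_x, s2_y, cov_xy, round(r_xy, 4))
```

Writing the variance as the covariance of a variable with itself mirrors the population definitions, where Cov(X, X) = Var(X).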