
1 Expected Values of Random Variables

Often it is convenient to describe a random variable using some location measure. The most important
location measure is the expected value (a.k.a. the mean or the weighted average). For a discrete random
variable X its expected value, denoted E[X], is given by

E[X] = Σ_k x_k p(x_k).

For a continuous random variable

E[X] = ∫_{−∞}^{∞} x f(x) dx.

Remark: Strictly speaking E[X] exists only if both E[X^+] and E[X^−] exist and are not both ∞, where
X^+ = max(X, 0) and X^− = max(−X, 0).

Example: Suppose X takes values {0, 1, 2, 3} with probabilities {1/8, 3/8, 3/8, 1/8}. Then, E[X] = 0(1/8)+
1(3/8) + 2(3/8) + 3(1/8) = 12/8 = 1.5.
Remark: E[X] need not be a possible outcome of X.
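The weighted-sum definition above can be sketched in a few lines of Python, using the values and probabilities from the example:

```python
# Expected value of a discrete random variable as a weighted sum.
# Values and probabilities taken from the example above.
xs = [0, 1, 2, 3]
ps = [1/8, 3/8, 3/8, 1/8]

ex = sum(x * p for x, p in zip(xs, ps))
print(ex)  # 1.5 -- note that 1.5 is not itself a possible outcome of X
```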
Frequency, or long run average, interpretation of the expected value: Example: Suppose a random
variable X represents the profit associated with the production of some item that can be defective or non-
defective. Suppose that the profit is −2 when the item is defective and 10 when the item is non-defective.
Finally, assume that p(−2) = 0.1, p(10) = 0.9. Then E[X] = −2(0.1) + 10(0.9) = 8.8. Suppose that a very
large number n of items is produced and let n(G) be the number of good items and n(D) the number of
defective items. Then the average profit per item is

−2 n(D)/n + 10 n(G)/n.

The frequency interpretation is that n(G)/n converges, in a sense to be defined later, to p(10) = 0.9. This
convergence is known as the Law of Large Numbers.
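The long run average interpretation can be seen numerically. A minimal simulation sketch, using the defective/non-defective profit example above (the sample size and seed are arbitrary choices):

```python
import random

random.seed(0)
n = 100_000
# Simulate n items: defective with probability 0.1 (profit -2),
# good with probability 0.9 (profit 10).
profits = [-2 if random.random() < 0.1 else 10 for _ in range(n)]

# By the Law of Large Numbers the sample average approaches E[X] = 8.8.
avg = sum(profits) / n
print(avg)
```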

1.1 Markov’s Inequality


Proposition: Suppose X is a nonnegative random variable and c > 0. Then

P(X ≥ c) ≤ E[X]/c.

Notice that the inequality is non-trivial only if c > E[X]. Proof (Discrete case):

E[X] = Σ_{k≥0} k p(k) ≥ Σ_{k≥c} k p(k) ≥ c Σ_{k≥c} p(k) = c P(X ≥ c).

Alternative form: take c = kE[X]; then

P(X ≥ kE[X]) ≤ 1/k.

Relevant for k ≥ 1.
Example: The expected life L of a machine is 10 years. Then

P(L ≥ 20) ≤ 10/20 = 0.5.
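Markov's inequality uses only the mean, so it holds regardless of the lifetime distribution. As an illustration-only sketch, we can pick one particular distribution with mean 10 (an exponential, which is an assumption, not part of the example) and check that the bound holds:

```python
import random

random.seed(1)
# Markov's inequality needs only E[L] = 10; the actual lifetime distribution
# is not specified, so an exponential with mean 10 is assumed purely for
# illustration.
n = 100_000
samples = [random.expovariate(1 / 10) for _ in range(n)]

p_ge_20 = sum(s >= 20 for s in samples) / n
bound = 10 / 20
# For the exponential the true probability is exp(-2) ~ 0.135,
# well under the Markov bound of 0.5.
print(p_ge_20, bound)
```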

1.2 Expected Value of Functions of Random Variables


Suppose X is a discrete random variable and G is a function of X; then by definition

E[G(X)] = Σ_k g_k p_G(g_k)

where p_G is the probability mass function of the random variable G(X). This requires computing p_G, the
probability mass function of G(X).
Example: Suppose X takes values {−2, −1, 0, 1, 2} with probabilities {1/8, 1/4, 1/4, 1/4, 1/8} and Y = X².
Then Y takes values {0, 1, 4} with probabilities {1/4, 1/2, 1/4}. Therefore,

E[Y ] = E[X 2 ] = 0(1/4) + 1(1/2) + 4(1/4) = 1.5.

Is there an easier way?


Proposition: If X is a discrete random variable and G is a function of X, then

E[G(X)] = Σ_k G(x_k) p(x_k),     (1)

where p(x_k) = P(X = x_k). Example: Applying formula (1),

E[X 2 ] = (−2)2 (1/8) + (−1)2 (1/4) + 02 (1/4) + 12 (1/4) + 22 (1/8) = 4/8 + 1/4 + 0 + 1/4 + 4/8 = 1.5.
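Both routes to E[X²] can be compared directly in a short Python sketch: first via the pmf of Y = X², then via formula (1), which skips that step:

```python
# Two ways to compute E[X^2] for the example above.
xs = [-2, -1, 0, 1, 2]
ps = [1/8, 1/4, 1/4, 1/4, 1/8]

# Way 1: first build the pmf of Y = X^2, then take its expected value.
pY = {}
for x, p in zip(xs, ps):
    pY[x * x] = pY.get(x * x, 0) + p
e1 = sum(y * p for y, p in pY.items())

# Way 2: formula (1) -- no need to compute the pmf of Y at all.
e2 = sum(x * x * p for x, p in zip(xs, ps))

print(e1, e2)  # both 1.5
```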

A similar formula exists for continuous random variables:

E[G(X)] = ∫_{−∞}^{∞} G(x) f(x) dx.     (2)

1.3 Variance
Consider G(X) = (X − E[X])². Notice that G(X) is a r.v. so we can calculate its expected value.
Definition: E[(X − E[X])²] is known as the variance of X and written Var[X] or σ².
For discrete random variables we have

Var[X] = E[G(X)] = Σ_k (x_k − E[X])² p(x_k),

and for continuous random variables we have

Var[X] = E[G(X)] = ∫_{−∞}^{∞} (x − E[X])² f(x) dx.

Note σ = +√Var[X] is called the standard deviation of X.
Variance as a measure of spread of a distribution: Example: p(a) = p(−a) = 0.5; then E[X] = 0,
Var[X] = a², and σ_X = a. In this case the standard deviation gives us a good idea of how the distribution is
spread around the mean. Example: p(0) = 1 − 1/a, p(a) = 1/a. Then E[X] = 1 and Var[X] = a − 1. In this
example the variance increases to infinity as a → ∞ while the random variable becomes more and more concentrated
at zero.
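The two examples above tell contrasting stories about spread, which a small sketch makes concrete (the value a = 100 is an arbitrary choice for illustration):

```python
# Compare the two spread examples above for a particular choice of a.
def mean_var(pairs):
    """Mean and variance of a discrete r.v. given (value, probability) pairs."""
    m = sum(x * p for x, p in pairs)
    return m, sum((x - m) ** 2 * p for x, p in pairs)

a = 100
# Example 1: p(a) = p(-a) = 0.5  ->  mean 0, variance a^2, std dev a.
m1, v1 = mean_var([(a, 0.5), (-a, 0.5)])
# Example 2: p(0) = 1 - 1/a, p(a) = 1/a  ->  mean 1, variance a - 1,
# even though the mass concentrates on 0 as a grows.
m2, v2 = mean_var([(0, 1 - 1/a), (a, 1/a)])

print(m1, v1, m2, v2)
```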

2 Useful Properties of Expected Values


• E[aX + b] = aE[X] + b

• Var[aX + b] = a²Var[X]

• Var[X] = E[X²] − (E[X])².
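These three identities can be checked numerically. A sketch using the four-valued X from the first example (the constants a = 3, b = 7 are arbitrary):

```python
# Numerical check of the three properties above on a small discrete X.
xs = [0, 1, 2, 3]
ps = [1/8, 3/8, 3/8, 1/8]

ex = sum(x * p for x, p in zip(xs, ps))
ex2 = sum(x * x * p for x, p in zip(xs, ps))
var = sum((x - ex) ** 2 * p for x, p in zip(xs, ps))
# Var[X] = E[X^2] - (E[X])^2
assert abs(var - (ex2 - ex ** 2)) < 1e-12

a, b = 3, 7
e_lin = sum((a * x + b) * p for x, p in zip(xs, ps))
var_lin = sum((a * x + b - e_lin) ** 2 * p for x, p in zip(xs, ps))
assert abs(e_lin - (a * ex + b)) < 1e-12      # E[aX + b] = aE[X] + b
assert abs(var_lin - a * a * var) < 1e-12     # Var[aX + b] = a^2 Var[X]

print(ex, var)
```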

2.1 Tchebychev’s Inequality


Proposition: Suppose µ = E[X] and σ² = Var[X] exist. Then

P(|X − µ| ≥ a) ≤ σ²/a².

Proof: Let Y = (X − µ)². Then Y ≥ 0. Therefore Markov’s inequality applies and

P((X − µ)² ≥ a²) ≤ E[(X − µ)²]/a² = σ²/a².

The result follows since P((X − µ)² ≥ a²) = P(|X − µ| ≥ a). □
Special case: for a = kσ,

P(|X − µ| ≥ kσ) ≤ 1/k².
One-sided Tchebychev’s inequality:

P(X − µ ≥ kσ) ≤ 1/(k² + 1) for all k > 0,

and

P(X − µ ≤ −kσ) ≤ 1/(k² + 1) for all k > 0.
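Both the two-sided and one-sided bounds can be verified exactly on a small discrete distribution. A sketch using the four-valued X from the first example (the choice of distribution and of the values of k is purely for illustration):

```python
import math

# Exact check of Tchebychev's inequality and its one-sided version
# on the discrete X with values 0..3 and probs 1/8, 3/8, 3/8, 1/8.
xs = [0, 1, 2, 3]
ps = [1/8, 3/8, 3/8, 1/8]
mu = sum(x * p for x, p in zip(xs, ps))                       # 1.5
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in zip(xs, ps)))

for k in (1.0, 1.5, 2.0):
    two_sided = sum(p for x, p in zip(xs, ps) if abs(x - mu) >= k * sigma)
    one_sided = sum(p for x, p in zip(xs, ps) if x - mu >= k * sigma)
    assert two_sided <= 1 / k ** 2 + 1e-12        # P(|X - mu| >= k sigma)
    assert one_sided <= 1 / (k ** 2 + 1) + 1e-12  # P(X - mu >= k sigma)
```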

3 Expectations of Functions of Random Variables


Let X and Y be two random variables, and let Z = g(X, Y) be a real-valued function. Then Z is also a
random variable and we may be interested in computing E[Z]. In principle we can do this by first computing
the probability mass function (in the discrete case) or the density function (in the continuous case) of Z
and then using the definition of expectation. Fortunately, there is no need to do this, as the following formulas
allow a direct calculation of E[Z]. In the discrete case

E[g(X, Y)] = Σ_x Σ_y g(x, y) p_{X,Y}(x, y).

In the continuous case

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy.

If g(X, Y ) = aX + bY the above formulas lead to

E[aX + bY ] = aE[X] + bE[Y ].
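Notably, this linearity holds with no independence assumption. A quick sketch using, for illustration, X = number of heads in the first two of three fair coin tosses and Y = the number in all three (a clearly dependent pair), with arbitrary constants a, b:

```python
import itertools

# Linearity of expectation for a dependent pair.
# X = heads in first two tosses, Y = heads in all three; 8 equally likely outcomes.
outcomes = list(itertools.product([0, 1], repeat=3))

a, b = 2, -3
lhs = sum(a * (t[0] + t[1]) + b * sum(t) for t in outcomes) / 8
rhs = a * 1.0 + b * 1.5   # a E[X] + b E[Y], with E[X] = 1 and E[Y] = 1.5

print(lhs, rhs)  # both -2.5
```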

3.1 Covariance and Independence


One very important function is g(X, Y) = (X − E[X])(Y − E[Y]), the product of the deviations of the
random variables from their means. The expectation of this random variable is called the
covariance of X and Y and is denoted Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. It can be shown that

|Cov(X, Y )| ≤ σx σy .

The term

ρ = Cov(X, Y)/(σx σy)

is known as the correlation coefficient. Notice that |ρ| ≤ 1.
Example: If X is the number of heads in the first two tosses of a fair coin and Y is the number of heads in
the first three tosses, then E[X] = 1, E[Y] = 1.5, Var[X] = 0.5, Var[Y] = 0.75, Cov(X, Y) = 0.5 and ρ ≈ 0.82.
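These values can be recovered by enumerating the eight equally likely outcomes of three tosses, a minimal sketch of the definitions above:

```python
import itertools, math

# Enumerate the 8 equally likely outcomes of three fair coin tosses.
# X = heads in the first two tosses, Y = heads in all three.
outcomes = list(itertools.product([0, 1], repeat=3))
X = [t[0] + t[1] for t in outcomes]
Y = [sum(t) for t in outcomes]

ex = sum(X) / 8
ey = sum(Y) / 8
vx = sum((x - ex) ** 2 for x in X) / 8
vy = sum((y - ey) ** 2 for y in Y) / 8
cov = sum((x - ex) * (y - ey) for x, y in zip(X, Y)) / 8
rho = cov / math.sqrt(vx * vy)

print(ex, ey, vx, vy, cov, round(rho, 2))  # 1.0 1.5 0.5 0.75 0.5 0.82
```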
It is easy to see that
Cov(aX + bY, X) = aCov(X, X) + bCov(Y, X).

3.2 Independent Random Variables


Earlier we defined conditional probabilities p(x|y) = p(x, y)/pY (y) in the discrete case and f (x|y) =
f (x, y)/fY (y) in the continuous case. Just as we defined independent events we can define independent random
variables as those for which the conditional probability is equal to the unconditional probability. In other
words, knowing Y = y does not change the distribution of X, so p(x|y) = p(x) for all y such that pY (y) > 0
and similarly, f (x|y) = f (x) for all y such that fY (y) > 0. Just as we did with events, we can use a symmetric
definition and say that random variables X and Y are independent if p(x, y) = pX (x)pY (y) in the discrete
case and f (x, y) = fX (x)fY (y) in the continuous case.
The above seems to indicate that we need a separate definition of independence for discrete and continuous
random variables but this is not so as the following definition of independence encompasses both the discrete
and the continuous cases:
X and Y are said to be independent if

F (x, y) = FX (x)FY (y)

where F (x, y) = P (X ≤ x, Y ≤ y) is the joint cumulative distribution function and FX (x) = P (X ≤ x) and
FY (y) = P (Y ≤ y) are the marginal cumulative distribution functions.
Independence is a very important concept. It is easy to see that independent random variables are
uncorrelated, i.e., have zero correlation. To see this notice that
Cov(X, Y) = Σ_x Σ_y (x − E[X])(y − E[Y]) p(x, y)
          = Σ_x Σ_y (x − E[X])(y − E[Y]) pX(x) pY(y)
          = [Σ_x (x − E[X]) pX(x)] [Σ_y (y − E[Y]) pY(y)]
          = (E[X] − E[X])(E[Y] − E[Y])
          = 0.

However, uncorrelated random variables are not necessarily independent as shown in the following ex-
ample.
Example: Toss two coins. Let Xi be the number of heads in the ith toss. Let D = X1 −X2 and S = X1 +X2 .
Are D and S independent? No, because if D = 0 then S ≠ 1. Notice that E[D] = 0; consequently

Cov(S, D) = E[SD] − E[S]E[D] = E[SD].

Now SD = (X1 + X2)(X1 − X2) = X1² − X2² = X1 − X2 = D (since Xi² = Xi for 0/1 variables), so E[SD] = E[D] = 0 and S and D are uncorrelated.
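A short sketch verifies both halves of the example: zero covariance, yet a joint probability that factors incorrectly, which is exactly the failure of independence:

```python
import itertools

# S and D from two fair coin tosses: uncorrelated yet clearly dependent.
outcomes = list(itertools.product([0, 1], repeat=2))  # 4 equally likely
S = [a + b for a, b in outcomes]
D = [a - b for a, b in outcomes]

es = sum(S) / 4
ed = sum(D) / 4
cov = sum((s - es) * (d - ed) for s, d in zip(S, D)) / 4
print(cov)  # 0.0 -- uncorrelated

# Dependence: P(S = 1 and D = 0) = 0, yet P(S = 1) * P(D = 0) = 0.25 > 0.
p_joint = sum(1 for s, d in zip(S, D) if s == 1 and d == 0) / 4
p_s1 = sum(1 for s in S if s == 1) / 4
p_d0 = sum(1 for d in D if d == 0) / 4
assert p_joint == 0 and p_s1 * p_d0 == 0.25
```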

4 Variance of Linear Combinations of Random Variables


Notice that when Y is replaced by X in the covariance we get the expected squared deviation of X from its
mean: Cov(X, X) = Var[X].
The variance of aX + bY can be written as

Var(aX + bY) = Cov(aX + bY, aX + bY)
             = a²Var(X) + 2abCov(X, Y) + b²Var(Y).

Example:

Var[X − Y] = Var[X] − 2Cov(X, Y) + Var[Y]
           = σx² − 2ρσx σy + σy².

Example: What if σx = σy = σ? Then Var[X − Y] = 2σ²(1 − ρ).
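The variance formula for X − Y can be checked exactly on the coin-toss pair from the covariance example (X = heads in the first two of three fair tosses, Y = heads in all three), where X − Y is just minus the third toss:

```python
import itertools

# Check Var[X - Y] = Var[X] - 2 Cov(X, Y) + Var[Y] by enumeration.
outcomes = list(itertools.product([0, 1], repeat=3))
X = [t[0] + t[1] for t in outcomes]
Y = [sum(t) for t in outcomes]

def var(Z):
    m = sum(Z) / 8
    return sum((z - m) ** 2 for z in Z) / 8

ex = sum(X) / 8
ey = sum(Y) / 8
cov = sum((x - ex) * (y - ey) for x, y in zip(X, Y)) / 8

lhs = var([x - y for x, y in zip(X, Y)])   # direct: X - Y = -(third toss)
rhs = var(X) - 2 * cov + var(Y)
print(lhs, rhs)  # both 0.25
```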


Example: What if, in addition, ρ = 1? Then Var[X − Y] = 0, so X − Y is deterministic, i.e., Y = a + X for some
constant a.
