
Probability Theory Review 2

Random Variables and Distributions

Random Variable (Discrete)


A random variable is a variable whose possible values are numerical outcomes of a random
phenomenon.

Note: random variables are denoted with capital letters, so when you see, say, P(X = x), that
means "the probability that a random variable, (capital) X, is equal to some value, (lowercase)
x."

Expected value (mean) of a discrete random variable

E(X) = Σ_x x · P(X = x), summing over all possible values x

Variance of a discrete random variable

Var(X) = Σ_x (x − E(X))² · P(X = x) = E(X²) − (E(X))²

Example 1: A couple with two kids (X = number of boys)


Outcomes:

First child    Second child    X = number of boys
Boy            Boy             2
Boy            Girl            1
Girl           Boy             1
Girl           Girl            0

x    P(X = x)
0    1/4
1    1/2
2    1/4
Expected value:

E(X) = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1

Variance:

Var(X) = E(X²) − (E(X))² = (0 · 1/4 + 1 · 1/2 + 4 · 1/4) − 1² = 1.5 − 1 = 0.5
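If you want to check this sort of calculation numerically, here is a minimal Python sketch (ours, not part of the original notes) that enumerates the four equally likely outcomes:

from itertools import product
from collections import Counter

# Enumerate the 4 equally likely orderings of two children.
outcomes = list(product(["Boy", "Girl"], repeat=2))
counts = Counter(kids.count("Boy") for kids in outcomes)
pmf = {x: n / len(outcomes) for x, n in counts.items()}   # {2: 0.25, 1: 0.5, 0: 0.25}

mean = sum(x * p for x, p in pmf.items())                 # 1.0
var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # 0.5
print(pmf, mean, var)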

Example 2: 3 biased coin tosses (X = number of heads)


P(Heads) = 2/3
P(Tails) = 1/3

Outcomes:
First toss    Second toss    Third toss    X = number of heads
Heads         Heads          Heads         3
Heads         Heads          Tails         2
Heads         Tails          Heads         2
Tails         Heads          Heads         2
Heads         Tails          Tails         1
Tails         Heads          Tails         1
Tails         Tails          Heads         1
Tails         Tails          Tails         0

x    P(X = x)
0    (1/3)³ = 1/27
1    3 · (2/3) · (1/3)² = 6/27
2    3 · (2/3)² · (1/3) = 12/27
3    (2/3)³ = 8/27

Expected value:

E(X) = 0 · (1/27) + 1 · (6/27) + 2 · (12/27) + 3 · (8/27) = 54/27 = 2

Variance:

Var(X) = E(X²) − (E(X))² = (6/27 + 48/27 + 72/27) − 2² = 14/3 − 4 = 2/3
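The same check works for the biased coin by weighting each sequence of tosses by its probability. A minimal Python sketch (illustrative, not from the notes):

from itertools import product

p_heads, n = 2 / 3, 3
pmf = {k: 0.0 for k in range(n + 1)}
for tosses in product("HT", repeat=n):
    prob = 1.0
    for t in tosses:
        prob *= p_heads if t == "H" else 1 - p_heads
    pmf[tosses.count("H")] += prob   # add this sequence's probability to its head count

mean = sum(k * p for k, p in pmf.items())                 # 2.0
var = sum((k - mean) ** 2 * p for k, p in pmf.items())    # ~0.6667
print(pmf, mean, var)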

Random Variable (Continuous)


When a random variable is continuous, the methods we use for discrete cases break down. For
example, if we choose a number between 0 and 1 at random, P(X = x) will be 0 for all x because
there are an infinite number of possible choices. However, clearly some number must be
chosen. To deal with this we introduce the idea of likelihood, which is given by a probability
density function (PDF), f_X(x).

While we still can't say anything interesting about probability at a point, we can use the PDF to
talk about probability over a range. To do this, we take the following integral to get the
cumulative distribution function (CDF), F_X(x):

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt

Formulas for other sorts of ranges follow straightforwardly from the definition of the CDF:

P(a < X < b) = F_X(b) − F_X(a) = ∫_{a}^{b} f_X(t) dt

P(X > a) = 1 − F_X(a)

In other words, the probability that a continuous random variable lies in a range, a < X < b,
is equal to the area under the curve of the PDF between a and b.

Note: Non-strict inequality (≤, ≥) and strict inequality (<, >) are interchangeable here. (We are
taking an integral, so we are concerned with the area under a curve. Adding or removing a single
point doesn't affect the area.)

Expected value (mean) of a continuous random variable

E(X) = ∫_{−∞}^{∞} x · f_X(x) dx

Variance of a continuous random variable

Var(X) = ∫_{−∞}^{∞} (x − E(X))² · f_X(x) dx = E(X²) − (E(X))²

Example: Choosing a number at random between 0 and 1 (Uniform distribution)


Because our random variable, X, follows the uniform distribution (i.e. all values are equally likely
to be chosen), our PDF is

f_X(x) = 1 for 0 ≤ x ≤ 1

and

f_X(x) = 0 otherwise

Note that here the intuitive concept of all the numbers being equally likely to be chosen
corresponds neatly to all of them having equal likelihood as given by the PDF.
(Graph of PDF)

Integrating to get the CDF yields the following:

F_X(x) = 0 for x < 0
F_X(x) = x for 0 ≤ x ≤ 1
F_X(x) = 1 for x > 1

A couple examples of calculating probabilities using the CDF:

P(X < −2) = F_X(−2) = 0

This is intuitively obvious: if we're choosing a number between 0 and 1, we'll never
choose a number less than −2.

P(X < 5) = F_X(5) = 1

If we're choosing a number between 0 and 1, then X < 5 covers all possible choices, so it makes
sense that X < 5 has probability 1.
Note: It's not a coincidence that we got F_X(x) = 1 for x > 1; our PDF was defined so that this
would happen. If we were picking numbers from a different range we would have to define our
PDF differently. As we saw previously, the area under the PDF corresponds to the probability
that X will be in a given range. Therefore the total area under the PDF must equal 1, because X
must be some number.

Stated more formally, for all PDFs we must have

∫_{−∞}^{∞} f_X(x) dx = 1

Expectation:

E(X) = ∫_{0}^{1} x dx = 1/2

Variance:

Var(X) = ∫_{0}^{1} x² dx − (1/2)² = 1/3 − 1/4 = 1/12
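A quick Monte Carlo check of these values in Python (a sketch we added; random.random() draws from Uniform(0, 1)):

import random

# Draw many Uniform(0, 1) samples and compare against E(X) = 1/2, Var(X) = 1/12.
samples = [random.random() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # ~0.5 and ~0.0833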

Independent Random Variables


Two discrete random variables, X and Y, are defined to be independent if the following is true:

P(X = x, Y = y) = P(X = x) · P(Y = y) for all values x and y

Intuitively, this is the idea that the value of one random variable doesn't affect the value of the
other.

Properties of Expectation and Variance


(In the following equations, all k are constant values)

E(kX) = k · E(X)
E(X + k) = E(X) + k
E(X + Y) = E(X) + E(Y)
Var(kX) = k² · Var(X)
Var(X + k) = Var(X)

The following is only true if X and Y are independent:

E(XY) = E(X) · E(Y)
Var(X + Y) = Var(X) + Var(Y)
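These properties are easy to verify by simulation. A Python sketch (ours; the two fair dice are just an arbitrary choice of independent X and Y):

import random

n = 200_000
xs = [random.randint(1, 6) for _ in range(n)]   # X: a fair die
ys = [random.randint(1, 6) for _ in range(n)]   # Y: an independent fair die

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Linearity of expectation holds regardless; the last two checks rely on independence.
print(mean([x + y for x, y in zip(xs, ys)]), mean(xs) + mean(ys))   # both ~7.0
print(var([x + y for x, y in zip(xs, ys)]), var(xs) + var(ys))      # both ~5.83
print(mean([x * y for x, y in zip(xs, ys)]), mean(xs) * mean(ys))   # both ~12.25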


Binomial Distribution: X~Bi(n,p)
The binomial distribution is the distribution we've been using in our discrete cases thus far. It
describes the number of 'successes' in a series of independent trials that each have the same
two possible outcomes with fixed probabilities (such as coin flips). This distribution has two
parameters:

n - the number of trials


p - the probability of ‘success’

PMF: P(X = k) = C(n, k) · p^k · (1 − p)^(n−k), where C(n, k) counts the ways to choose which k
of the n trials succeed

For example, if we tossed a biased coin 10 times, and the probability of getting heads on
this coin was 1/3, we would say the number of heads followed Bi(10, 1/3)

(For worked out examples, see the section on discrete random variables)
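A minimal Python sketch of the PMF (our illustration, not from the notes; math.comb requires Python 3.8+):

from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bi(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(3, 10, 1 / 3))                          # ~0.2601, the most likely count
print(sum(binomial_pmf(k, 10, 1 / 3) for k in range(11)))  # 1.0: the PMF sums to 1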

Poisson Distribution: X~Pois(λ)


The Poisson distribution is used to describe how many times an event will occur in a given time
period when we know how many times it happens on average. For example, if we know that a
website gets an average of 3 views every minute, we could use the Poisson distribution to
determine the probability that it will get, say, 5 views, or 100 views in that amount of time.
The Poisson distribution takes a single parameter, λ, which is the average number of
occurrences in the given time frame.

PMF: P(X = k) = λ^k · e^(−λ) / k!

Also, if we have two independent random variables, X_1 ~ Pois(λ_1) and X_2 ~ Pois(λ_2), then
(X_1 + X_2) ~ Pois(λ_1 + λ_2)

Example: A website that gets 3 views every minute


Let’s say that we are running a website that gets on average 3 views every minute and we want
to use that information to get an idea of how likely we are to get other amounts of traffic. The
number of views per minute follows Pois(3), so from there we just plug whatever value we want
into the formula.
# of views, k P(X=k)

...

10
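The table values can be reproduced with a few lines of Python (a sketch we added, not from the notes):

from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return lam**k * exp(-lam) / factorial(k)

for k in (0, 1, 2, 3, 10):
    print(k, poisson_pmf(k, 3))   # 0.0498, 0.1494, 0.2240, 0.2240, 0.0008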

Normal Distribution: X ~ N(μ, σ²)


The normal distribution is the familiar 'bell curve' distribution. It takes two parameters:
μ - the mean of the distribution
σ² - the variance of the distribution

PDF:

f_X(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

Note: exp(x) is the same as e^x; exp(x) is just easier to read when the exponent starts getting
complicated.
(Graph of PDF, note that it is symmetric around x=μ)

CDF:

F_X(x) = ∫_{−∞}^{x} (1 / (σ√(2π))) · exp(−(t − μ)² / (2σ²)) dt

Note: this integral can't be evaluated using elementary operations; see the section on
the standard normal distribution. It does however have the following properties:

P(μ − σ < X < μ + σ) ≈ 0.68 (one sigma)

P(μ − 2σ < X < μ + 2σ) ≈ 0.95 (two sigma)

P(μ − 3σ < X < μ + 3σ) ≈ 0.997 (three sigma)

Expectation:

E(X) = μ (given as parameter)

Variance:

Var(X) = σ² (given as parameter)

Standard Normal Distribution: N(μ = 0, σ² = 1)


As mentioned above, the CDF for the normal distribution is difficult to calculate. To get around
this, we transform the distribution we're working with to the standard normal distribution. From
there we can use a table to look up the value of the CDF for the standard normal distribution, ϕ.
To transform any normal distribution, N(μ, σ²), to the standard normal distribution, N(0, 1), we
use the following procedure:

Given a random variable, X ~ N(μ, σ²), we define Y such that

Y = (X − μ) / σ

To calculate P(X < k), we use the following equation:

P(X < k) = P(Y < (k − μ) / σ) = ϕ((k − μ) / σ)

And we use a table to look up the appropriate value for ϕ to get our answer.

Example:
Let's say we have a random variable X ~ N(10, 4) and we want to calculate the probability
P(X < 12). First, we have to define our transformation (here σ² = 4, so σ = 2):

Y = (X − 10) / 2

Then we find the corresponding value in the standard normal distribution:

P(X < 12) = P(Y < (12 − 10) / 2) = ϕ(1) ≈ 0.8413
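If no table is handy, ϕ can be computed from the error function. A Python sketch (ours; math.erf is in the standard library):

from math import erf, sqrt

def phi(y):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1 + erf(y / sqrt(2)))

# P(X < 12) for X ~ N(10, 4): sigma = 2, so we look up phi((12 - 10) / 2).
print(phi(1.0))   # ~0.8413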

Central Limit Theorem


A quick definition before we dive into the theorem:
A set of random variables, {X_1, X_2, …, X_n}, is independent and identically distributed
(i.i.d.) if they are all mutually independent and they all follow the same distribution.

The central limit theorem states that given a set of n i.i.d. random variables, {X_1, X_2, …, X_n},
which are identical to some X with E(X) = μ and Var(X) = σ², then as n approaches infinity we
have

(X_1 + X_2 + … + X_n) ~ N(nμ, nσ²)

Note: for finite n this is an APPROXIMATION; it only becomes exact as n approaches infinity.

(See examples below)
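One way to see the theorem in action is to simulate it. A Python sketch (ours; the choice of Uniform(0, 1) summands is arbitrary):

import random

# Sum n i.i.d. Uniform(0, 1) variables many times; by the CLT the sums
# should be approximately N(n * 1/2, n * 1/12).
n, trials = 30, 50_000
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]
mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, n * 0.5)   # both ~15.0
print(var, n / 12)     # both ~2.5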

Markov Inequality
For a non-negative random variable, X ≥ 0, we can use the following inequality:

P(X ≥ t) ≤ E(X) / t, ∀ t > 0

Note: the upside-down A here simply reads as 'for all.' In other words, this is true for all t > 0.

(See examples below)

Chebyshev Inequality
For a random variable X with expectation E(X) and variance Var(X):

P(|X − E(X)| ≥ t) ≤ Var(X) / t², ∀ t > 0

(see examples below)

Examples

Example 1: Biased coin toss


Let's say we have a biased coin with P(Heads) = 1/3 and P(Tails) = 2/3, which we toss ten times,
and we want to find P(X > 2 · 10 · (1/3)) = P(X > 20/3), where X is the number of heads (i.e. the
probability of getting more than twice the expected number of heads). There are a number of
different ways we could go about it.

Direct calculation
We have X ~ Bi(10, 1/3), which must be an integer 0 ≤ X ≤ 10, so we have

P(X > 20/3) = P(X ≥ 7) = Σ_{k=7}^{10} C(10, k) · (1/3)^k · (2/3)^(10−k) = 1161/59049 ≈ 0.0197

Quite a bit of math, even with a relatively small set of possible values. It would be impractical to
calculate the exact value if there were a very large number of values, which is why we have
approximation/bounding techniques like the following.
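Incidentally, the exact value above is easy to grind out in a few lines of Python (our sketch, not part of the notes):

from math import comb

p, n = 1 / 3, 10
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(7, n + 1))
print(exact)   # ~0.0197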

Central limit theorem


Let's define T_i ~ Bi(1, 1/3) as the number of heads in the ith coin toss (i.e. T_i = 1 if the ith toss
came up heads and T_i = 0 if it came up tails). Then we have

X = T_1 + T_2 + … + T_10, with E(T_i) = 1/3 and Var(T_i) = (1/3)(2/3) = 2/9

Because {T_1, T_2, ..., T_n} is independent and identically distributed, we can apply the central
limit theorem:

X ≈ N(10 · (1/3), 10 · (2/9)) = N(10/3, 20/9)

Then all we have to do is transform to the standard normal distribution:

P(X > 20/3) = P(Y > (20/3 − 10/3) / √(20/9)) = P(Y > 2.24)

Finally,

P(X > 20/3) ≈ 1 − ϕ(2.24) ≈ 1 − 0.9875 = 0.0125

Markov Inequality
X ≥ 0 (we can never have a negative number of heads), so we can use the Markov inequality to
give us an upper bound here:

P(X ≥ 20/3) ≤ E(X) / (20/3) = (10/3) / (20/3) = 1/2
Chebyshev Inequality
Here E(X) = 10/3 and Var(X) = 20/9, and X > 20/3 implies |X − 10/3| > 10/3, so

P(X > 20/3) ≤ P(|X − 10/3| ≥ 10/3) ≤ (20/9) / (10/3)² = (20/9) / (100/9) = 1/5
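Putting Example 1 together, a Python sketch (ours) computing the exact probability, the CLT approximation, and both bounds side by side:

from math import comb, erf, sqrt

p, n = 1 / 3, 10
mu, var = n * p, n * p * (1 - p)   # E(X) = 10/3, Var(X) = 20/9

exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(7, n + 1))
z = (2 * mu - mu) / sqrt(var)                 # ~2.236
clt = 1 - 0.5 * (1 + erf(z / sqrt(2)))        # 1 - phi(z)
markov = mu / (2 * mu)                        # Markov bound
chebyshev = var / mu**2                       # Chebyshev bound with t = mu
print(exact, clt, markov, chebyshev)          # ~0.0197, ~0.0127, 0.5, 0.2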
Example 2: Testing algorithm runtime
Let's say that we have an algorithm with runtime T. We know the variance, but we don't know
the mean. We want to perform a series of sample runs, {T_1, T_2, …, T_n}, to approximate the
mean, but we want to know how many runs we should perform to be reasonably sure we have a
good approximation. How many runs do we need to do for there to be a 95% chance that the
population mean is within 0.5 of the sample mean?
Given:

Var(T) = σ² (a known value), E(T) = μ (unknown), and the sample mean
T̄ = (T_1 + T_2 + … + T_n) / n, with E(T̄) = μ and Var(T̄) = σ²/n

We need to find n such that:

P(T̄ − 0.5 < μ < T̄ + 0.5) ≥ 0.95

By using an equivalent expression for the range we can get the following:

P(|T̄ − μ| < 0.5) ≥ 0.95

And by swapping the direction of the inequalities we get:

P(|T̄ − μ| ≥ 0.5) ≤ 0.05

Which looks an awful lot like the Chebyshev inequality.

So it is the right form for the Chebyshev inequality! T is a continuous random variable, so
strict/non-strict inequality is irrelevant here.

From there all we have to do is find an integer value for n that satisfies the following:

(σ²/n) / 0.5² ≤ 0.05

Basic algebra yields the answer:

n ≥ σ² / (0.05 · 0.25) = 80σ²
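As a sketch of the final step in Python (ours; the value σ² = 4 is a hypothetical stand-in, since the specific variance used in the original example was lost):

from math import ceil

sigma_sq = 4.0   # HYPOTHETICAL variance; substitute the actual Var(T)
eps, alpha = 0.5, 0.05
n = ceil(sigma_sq / (alpha * eps**2))   # n >= sigma^2 / (0.05 * 0.5^2) = 80 * sigma^2
print(n)   # 320 when sigma^2 = 4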
