A variable is a symbol (A, B, x, y, etc.) that can take on any of a specified set of
values.
When the value of a variable is the outcome of a statistical experiment, that variable is
a random variable.
Generally, statisticians use a capital letter to represent a random variable and a lower-case
letter to represent one of its values. For example,
P(X = x) refers to the probability that the random variable X is equal to a particular
value, denoted by x. As an example, P(X = 1) refers to the probability that the random
variable X is equal to 1.
Probability Distributions
An example will make clear the relationship between random variables and probability
distributions. Suppose you flip a coin two times. This simple statistical experiment can have
four possible outcomes: HH, HT, TH, and TT. Now, let the variable X represent the number
of Heads that result from this experiment. The variable X can take on the values 0, 1, or 2. In
this example, X is a random variable, because its value is determined by the outcome of a
statistical experiment.
A probability distribution is a table or an equation that links each outcome of a statistical
experiment with its probability of occurrence. Consider the coin flip experiment described
above. The table below, which associates each outcome with its probability, is an example of
a probability distribution.
Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25
The above table represents the probability distribution of the random variable X. The same
distribution can also be written with a cumulative column, where P(X ≤ x) is the probability
of x heads or fewer:

Number of heads    Probability: P(X = x)    Cumulative probability: P(X ≤ x)
0                  0.25                     0.25
1                  0.50                     0.75
2                  0.25                     1.00
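Both tables can be reproduced by enumerating the four equally likely outcomes of the
experiment. The sketch below uses only the Python standard library:

```python
from itertools import product

# Enumerate the four equally likely outcomes of two coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# pmf[x] = P(X = x), where X is the number of heads.
pmf = {x: sum(o.count("H") == x for o in outcomes) / len(outcomes)
       for x in range(3)}

# cdf[x] = P(X <= x), the cumulative probability.
cdf, running = {}, 0.0
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

print(pmf)  # {0: 0.25, 1: 0.5, 2: 0.25}
print(cdf)  # {0: 0.25, 1: 0.75, 2: 1.0}
```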
Example 1
Suppose a die is tossed. What is the probability that the die will land on 5?
Solution: When a die is tossed, there are 6 possible outcomes, represented by S = { 1, 2, 3, 4,
5, 6 }. Let the random variable X be the number the die lands on; each possible value of X is
equally likely to occur. Thus, we have a uniform distribution. Therefore, P(X = 5) = 1/6.
Example 2
Suppose we repeat the die-tossing experiment described in Example 1. This time, we ask
what is the probability that the die will land on a number that is smaller than 5?
Solution: When a die is tossed, there are 6 possible outcomes represented by: S = { 1, 2, 3, 4,
5, 6 }. Each possible outcome is equally likely to occur. Thus, we have a uniform distribution.
This problem involves a cumulative probability. The probability that the die will land on a
number smaller than 5 is equal to:
P( X < 5 ) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
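Both examples can be checked numerically. The sketch below uses Python's `fractions`
module so the probabilities stay exact:

```python
from fractions import Fraction

# A fair die: a uniform distribution over S = {1, 2, 3, 4, 5, 6}.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

p_equals_5 = pmf[5]                                # Example 1: P(X = 5)
p_less_than_5 = sum(pmf[x] for x in pmf if x < 5)  # Example 2: P(X < 5)

print(p_equals_5)     # 1/6
print(p_less_than_5)  # 2/3
```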
All probability distributions can be classified as discrete probability distributions or as
continuous probability distributions, depending on whether they define probabilities
associated with discrete variables or continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150
and 250 pounds. The weight of a fire fighter would be an example of a continuous
variable, since a fire fighter's weight could take on any value between 150 and 250
pounds.
Suppose we flip a coin and count the number of heads. The number of heads could be
any integer value between 0 and plus infinity. However, it could not be just any
number in that range. We could not, for example, get 2.5 heads. Therefore, the
number of heads must be a discrete variable.
Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25
The above table represents a discrete probability distribution because it relates each value of
a discrete random variable with its probability of occurrence. In subsequent lessons, we will
cover several common discrete probability distributions.
Note: With a discrete probability distribution, each possible value of the discrete random
variable can be associated with a non-zero probability. Thus, a discrete probability
distribution can always be presented in tabular form.
The probability that a continuous random variable will assume a particular value is
zero.
Most often, the equation used to describe a continuous probability distribution is called a
probability density function. Sometimes, it is referred to as a density function, a PDF, or a
pdf. For a continuous probability distribution, the density function has the following
properties:
Since the continuous random variable is defined over a continuous range of values
(called the domain of the variable), the graph of the density function will also be
continuous over that range.
The area bounded by the curve of the density function and the x-axis is equal to 1,
when computed over the domain of the variable.
The probability that a random variable assumes a value between a and b is equal to
the area under the density function bounded by a and b.
For example, consider the probability density function shown in the graph below. Suppose we
wanted to know the probability that the random variable X was less than or equal to a. The
probability that X is less than or equal to a is equal to the area under the curve bounded by a
and minus infinity - as indicated by the shaded area.
Note: The shaded area in the graph represents the probability that the random variable X is
less than or equal to a. This is a cumulative probability. However, the probability that X is
exactly equal to a would be zero. A continuous random variable can take on an infinite
number of values. The probability that it will equal a specific value (such as a) is always zero.
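The three properties can be illustrated numerically. The sketch below uses the standard
normal density as an example pdf and approximates areas with a simple Riemann sum; the
choice of density and grid spacing are illustrative assumptions, not part of the original
example:

```python
from math import exp, pi, sqrt

def f(x):
    """Standard normal density, used here as an example pdf."""
    return exp(-x * x / 2) / sqrt(2 * pi)

dx = 0.001
grid = [-10 + i * dx for i in range(20_000)]

# Total area under the curve over (effectively) the whole domain: should be ~1.
total_area = sum(f(x) * dx for x in grid)

# Cumulative probability P(X <= a) for a = 0: the area to the left of 0.
left_area = sum(f(x) * dx for x in grid if x < 0)

print(round(total_area, 4))  # 1.0
print(round(left_area, 2))   # 0.5
```

Note that the "probability" attributed to any single grid point is f(a) * dx, which shrinks
to zero as dx does; that is the numerical counterpart of P(X = a) = 0.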
Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial
experiment. The probability distribution of a binomial random variable is called a binomial
distribution.
Suppose we flip a coin two times and count the number of heads (successes). The binomial
random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial
distribution is presented below.
Number of heads
0
1
2
Probability
0.25
0.50
0.25
The binomial probability refers to the probability that a binomial experiment results in
exactly x successes. For example, in the above table, we see that the binomial probability of
getting exactly one head in two coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the binomial formula:
Binomial Formula. Suppose a binomial experiment consists of n trials and results in x
successes. If the probability of success on an individual trial is P, then the binomial
probability is:
b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
or
b(x; n, P) = { n! / [ x! (n - x)! ] } * P^x * (1 - P)^(n - x)
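The formula translates directly into Python, using `math.comb` for nCx. A minimal sketch,
checked against the two-coin-flip table above:

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """b(x; n, p) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Two coin flips with P(success) = 0.5: should match the table above.
print([binomial_pmf(x, 2, 0.5) for x in range(3)])  # [0.25, 0.5, 0.25]
```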
Example 1
Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the
number of successes is equal to 2, and the probability of success on a single trial is 1/6 or
about 0.167. Therefore, the binomial probability is:
b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3
b(2; 5, 0.167) = 0.161
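As a quick check, the same computation in Python (using the exact p = 1/6 rather than the
rounded 0.167):

```python
from math import comb

p = 1 / 6  # probability of rolling a four on a single toss
prob = comb(5, 2) * p**2 * (1 - p)**3

print(round(prob, 3))  # 0.161
```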
We start to solve this problem by using the binomial distribution to find the
probability that a packet of 10 blades has zero defective or two defective
blades.
Let X be the number of defective blades in a packet. X has the binomial
distribution with n = 10 trials and success probability p = 0.002 .
In general, if X has the binomial distribution with n trials and a success
probability of p then
P[X = x] = n!/(x!(n-x)!) * p^x * (1-p)^(n-x)
for values of x = 0, 1, 2, ..., n
P[X = x] = 0 for any other value of x.
The probability mass function is derived by counting the number of combinations
of x objects chosen from n objects, each combination contributing x successes and
n - x failures. Or, to be more precise, a binomial random variable is the sum of n
independent and identically distributed Bernoulli trials.
X ~ Binomial( n , p )
the mean of the binomial distribution is n * p = 0.02
the variance of the binomial distribution is n * p * (1 - p) = 0.01996
the standard deviation is the square root of the variance: sqrt( n * p * (1 - p) ) =
0.1412799
P( X = 0 ) = 0.980179043351949
P( X = 2 ) = 0.0001771400795612779
The Poisson approximation to the binomial works when n is large and p is small.
Here you have n = 10 and p = 0.002.
Let Y be the number of defects in a packet of blades. Y has the Poisson
distribution, approximately, with parameter 0.02.
In general you have:
Y ~ Poisson( λ ), where λ = n * p
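The exact binomial probabilities and their Poisson approximations can be compared directly.
A sketch using only the standard library:

```python
from math import comb, exp, factorial

n, p = 10, 0.002   # packet of 10 blades, defect probability 0.002

def binom_pmf(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

lam = n * p        # Poisson parameter: 0.02

def poisson_pmf(x):
    return exp(-lam) * lam**x / factorial(x)

for x in (0, 2):
    print(x, binom_pmf(x), poisson_pmf(x))
```

For x = 0 the two values agree to about four decimal places (0.98018 vs 0.98020), which is
why the approximation is attractive when n * p is small.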
Sampling Distributions
Suppose that we draw all possible samples of size n from a given population. Suppose further
that we compute a statistic (e.g., a mean, proportion, standard deviation) for each sample. The
probability distribution of this statistic is called a sampling distribution. And the standard
deviation of this statistic is called the standard error.
If the population size is much larger than the sample size, then the sampling distribution has
roughly the same standard error, whether we sample with or without replacement. On the
other hand, if the sample represents a significant fraction (say, 1/20) of the population size,
the standard error will be meaningfully smaller when we sample without replacement.
In that case, the standard error of the mean is given by
σx = [ σ / sqrt(n) ] * sqrt[ (N - n) / (N - 1) ]
In the standard error formula, the factor sqrt[ (N - n ) / (N - 1) ] is called the finite population
correction or fpc. When the population size is very large relative to the sample size, the fpc is
approximately equal to one; and the standard error formula can be approximated by:
σx = σ / sqrt(n)
You often see this "approximate" formula in introductory statistics texts. As a general rule, it
is safe to use the approximate formula when the sample size is no bigger than 1/20 of the
population size.
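A small helper makes the rule concrete. In the sketch below, the finite population correction
is applied only when a population size N is supplied; the numbers are illustrative:

```python
from math import sqrt

def standard_error(sigma, n, N=None):
    """sigma / sqrt(n), times the finite population correction when N is given."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

# When N is 200 times the sample size, the fpc barely matters:
print(round(standard_error(20, 50), 4))          # 2.8284
print(round(standard_error(20, 50, 10_000), 4))  # 2.8215
```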
Requirements for accuracy. The more closely the sampling distribution needs to
resemble a normal distribution, the more sample points will be required.
The shape of the underlying population. The more closely the original population
resembles a normal distribution, the fewer sample points will be required.
In practice, some statisticians say that a sample size of 30 is large enough when the
population distribution is roughly bell-shaped. Others recommend a sample size of at least
40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple
peaks, and/or has outliers), researchers like the sample size to be even larger.
If the sample size is large, use the normal distribution. (See the discussion above in
the section on the Central Limit Theorem to understand what is meant by a "large"
sample.)
In practice, researchers employ a mix of the above guidelines. On this site, we use the normal
distribution when the population standard deviation is known and the sample size is large. We
might use either distribution when standard deviation is unknown and the sample size is very
large. We use the t-distribution when the sample size is small, unless the underlying
distribution is not normal. The t distribution should not be used with small samples from
populations that are not approximately normal.
The normal calculator solves common statistical problems, based on the normal distribution.
The calculator computes cumulative probabilities, based on three simple inputs. Simple
instructions guide you to an accurate solution, quickly and easily. If anything is unclear,
frequently-asked questions and sample problems provide straightforward explanations. The
calculator is free. It can be found under the Stat Tables tab, which appears in the header of
every Stat Trek web page.
Normal Calculator
Example 1
Assume that a school district has 10,000 6th graders. In this district, the average weight of a
6th grader is 80 pounds, with a standard deviation of 20 pounds. Suppose you draw a random
sample of 50 students. What is the probability that the average weight of a sampled student
will be less than 75 pounds?
Solution: To solve this problem, we need to define the sampling distribution of the mean.
Because our sample size is greater than 30, the Central Limit Theorem tells us that the
sampling distribution will approximate a normal distribution.
To define our normal distribution, we need to know both the mean of the sampling
distribution and the standard deviation. Finding the mean of the sampling distribution is easy,
since it is equal to the mean of the population. Thus, the mean of the sampling distribution is
equal to 80.
The standard deviation of the sampling distribution can be computed using the following
formula.
σx = [ σ / sqrt(n) ] * sqrt[ (N - n) / (N - 1) ]
σx = [ 20 / sqrt(50) ] * sqrt[ (10,000 - 50) / (10,000 - 1) ] = (20/7.071) * (0.9975) = 2.82
Let's review what we know and what we want to know. We know that the sampling
distribution of the mean is normally distributed with a mean of 80 and a standard deviation of
2.82. We want to know the probability that a sample mean is less than or equal to 75 pounds.
Because we know the population standard deviation and the sample size is large, we'll use the
normal distribution to find the probability. To solve the problem, we plug these inputs into the
Normal Probability Calculator: mean = 80, standard deviation = 2.82, and normal random
variable = 75. The Calculator tells us that the probability that the average weight of a sampled
student is less than 75 pounds is equal to 0.038.
Note: Since the population size is more than 20 times greater than the sample size, we could
have used the "approximate" formula σx = σ / sqrt(n) to compute the standard error. Had
we done that, we would have found a standard error equal to 20 / sqrt(50), or 2.83.
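The whole example can be reproduced without a statistical calculator, since the normal CDF
has a closed form in terms of `math.erf`. A sketch using the numbers from the problem above:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Standard error of the mean, with the finite population correction.
se = (20 / sqrt(50)) * sqrt((10_000 - 50) / (10_000 - 1))

print(round(se, 2))                      # 2.82
print(round(normal_cdf(75, 80, se), 3))  # 0.038
```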
Example 2
Find the probability that of the next 120 births, no more than 40% will be boys. Assume equal
probabilities for the births of boys and girls. Assume also that the number of births in the
population (N) is very large, essentially infinite.
Solution: The Central Limit Theorem tells us that the proportion of boys in 120 births will be
approximately normally distributed.
The mean of the sampling distribution will be equal to the mean of the population
distribution. In the population, half of the births result in boys; and half, in girls. Therefore,
the probability of boy births in the population is 0.50. Thus, the mean proportion in the
sampling distribution should also be 0.50.
The standard deviation of the sampling distribution (i.e., the standard error) can be computed
using the following formula.
σp = sqrt[ PQ/n ] * sqrt[ (N - n) / (N - 1) ]
where P is the population proportion and Q = 1 - P. Here, the finite population correction is
equal to 1.0, since the population size (N) was assumed to be infinite. Therefore, the standard
error formula reduces to:
σp = sqrt[ PQ/n ]
σp = sqrt[ (0.5)(0.5)/120 ] = sqrt[ 0.25/120 ] = 0.04564
Let's review what we know and what we want to know. We know that the sampling
distribution of the proportion is normally distributed with a mean of 0.50 and a standard
deviation of 0.04564. We want to know the probability that no more than 40% of the sampled
births are boys.
Because we know the population standard deviation and the sample size is large, we'll use the
normal distribution to find probability. To solve the problem, we plug these inputs into the
Normal Probability Calculator: mean = .5, standard deviation = 0.04564, and the normal
random variable = .4. The Calculator tells us that the probability that no more than 40% of
the sampled births are boys is equal to 0.014.
Note: This problem can also be treated as a binomial experiment. Elsewhere, we showed how
to analyze a binomial experiment. The binomial experiment is actually the more exact
analysis. It produces a probability of 0.018 (versus the probability of 0.014 that we found using
the normal distribution). Without a computer, the binomial approach is computationally
demanding. Therefore, many statistics texts emphasize the approach presented above, which
uses the normal distribution to approximate the binomial.
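Both routes, the normal approximation and the exact binomial sum, are easy to compare in
code. A sketch (note that 40% of 120 births is 48 boys):

```python
from math import comb, erf, sqrt

n, p = 120, 0.5

# Normal approximation: the sample proportion ~ Normal(0.5, sqrt(p*(1-p)/n)).
se = sqrt(p * (1 - p) / n)
z = (0.40 - p) / se
normal_prob = 0.5 * (1 + erf(z / sqrt(2)))

# Exact binomial: P(X <= 48) for X ~ Binomial(120, 0.5).
# Since p = 0.5, each term is C(n, x) * 0.5**n.
exact_prob = sum(comb(n, x) * 0.5**n for x in range(49))

print(round(normal_prob, 3))  # 0.014
print(round(exact_prob, 3))   # 0.018
```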
Suppose we have two populations with proportions equal to P1 and P2. Suppose further that
we take all possible samples of size n1 and n2. And finally, suppose that the following
assumptions are valid.
The size of each population is large relative to the sample drawn from the population.
That is, N1 is large relative to n1, and N2 is large relative to n2. (In this context,
populations are considered to be large if they are at least 20 times bigger than their
sample.)
The samples from each population are big enough to justify using a normal
distribution to model differences between proportions. The sample sizes will be big
enough when the following conditions are met: n1P1 > 10, n1(1 - P1) > 10, n2P2 > 10,
and n2(1 - P2) > 10. (This criterion requires that more than 20 observations be sampled
from each population. When P1 or P2 is more extreme than 0.5, even more
observations are required.)
The samples are independent; that is, observations in population 1 are not affected by
observations in population 2, and vice versa.
The expected value of the difference between all possible sample proportions is equal
to the difference between population proportions. Thus, E(p1 - p2) = P1 - P2.
It is straightforward to derive the last bullet point, based on material covered in previous
lessons. The derivation starts with a recognition that the variance of the difference between
independent random variables is equal to the sum of the individual variances. Thus,
σ²d = σ²(p1 - p2) = σ²p1 + σ²p2
If the populations N1 and N2 are both large relative to n1 and n2, respectively, then
σ²p1 = P1(1 - P1) / n1
and
σ²p2 = P2(1 - P2) / n2
Therefore,
σ²d = [ P1(1 - P1) / n1 ] + [ P2(1 - P2) / n2 ]
and
σd = sqrt{ [ P1(1 - P1) / n1 ] + [ P2(1 - P2) / n2 ] }
In this section, we work through a sample problem to show how to apply the theory presented
above. In this example, we will use Stat Trek's Normal Distribution Calculator to compute
probabilities. The calculator is free.
Make sure the samples from each population are big enough to model differences with
a normal distribution. Because n1P1 = 100 * 0.52 = 52, n1(1 - P1) = 100 * 0.48 = 48,
n2P2 = 100 * 0.47 = 47, and n2(1 - P2) = 100 * 0.53 = 53 are each greater than 10, the
sample size is large enough.
Find the mean of the difference in sample proportions: E(p1 - p2) = P1 - P2 = 0.52 - 0.47 = 0.05.
Find the probability. This problem requires us to find the probability that p1 is less
than p2. This is equivalent to finding the probability that p1 - p2 is less than zero. To
find this probability, we need to transform the random variable (p1 - p2) into a z-score.
That transformation appears below.
z(p1 - p2) = [ (p1 - p2) - E(p1 - p2) ] / σd = (0 - 0.05)/0.0706 = -0.7082
where σd = sqrt{ [ (0.52)(0.48)/100 ] + [ (0.47)(0.53)/100 ] } = 0.0706.
Using Stat Trek's Normal Distribution Calculator, we find that the probability of a z-score being -0.7082 or less is 0.24.
Therefore, the probability that the survey will show a greater percentage of Republican voters
in the second state than in the first state is 0.24.
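The steps above can be sketched in a few lines, taking the values from this example
(n1 = n2 = 100, P1 = 0.52, P2 = 0.47):

```python
from math import erf, sqrt

p1, p2, n1, n2 = 0.52, 0.47, 100, 100

# Standard deviation of the difference in sample proportions.
sd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# z-score at p1 - p2 = 0, the point where the sample proportions are equal.
z = (0 - (p1 - p2)) / sd

# P(p1 - p2 < 0), via the standard normal CDF.
prob = 0.5 * (1 + erf(z / sqrt(2)))

print(round(sd, 4))    # 0.0706
print(round(z, 3))     # -0.708
print(round(prob, 2))  # 0.24
```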
Note: Some analysts might have used the t-distribution to compute probabilities for this
problem. We chose the normal distribution because the population variance was known and
the sample size was large. In a previous lesson, we offered some guidelines for choosing
between the normal and the t-distribution.
The size of each population is large relative to the sample drawn from the population.
That is, N1 is large relative to n1, and N2 is large relative to n2. (In this context,
populations are considered to be large if they are at least 10 times bigger than their
sample.)
The samples are independent; that is, observations in population 1 are not affected by
observations in population 2, and vice versa.
The set of differences between sample means is normally distributed. This will be true
if each population is normal or if the sample sizes are large. (Based on the central
limit theorem, sample sizes of 40 would probably be large enough).
The expected value of the difference between all possible sample means is equal to
the difference between population means. Thus, E(x1 - x2) = μd = μ1 - μ2.
The standard deviation of the difference between sample means (σd) is approximately
equal to:
σd = sqrt( σ²1 / n1 + σ²2 / n2 )
It is straightforward to derive the last bullet point, based on material covered in previous
lessons. The derivation starts with a recognition that the variance of the difference between
independent random variables is equal to the sum of the individual variances. Thus,
σ²d = σ²(x1 - x2) = σ²x1 + σ²x2
If the populations N1 and N2 are both large relative to n1 and n2, respectively, then
σ²x1 = σ²1 / n1
and
σ²x2 = σ²2 / n2
Therefore,
σ²d = σ²1 / n1 + σ²2 / n2
and
σd = sqrt( σ²1 / n1 + σ²2 / n2 )
(C) 0.045
(D) 0.055
(E) None of the above
Solution
The correct answer is B. The solution involves three or four steps, depending on whether you
work directly with raw scores or z-scores. The "raw score" solution appears below:
Find the mean difference (male absences minus female absences) in the population.
μd = μ1 - μ2 = 15 - 10 = 5
Find the probability. This problem requires us to find the probability that the average
number of absences in the boy sample minus the average number of absences in the
girl sample is less than 3. To find this probability, we use Stat Trek's Normal
Distribution Calculator. Specifically, we enter the following inputs: 3, for the normal
random variable; 5, for the mean; and 1.1, for the standard deviation. We find that the
probability of the mean difference (male absences minus female absences) being 3 or
less is about 0.035.
Thus, the probability that the difference between samples will be no more than 3 days is
0.035.
Alternatively, we could have worked with z-scores (which have a mean of 0 and a standard
deviation of 1). Here's the z-score solution:
Find the mean difference (male absences minus female absences) in the population.
μd = μ1 - μ2 = 15 - 10 = 5
Find the z-score that is produced when boys have three more days of absences than
girls. When boys have three more days of absences, the number of male absences
minus female absences is three. And the associated z-score is
z = (x - μd)/σd = (3 - 5)/1.1 = -2/1.1 = -1.818
Find the probability. To find this probability, we use Stat Trek's Normal Distribution
Calculator. Specifically, we enter the following inputs: -1.818, for the normal random
variable; 0, for the mean; and 1, for the standard deviation. We find that the
probability of a z-score being -1.818 or less is about 0.035.
Of course, the result is the same, whether you work with raw scores or z-scores.
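The z-score route is easy to verify in code, using the numbers given above (mean difference
5, standard deviation 1.1):

```python
from math import erf, sqrt

mu_d, sd_d, x = 5, 1.1, 3

# z-score for a difference of 3 days, then the standard normal CDF at that z.
z = (x - mu_d) / sd_d
prob = 0.5 * (1 + erf(z / sqrt(2)))  # P(difference <= 3)

print(round(z, 3))     # -1.818
print(round(prob, 3))  # 0.035
```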
Note: Some analysts might have used the t-distribution to compute probabilities for this
problem. We chose the normal distribution because the population variance was known and
the sample size was large. In a previous lesson, we offered some guidelines for choosing
between the normal and the t-distribution.