CHAPTER 6
Probability Distributions
1. Probability Distributions
In business it is often necessary to make statements along the lines of “the probability of observing a
given event is p.” We want to quantify the likelihood of certain events occurring. For example, how
likely is it that a given function of the random outcome takes a certain value?
Probability distributions list or describe probabilities for all possible outcomes of a random process.
They can be extended to answer questions about the probability that a random variable (a numerical
function of the random outcome) takes a certain value or lies in a certain range of values.
Think of a probability distribution as laying out how probability is spread among all the outcomes.
1.2. Probability distributions. The set of probabilities associated with all events or outcomes of
an uncertain quantity is called its probability distribution: this is a means of quantifying and representing
the uncertainty associated with a sample space and its events.
Thus, if S is the sample space (set of all outcomes), a probability distribution is just an assignment of
a probability to each event or outcome1 in S.
There are two types of probability distribution:
• Discrete distributions (covered in detail in the next chapter);
• Continuous distributions (covered in detail in the chapter after next).
A discrete probability distribution applies when the sample space of possible outcomes is discrete, such
as rolling dice or tossing a coin. It can be described by a discrete list of the probabilities of the outcomes,
known as a probability function or probability mass function.
A continuous probability distribution applies when the outcomes can take on values in a continuous
range (real numbers), such as the height of a person. It is usually described by a probability density
function where the probability of the outcome lying within a certain range is given by the area under
the density function within that range (and the probability of any single outcome is zero).
1We will see later that for a continuous probability distribution, the probability of any individual outcome is technically
zero: we only get nonzero probabilities for “big enough” events (subsets of outcomes).
78 MIS10090
Figure 6.1 shows the distinction between the two types of distribution. In a discrete distribution, prob-
abilities are given at points with distinct intervals between them; whereas in a continuous distribution,
the probabilities can be defined at any point across a range.
There are several standard probability distributions that apply in many — though not all — cases. Based
on empirical observations and theory, we may decide that one of these is most appropriate for a given
situation, e.g., Poisson distribution for number of arrivals per unit time in a queue; normal distribution
for heights of people. Some of these are discussed in subsequent chapters.
There are certain summary parameters for probability distributions such as the mean (also called ex-
pected value), standard deviation, etc.; but these are summaries: they do not tell the whole story that
the distribution does. We can use these if the outcomes are numerical, or if we can convert the outcomes
to numerical values via a random variable.
1.3. Visualising probability distributions. In Chapter 1, we used histograms to plot data sets:
they allow us to visualise where the measures of central tendency occur, how variable (spread out) the data
are, and what kind of pattern or distribution the data exhibit. For many large symmetrical distributions,
there is an empirical rule that provides an estimate of the percentage of observations that are contained
within a number of units of standard deviation about the mean. This particularly applies to the normal
distribution that you will meet later, but the normal is often a good approximation to other symmetrical
distributions, under certain conditions. In the case of the normal distribution:
• Approximately 68% of the observations are contained within a symmetrically distributed band
of width one unit of standard deviation either side of the mean. We can interpret this as
saying that 68% of the population lie within one standard deviation of the mean. See
Figure 6.2. Or, if we randomly selected one member of the population, the probability that its
value lies within one standard deviation of the mean is approximately 0.68.
Figure 6.2. Proportions of normal observations within certain distances from the mean
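These empirical-rule percentages can be computed directly: for a normal distribution, the proportion of observations within k standard deviations of the mean is erf(k/√2). A minimal sketch in Python, using only the standard library:

```python
import math

# Proportion of a normal population lying within k standard deviations
# of the mean: P(|X - mu| <= k * sigma) = erf(k / sqrt(2)).
def within_k_sigma(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {within_k_sigma(k):.2%}")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```

This reproduces the 68% figure above, along with the corresponding figures for two and three standard deviations.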
You will meet this idea again when you discuss “6 sigma” in quality control during your business studies.
Six sigma means 99.9997% of observations are contained within six units of standard deviation either
side of the mean. In the case of quality control, it means that 99.9997% of manufactured products meet
quality standards, so only three out of every million fail.
1.4. Discrete probability distributions. Discrete probability distributions are defined on a dis-
crete sample space, that is, a set of outcomes separated by clear gaps.2 An important example for us is
the case of a finite number of possible values (i.e., there is a finite number of possible outcomes: a finite
sample space).
For a discrete distribution, we have a probability function or probability mass function: the function that
assigns weight (i.e., probability) to individual outcomes.
Example 6.1. You book a group skiing trip, hoping to get up to five other people from the class to
come along. You write X for the number that might come and assess the chances of 0, 1, 2, 3, 4 or 5
people coming:
P (X = 0) = 0.03 P (X = 3) = 0.34
P (X = 1) = 0.10 P (X = 4) = 0.25
P (X = 2) = 0.18 P (X = 5) = 0.10
Notice that the total probability in a distribution adds up to 1: this follows from our basic rules of probability
(since the distribution describes the probabilities of all outcomes in the sample space). }
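The "adds up to 1" check is easy to carry out by hand or in code. A quick sketch in Python using the probabilities from Example 6.1:

```python
# P(X = x) for x = 0..5 from the skiing-trip example.
p = {0: 0.03, 1: 0.10, 2: 0.18, 3: 0.34, 4: 0.25, 5: 0.10}

# Total probability across the whole sample space must be 1.
total = sum(p.values())
print(round(total, 10))  # 1.0
```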
A barchart (somewhat like a discrete version of a histogram) can be used to display a discrete probability
distribution in graphical form: there is a bar for each possible measured numerical value and the height
of the bar is the probability of seeing that value: see Figure 6.3.
In such cases, we can express probability by talking about the probability of the variable falling within
a certain range. (Technically, this range is called an interval, e.g., [1, 2.5] is the set of all real numbers
between 1 and 2.5, inclusive.)
For example: “There is a 0.5 probability of sales falling between 65,000 and 85,000 units”.
Example 6.2. Suppose Bling is a new celebrity gossip magazine, launching in September; denote its
circulation by C.
Question: What circulation will it have by the end of the year?
The outcome C is an uncertain numerical value: thus C is a random variable. The continuous probability
distribution for the random variable C is constructed on the basis of intervals:
P(C ≤ 5,000) = 0
P(5,000 < C ≤ 15,000) = 0.20
P(15,000 < C ≤ 25,000) = 0.45
P(25,000 < C ≤ 33,000) = 0.25
P(33,000 < C ≤ 40,000) = 0.10
When we plot a smooth curve so that the area under the curve between 5,000 and 15,000 is 0.20, the
area under the curve between 15,000 and 25,000 is 0.45, etc., we get something like Figure 6.4. This
Figure 6.4. Bling density function: P(a ≤ C ≤ b) = area under curve between a and b
is the density function for this distribution: the probability of the random variable C taking a value
between two given numbers a and b is the area under this density function curve between a and b. }
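To make the interval bookkeeping concrete, here is a small sketch storing the Bling probabilities per interval; the prob_between helper is ours (not from the text) and only handles ranges made up of whole intervals:

```python
# Bling circulation (Example 6.2): probability assigned to intervals.
# Each tuple is (lower, upper, p), meaning P(lower < C <= upper) = p.
intervals = [
    (5_000, 15_000, 0.20),
    (15_000, 25_000, 0.45),
    (25_000, 33_000, 0.25),
    (33_000, 40_000, 0.10),
]

# Probability that C falls in a range made up of whole intervals,
# e.g. P(5,000 < C <= 25,000) = 0.20 + 0.45.
def prob_between(a, b):
    return sum(p for lo, hi, p in intervals if lo >= a and hi <= b)

print(round(prob_between(5_000, 25_000), 2))      # 0.65
print(round(sum(p for _, _, p in intervals), 2))  # total probability: 1.0
```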
2. Random variables
We have hinted at the use of random variables but let us now treat them more fully. You will be familiar
with the idea of using a variable to represent some unknown value. In Chapter 3, we introduced the
idea of indexed notation where we have different values of a variable, perhaps representing data values
gathered as part of a survey or experiment, or data values extracted from a company’s IT system.
It is useful to be able to refer to variables derived from an outcome of an experiment. Let’s take the
example of a customer in a supermarket. We may wish to know the number of items bought by a
customer (a discrete value) or the amount of money spent by the customer (a continuous value). We
can determine these values after the customer has finished shopping. Note that these values vary from
customer to customer and cannot be determined before the customer completes their shopping, so they
are uncertain. In fact the customer’s shopping basket is the random outcome and we are interested in
numbers that are derived from this random outcome.
Data Analysis for Decision Makers 81
Definition 6.3. A random variable (also called stochastic variable) is a numerical variable X whose
value depends on chance (i.e., randomness). The value X takes can be an integer or possibly any real
number.3
Thus, a random variable is simply a numerical variable whose value is derived from the outcome of a
particular experiment. It represents a possible numerical value arising from an uncertain event. Focussing
on numerical quantities means we can carry out operations on them like addition, multiplication, etc.
It is a standard convention to use a capital letter for a random variable and a small letter for the value
it takes. For example, X might be a random variable, and it might take the particular value x.
We might want to write the mathematical expression that gives the sum of the number of items bought
by four different customers. To do this, we use indexed notation: for each i = 1, 2, 3, 4, let Xi represent
the number of items bought by customer i. We write the expression Σ_{i=1}^{4} Xi = X1 + X2 + X3 + X4
for the sum of the numbers of items bought by the four customers. Each Xi is a random variable, and they
can be combined arithmetically.
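In code, an indexed collection plays the role of X1, ..., X4. A hypothetical sketch (the 0 to 20 range for item counts is invented purely for illustration):

```python
import random

random.seed(1)

# X1..X4: number of items bought by each of four customers. Drawing
# each count uniformly from 0..20 is a hypothetical modelling choice.
X = [random.randint(0, 20) for _ in range(4)]

# The indexed sum X1 + X2 + X3 + X4.
total_items = sum(X)
print(X, total_items)
```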
Other examples of random variables include the number of siblings a person has, the number of people
at a bus stop at a given time, or the height of a randomly-chosen person.
More generally, a random variable’s possible values might be derived from the possible outcomes of a
measurement or experiment or some other “objectively random” process (e.g., our dice example).
Note: even though X is generally a numerical-valued function, it is not a probability. The probability
measure P on S is what gives probabilities of events (of course, an event might be a single outcome).
Instead, X describes some numerical property that outcomes in S might have.
We mostly use distributions when finding the probability that a certain random variable takes a certain
value or lies in a certain range of values.
2.1. Discrete and continuous random variables. Random variables can be either discrete or
continuous:
• if the set of values a random variable X can take are separated by gaps (e.g., X can only take
integer values), we say X is a discrete random variable;
• if the set of values X can take has no gaps between its elements (there is a continuous spread
of values), we say X is a continuous random variable.
We have already been quietly using random variables: in the skiing example earlier, X was a discrete
random variable; while in the Bling example, the circulation C was a continuous random variable.
Example 6.4 (Discrete random variable). As another example, our sample space S might be the set
of outcomes of rolling two fair dice. We might define a random variable X to be the sum of the two
numbers shown on the dice. If the numbers were 3 and 4, then the value of X for that outcome would
be 3 + 4 = 7. This is a discrete random variable, with X(a, b) = a + b. }
Example 6.5 (Continuous random variable). Yet another example: the sample space S might be the
people in this class, and the random variable could be the weight of a person chosen at random. This is
a continuous random variable. }
We will often use the abbreviation DRV for a discrete random variable, and CRV for a continuous random
variable.
We will devote the two chapters following this one to particular examples of (a) discrete and (b) contin-
uous random variables that arise commonly and are particularly important. For the rest of this chapter,
we will focus on topics common to all random variables.
3More precisely, a random variable is actually a function X defined on a sample space S, X : S → R, with the outputs of
X being numerical values. Thus, strictly speaking, the term random variable is one of the most misleading in mathematics:
it is neither random nor a variable; it is a deterministic function from a sample space to the real numbers R, which associates
with each outcome a number.
2.2. Probability of a random variable taking a given value. Each value X can take has an
associated probability. The notation P (X = x) is read as “the probability that the random variable X
has the particular value x”. You’ll sometimes see P (X = x) written as PX (x) or just P (x) for short. It
is best to use P (X = x) to avoid ambiguity.
Example 6.6. Continuing our dice Example 6.4, there are 36 possible outcomes for two six-sided dice,
each equally likely since the dice are fair. Six of these outcomes, (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
give X the value of 7. So P (X = 7) = 6/36 = 1/6. }
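This count can be checked by brute force: enumerate all 36 outcomes and count those where the sum is 7. A sketch in Python:

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# The random variable X(a, b) = a + b; keep outcomes where X = 7.
favourable = [(a, b) for (a, b) in outcomes if a + b == 7]

p = len(favourable) / len(outcomes)
print(len(favourable), round(p, 4))  # 6 outcomes, probability 0.1667
```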
3.1. Mean or Expected Value of a random variable. Because random variables are numerical,
we can carry out mathematical operations on their values and thus talk about the mean, variance, etc.,
of the random variable. This mainly makes sense when carried out over all values the random variable
can take, that is, over the whole sample space.
The mean (or expected value) of a random variable is the probability-weighted average of its possible
values.
It can be denoted by µX or just µ (mu), or E(X) for a random variable X.
The expected value is the weighted average of all the possible values that the discrete random variable
can take on. We calculate E(X) as the weighted sum of the values of X, where the weight given to each
value is the probability that X takes that value.
We will do this for discrete random variables. It can also be done for continuous random variables; but
this requires integral calculus, a tool which is beyond our scope; so we will not cover it.
Suppose we have a discrete probability distribution describing a random variable X which can take
possible values x1 , . . . , xn with associated probabilities p1 , . . . , pn ; that is, for each i, pi = P (X = xi ).
The mean is defined as the sum
E(X) = µ = Σ_x x P(X = x) = Σ_{i=1}^{n} xi P(X = xi)
       = x1 P(X = x1) + x2 P(X = x2) + · · · + xn P(X = xn)
       = x1 p1 + x2 p2 + · · · + xn pn.
That is, sum up each possible value of X weighted by the probability of that value.
Note that, if all values of X are equally likely, we would give the same weight P (X = xi ) = 1/n to each
xi .
Example 6.7. People joining your group skiing trip
P (X = 0) = 0.03 P (X = 3) = 0.34
P (X = 1) = 0.10 P (X = 4) = 0.25
P (X = 2) = 0.18 P (X = 5) = 0.10
The mean is
0 × 0.03 + 1 × 0.10 + 2 × 0.18 + 3 × 0.34 + 4 × 0.25 + 5 × 0.10 = 2.98.
Note: even though X can only take integer (whole number) values, its mean need not be an integer. }
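The calculation above is a one-line weighted sum in code. A sketch using the same probabilities:

```python
# E(X) = sum of x * P(X = x), using the skiing-trip probabilities.
p = {0: 0.03, 1: 0.10, 2: 0.18, 3: 0.34, 4: 0.25, 5: 0.10}

mean = sum(x * px for x, px in p.items())
print(round(mean, 2))  # 2.98
```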
As we mentioned before when discussing Descriptive Statistics, if we have a data set D with no probability
distribution on it, we have no reason to assume one outcome is more likely than another, so we assign
them equal probability. For example, if the population (sample space) has size N, and we make this
“equally likely” assumption, each outcome will thus have probability (weight) 1/N (so total probability
across the sample space is N × (1/N) = 1, as required).
Thus, the concepts we saw before about (arithmetic) mean of a numerical attribute of data are just a
special case of what we are discussing now, where all outcomes have the same probability, e.g., given
a finite population D, the population mean µ of a numerical attribute is the arithmetic mean of this
attribute, taken over all members of the population:
µ = (1/N) Σ_{i=1}^{N} xi = Σ_{i=1}^{N} (1/N) xi = (x1 + x2 + · · · + xN) / N.
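A quick sketch confirming that this equal-weights special case coincides with the ordinary arithmetic mean (the data values are hypothetical):

```python
# With N equally likely outcomes each value gets weight 1/N, so the
# probability-weighted mean reduces to the arithmetic mean.
data = [4, 8, 15, 16, 23]  # hypothetical population values
N = len(data)

weighted = sum(x * (1 / N) for x in data)   # sum of x_i * P(X = x_i)
arithmetic = sum(data) / N                  # (x_1 + ... + x_N) / N
print(round(weighted, 6), round(arithmetic, 6))
```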
The variance of a random variable X is an important measure of variation: it shows the average squared
variation about the mean. It is denoted by Var(X) or σX² (sigma squared).
To compute it: calculate the difference between the mean (expected value) µ = E(X) and each possible
value xi of X and square that difference; then find the probability-weighted mean of these squared
differences.
Once the variance Var(X) of X has been found, taking its square root gives the standard deviation
σX = √Var(X) of X.
As with the mean, we will only cover the variance for discrete random variables. It can also be done for
continuous random variables, but again requires integral calculus, which is beyond our scope.
Special case, seen before: if all values are equally likely (e.g., if we have no prior knowledge of probabil-
ities), then the population variance can be written as
σ² = Var(X) = (1/N) Σ_{i=1}^{N} (xi − µ)²,  where population mean µ = (1/N) Σ_{i=1}^{N} xi.
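The variance recipe for this equally-likely special case can be sketched as follows (again with hypothetical data values):

```python
import math

# Population variance with equally likely values:
# variance = (1/N) * sum of (x_i - mu)^2; std deviation = sqrt(variance).
data = [4, 8, 15, 16, 23]  # hypothetical population values
N = len(data)

mu = sum(data) / N
variance = sum((x - mu) ** 2 for x in data) / N
sigma = math.sqrt(variance)
print(round(mu, 2), round(variance, 2), round(sigma, 3))
```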
As mentioned before
Figure 6.5. Flipping two coins: outcomes (left); probability distribution table (top
right); and probability distribution barchart (bottom right)
Next we convert the frequency distribution to a probability distribution. We can work out the probabil-
ities for all the values that X may have.
For example, there were no days out of the 30 day trial when you sent zero texts, so the likelihood of
sending zero texts, X = 0, is 0/30. We say P (X = 0) = 0.
Similarly, from your sample you can calculate the probability you send 2 texts on any day: it is 4/30 or
0.133. We say P (X = 2) = 0.133.
We can represent the probability distribution as a barchart graphic: see Figure 6.6.
In deciding which is the most suitable mobile phone package for you, you need to know how many texts
you send on average per day; in this case:
E(X) = 0 × P(X = 0) + 1 × P(X = 1) + 2 × P(X = 2) + · · · + 8 × P(X = 8) = 4.37.
On average, you expect to send 4.37 texts per day. }
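The frequency-to-probability conversion can be sketched in code. The chapter's full 30-day table is not reproduced here, so the daily counts below are hypothetical, chosen only to match the constraints stated in the example (30 days, no day with zero texts, four days with exactly two texts, mean ≈ 4.37):

```python
# Hypothetical frequency table: freq[x] = number of days with x texts.
freq = {0: 0, 1: 2, 2: 4, 3: 4, 4: 6, 5: 5, 6: 4, 7: 4, 8: 1}
n_days = sum(freq.values())  # the 30-day trial

# Convert frequencies to probabilities: P(X = x) = freq[x] / 30,
# then compute the expected value E(X) as a weighted sum.
prob = {x: f / n_days for x, f in freq.items()}
mean = sum(x * p for x, p in prob.items())

print(round(prob[2], 3), round(mean, 2))  # 0.133 and 4.37
```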
Exercise 6.10. Find the variance and standard deviation of the random variable X = volume of texts
sent, from Example 6.9.
Exercise 6.11. Find the variance and standard deviation of the random variable X = number of people
coming on the skiing trip with you, from Example 6.7.