
Republic of the Philippines

POLYTECHNIC UNIVERSITY OF THE PHILIPPINES


Office of the Vice President for Branches and Campuses
Biñan Campus

Instructional Materials
In
STAT 20053

Statistical Analysis
with Software Application

Compiled by:

Israel G. Ortega, LPT


Faculty, PUP Biñan Campus
MODULE 1 – STATISTICAL MODELS AND UNCERTAINTIES

OVERVIEW:
You are taking part in a gameshow. The host of the show, who is known as Monty, shows
you three outwardly identical doors. Behind one of them is a prize (a sports car), and behind the
other two are goats. You are asked to select, but not open, one of the doors. After you have done
so, Monty, who knows where the prize is, opens one of the two remaining doors. He always opens
a door he knows will reveal a goat, and randomly chooses which door to open when he has more
than one option (which happens when your initial choice contains the prize). After revealing a
goat, Monty gives you the choice of either switching to the other unopened door or sticking with
your original choice. You then receive whatever is behind the door you choose. What should you
do, assuming you want to win the prize?

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Identify the Monty Hall problem.
2. Explain decision making under uncertainty and uncertainty in the news.
3. Discuss simplicity and complexity needs for models.
4. Perform the process of model building and making assumptions.

COURSE MATERIALS:

1.1 The Monty Hall Problem

The famous Monty Hall Problem is a classic example of decision making under
uncertainty. In module 2, we will solve this problem formally, but for now appreciate that at each
round of the game you, as the player, do not know where the sports car is. To begin with, the only
certainty you have is that the sports car must be behind one of the three doors. You may, or may
not, initially choose the ‘correct' door (assuming you want to win the prize!) but there is no certainty
in your choice. Upon revealing a goat behind one of the doors you did not choose, you still face
uncertainty (the only certainty you have is that the sports car must be behind one of the two
unopened doors). The “controversy” arose over the American game show `Let's Make a Deal',
and the New York Times (among others) devoted two pages to the problem, readers' letters etc.

Bewildered game show players wrote to Marilyn vos Savant, an advice columnist for
Parade Magazine, and asked for her opinion in her `Ask Marilyn' column. Savant – who is credited
by the Guinness Book of Records as having the highest IQ of any woman in the world – gave her
decision. She said, “You should change your choice”. There then followed a long argument in the
correspondence columns, some supporting Savant's decision and others saying that it was
nonsense. What do you think, and why?

1.2 Decision Making Under Uncertainty

To study, or not to study? To invest, or not to invest? To marry, or not to marry? These,
among others, are decisions many of us face during our lives. Of course, decisions have to be
taken in the present, with uncertain future outcomes.
In the workplace, for example, making decisions is the most important job of any executive.
However, it is also the toughest and riskiest job. Bad decisions can damage a business, a
reputation and a career, sometimes irreparably. Good decisions can result in promotion, a strong
reputation and making money! Today we are living in the age of technology, with two important
implications for everyone.

1. Technology has made it possible to collect vast amounts of data – the era of `big data'.
2. Technology has given many more people the power and responsibility to analyse data and
make decisions on the basis of quantitative analysis.

A large amount of data already exists, and it will only increase further in the future. Many
companies, rightly, are seeing data-driven decision-making as a source of competitive advantage.
By using quantitative methods to uncover and extract the information in the data and then acting
on this information – guided by quantitative analysis – they are able to gain advantages which
their more qualitatively-oriented competitors are not able to gain.

Today, demand for people with quantitative skills far exceeds supply, creating a skills
deficit. With demand set to increase further, and supply failing to keep pace, Economics 101 tells you that the price rises whenever demand exceeds supply. Of course, the “price” being referred to here is that of an employee, i.e. the salary which quantitative staff can command (already high) is set to rise even further. Decision-making is the process we undertake when faced with a problem or decision having more than one possible outcome.

The possible results from the decision are a function of both internal variables (which we
can control) and external variables (which we cannot control), each of which cannot be
expressed with certainty. Hence the outcome cannot be known in advance with certainty. When
evaluating all decision-making, we start with structuring the problem.

Example: What price should we charge for our new product?

Determine the set of possible alternatives, for example:

• P100.00
• P200.00
• P300.00
• P400.00

Determine the possible criteria which could be used to evaluate the alternatives:
• qualitative analysis
• quantitative analysis

An example of a qualitative analysis:

• “Well, last time we brought a new product to the market, we priced it at P200.00 and we sold out on the first day. This time let's price it higher.”

This is all very well, but how much “higher”? How would we justify a specific increase of Px?

An example of a quantitative analysis:

• “What do we know about current market demand?”
• “What do we know about competitive market factors?”
• “Where will we manufacture the new product and how much will it cost to bring it to the market?”
• “How will we advertise and how much will the advertising cost?”
• …

Note this is not an exhaustive list, but clearly market demand, competition, production
costs and advertising expenditure (among other factors) are likely to be relevant to the price-
setting problem.
For all decisions, we need to determine the influencing factors which could either be internal or
external, such as:
• demand and competitive supply
• availability of labour and materials
• …

which are then used to derive expected results or consequences. Of course, determining which
are the influencing factors, and their corresponding weights of influence, is not necessarily easy,
but a thoughtful consideration of these is important due to their cumulative effect on the outcome.

In a qualitative analysis, once we have determined a preliminary list of the factors which
we think will affect the possible outcomes of the decision:
• the management team “qualitatively” evaluates how each factor could affect the decision
• this discussion leads to an assessment by the decision-maker
• the decision is made, followed by implementation, if necessary.

For example, in a qualitative analysis we might describe the potential options in a decision
tree (covered in module 6) in which we can include the concept of probable outcomes. We could
make this assessment using the (qualitative) qualifiers of:
• optimistic
• conservative
• pessimistic

However, a qualitative approach is inevitably susceptible to judgment and hence biases on the part of the decision-makers. “Gut instinct” can lead to good outcomes, but in the long run
is far from optimal. In a quantitative analysis, once we have determined a preliminary list of the
factors which we think will affect the possible outcomes of the decision, we need to ask the
following questions.
• What do we know?
• What data can we “mine” which will help us understand the factors and the effect each will have on the possible outcomes?

In a quantitative analysis, the evaluation becomes a process of using mathematics and statistical techniques. These are used to find predictive relationships between the factors, the
potential outcomes of the problem we are seeking to understand and the decision we are seeking
to make.

Our objective becomes to define mathematically the relationships which might exist. Next,
we evaluate the significance of the predictive value of the relationships found. An assessment of
the relationships which our analysis defines leads us to be able to quantitatively express the
expected results or consequences of the decision we are making.

1.3 Uncertainty in the News

“News” is not “olds”. News reports new information about events taking place in the world
– ignoring fake news! Especially in business news, you will find numerous reports discussing the
many uncertainties being faced by business. While uncertainty certainly makes life exciting, it
makes decision-making particularly challenging. Should a firm increase production? Advertise?
Cut back? Merge?

Decisions are made in the present, with uncertain future outcomes. Hence many media
reports will comment on the uncertainties being faced. Of course, some eras are more uncertain
than others. Indeed, 2016 was the year of the black swan – low-probability, high-impact events
– with the Brexit referendum vote and the election of Donald Trump to the White House the main
geopolitical stories. Both outcomes were considered unlikely, yet they both happened. Some
prediction markets priced in a 25% probability of each of these outcomes, but a simple probability
calculation (as will be covered next module) would equate this to tossing a fair coin twice and
getting two heads – perhaps these were not such surprising results after all! Taking Brexit as an
example, immediately after the referendum result was known, uncertainty arose about exactly
what `Brexit' meant. Exiting the single market? Exiting the customs union?

Financial markets, in particular, tend to be very sensitive to news. Even stories reporting
comments from influential people, such as politicians, can move markets – sometimes
dramatically! For example, read `Flash crash sees the pound gyrate in Asian trading' available at
http://www.bbc.co.uk/news/business-37582150.

At one stage it fell as much as 6% to $1.1841 – the biggest move since the Brexit Vote –
before recovering. It was recently trading 2% lower at $1.2388. It is not clear what triggered the
sudden sell-off. Analysts say it could have been automated trading systems reacting to a news
report. The Bank of England said it was ‘looking into' the flash crash. The sharp drop came after
the Financial Times published a story online about French President Francois Hollande
demanding ‘tough Brexit negotiations’.

Increasingly, quantitative hedge funds and asset managers will trade algorithmically, with
computers designed to scan the internet for news stories and interpret whether news reports
contain any useful information which would allow a revision of probabilistic beliefs (we have
already seen an example of this with the Monty Hall problem, and will formally consider ‘Bayesian
updating’ in module 2). Here, the demand for ‘tough Brexit negotiations’ by the then French
President would be interpreted as being bad for the UK, which would lead to a further depreciation
in the pound sterling.
“These days some algos trade on the back of news sites, and even what is trending on
social media sites such as Twitter, so a deluge of negative Brexit headlines could have led to an
algo taking that as a major sell signal for the pound,” says Kathleen Brooks, research director at
City Index.

So, from now on, when you read (or listen to) the news, keep an eye out for (or ear open
to!) the word ‘uncertainty’ and consider what kinds of decisions are being made in the face of the
uncertainty.

1.4 Simplicity vs. Complexity – The Need for Models

Is the real world:

a. Nice, simple and easy?


b. Big, horrible and complicated?

Answer, b!

Although we care about the real world, seek to understand it and make decisions in reality,
we have an inherent dislike of complexity. Indeed, in the social sciences the real world is a highly
complex web of interdependencies between countless variables.

For example, in economics, what determines the economic performance of a country?


From national income accounting in Economics 101 you might say consumption, investment,
government spending and net exports, but what affects, say, consumption? Consumer
confidence? Perhaps, but what drives consumer confidence? Consumers' incomes? Consumers'
inflationary expectations? Fears of job insecurity? Perceived level of economic competency of the government? The weather? Etc.

So in order to make any sense of the real world, we will inevitably have to simplify reality.
Our tool for achieving this is a model. A model is a deliberate simplification of reality. A good
model retains the most important features of reality and ignores less important details.
Immediately we see that we face a trade-off (an opportunity cost in Economics 101). The benefit
of a model is that we simplify the complex real world. The cost of a model is that the consequence
of this simplification of reality is a departure from reality. Broadly speaking, we would be happy if
the benefit exceeded the cost, i.e. if the simplicity made it easier for us to understand and analyse
the real world while incurring only a minimal departure from reality.

Example: The London Underground map

The world-famous London Underground map is an excellent example of a model for getting from point A to point B. The map contains the most important pieces of information for reaching your intended destination:
• distinct names and colors for each line
• the order of stations on each line
• the interchange stations between lines
while less important details are ignored, such as:
• the depth of each tunnel
• the exact distance between stations
• the non-linear nature of the tunnels under the ground.

Of course, an engineer would likely need to know these ‘less important details’, but for a
tourist visiting London such information is superfluous and the map is very much fit-for-purpose.
However, we said above that a model is a departure from reality, hence some caution should
always be exercised when using a model. Blind belief in a model might be misleading. For
example, the classic map fails to accurately represent the precise geographic location of stations. If we look at a geographically-accurate map, we see the classic map can be very misleading in terms of the true distance between stations – for example, the quickest route from Edgware Road to Marble Arch is to walk! Also, even line names can be a model – the Circle line (in yellow) is

clearly not a true circle! Does it matter? Well, the Circle line forms a loop and it is an easy name
to remember, so arguably here the simplification of the name outweighs the slight departure in
reality from a true circle!

Our key takeaway is that models inevitably involve trade-offs. As we further simplify reality
(a benefit), we further depart from reality (a cost). In order to determine whether or not a model is
“good”, we must decide whether the benefit justifies the cost. Resolving this benefit–cost trade-
off is subjective – further adding to life's complexities.

1.5 Assumptions

We have defined a model to be a deliberate simplification of reality. To assist with the process of model building, we often make assumptions – usually simplifying assumptions.
Returning to the (geographically-accurate) London Underground map, the Circle line (in yellow)
is not a perfect geometric circle, but here it is reasonable to assume the line behaves like a circle
as it does go round in a loop. So adopting the name ‘Circle line’ assumes its path closely
approximates a circle. I do not think anyone would seriously suggest the name ‘Circle line’ is
inappropriate! Moving to statistical models, we often make distributional assumptions, i.e. we
assume a particular probability distribution (a concept introduced next module) for a particular
variable. In due course we will meet the normal distribution, the familiar bell-shaped curve:

Figure 1. Normal Distribution

The normal distribution is frequently-used in models. One example is that financial returns
on assets are often assumed to be normally distributed. Under this assumption of normality, the
probability of returns being within three standard deviations of the mean (mean and standard
deviation will be reviewed in modules 2 and 3) is approximately 99.7%. This means that the
probability of returns being more than three standard deviations from the mean is approximately
0.3%. (In Figure 1, the mean is 0 and the standard deviation is 1, so ‘mean ±3 standard deviations' equates to the interval [−3, 3]. This means that 99.7% of the total area under the curve
is between –3 and 3.) Assuming market returns follow a normal distribution is fundamental to
many models in finance, for example Markowitz's modern portfolio theory and the Black–Scholes–
Merton option pricing model. However, this assumption does not typically reflect actual observed

market returns and `tail events', i.e. black swan events (which recall are low-probability, high-
impact events), tend to occur more frequently than a normal distribution would predict! For now,
the moral of the story is to beware assumptions – if you make a wrong or invalid assumption, then
decisions you make in good faith may lead to outcomes far from what you expected. As an
example, the subprime mortgage market in the United States assumed house prices would only
ever increase, but what goes up usually comes down at some point.
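
The 99.7% figure quoted above can be checked numerically. The following is a minimal sketch (assuming Python with the SciPy library; the module does not prescribe a particular software package) which computes the probability of being within three standard deviations of the mean under a standard normal distribution:

from scipy.stats import norm

# P(-3 <= Z <= 3) for a standard normal random variable Z (mean 0, standard deviation 1)
p_within_3sd = norm.cdf(3) - norm.cdf(-3)
print(p_within_3sd)        # approximately 0.9973
print(1 - p_within_3sd)    # tail probability, approximately 0.0027 (about 0.3%)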

ACTIVITIES/ASSESSMENT
True or False. Write “True” if the statement is correct, and “False” otherwise.

1. A black swan event is a low-probability high-impact event.


2. A model is a perfect representation of reality.
3. A good model, other things equal, departs significantly from reality.
4. In the Monty Hall problem, initially all three doors have the same probability of hiding the sports
car.
5. External Variables are not under our control.
6. Under the assumption of normality, there is approximately 99.7% probability of being within three standard deviations of the mean.
7. There is a current skills surplus due to the supply of quantitative employees exceeding demand
from employers.
8. The classic London Underground map allows tourists to get from point A to point B.
9. The “Circle Line” is a sensible choice of name for the “Circle Line”.
10. Decision-making is a process when one is faced with a problem or decision having more than
one possible outcome.

Watch:
“Evolution of the London Underground Map”
https://www.youtube.com/watch?v=1pMX7EkAhoA

“The Monty Hall Problem – Explained”


https://www.youtube.com/watch?v=9vRUxbzJZ9Y

MODULE 2 – QUANTIFYING UNCERTAINTY WITH PROBABILITY

OVERVIEW:
We're going to be considering quantifying uncertainty with probability. So in the first
module, this was just really the introduction to the course, and I wanted to get you thinking about
the general concepts of decision-making under uncertainty. And for example, we kicked off with
the Monty Hall problem. But of course, now we need to formalize things somewhat more. We
need to decide what exactly is probability and how do we quantify it? How do we determine this?
So to assist us with this, we do need to introduce some key vocabularies, some key lexicon, if
you will.
So we begin with the concept of an experiment, indeed a random experiment. Now
examples for this could be trivial things such as tossing a coin and seeing which is the uppermost face. Maybe it's rolling a die and seeing the score on the uppermost face there. Or maybe it could
be more sort of real world examples, whereby we look at some stock index like the FTSE 100 and
see what the change was in the value of that index on a particular trading day. So there's some
random experiment, which could lead to one of several possible outcomes. Now sometimes the
number of outcomes might be quite small. Tossing the coin, we've only really got two options
here, either the uppermost face is going to be heads or it's going to be tails. If we consider
extending it to rolling a die, well of course there, on a standard die, you have six possible values
for that uppermost face. The integers 1, 2, 3, 4, 5 and 6. If instead we consider the FTSE 100
Index, well of course, there are lots of possible values we can have for percentage change on a
particular trading day. So we have a random experiment, which results in a particular outcome.

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Quantify uncertainty with probability applied to some simple examples.
2. Recall a selection of common probability distributions.
3. Discuss how new information leads to revised beliefs.

COURSE MATERIALS:
2.1 Probability Principles

Probability is very important for statistics because it provides the rules which allow us to
reason about uncertainty and randomness, which is the basis of statistics and must be fully
understood in order to think clearly about any statistical investigation.

The first basic concepts in probability will be the following:


 Experiment: For example, rolling a single die and recording the outcome.
 Outcome of the experiment: For example, rolling a 3.
 Sample space S: The set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
 Event: Any subset A of the sample space, for example A = {4, 5, 6}.

Probability, P(A), will be defined as a function which assigns probabilities (real numbers)
to events (sets). This uses the language and concepts of set theory. So we need to study the
basics of set theory first.

A set is a collection of elements (also known as ‘members’ of the set).

Example: The following are all examples of sets, where “|” can be read as “such that”:
• A = {Amy, Bob, Sam}
• B = {1, 2, 3, 4, 5}
• C = {x | x is a prime number} = {2, 3, 5, 7, 11, …}
• D = {x | x ≥ 0} (that is, the set of all non-negative real numbers)
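
As a small illustrative sketch (assuming Python, whose built-in set type is not part of the original text), the finite sets above can be represented and queried for membership directly:

# The finite sets A and B from the example above
A = {"Amy", "Bob", "Sam"}
B = {1, 2, 3, 4, 5}

# Membership: "is an element of"
print("Amy" in A)   # True
print(6 in B)       # False

# Subsets, used shortly when defining events within a sample space
S = {1, 2, 3, 4, 5, 6}
event = {4, 5, 6}
print(event.issubset(S))   # True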

We consider four basic concepts in probability.

An experiment is a process which produces outcomes and which can have several different
outcomes. The sample space S is the set of all possible outcomes of the experiment. An event
is any subset A of the sample space such that A ⊂ S, where ⊂ denotes a subset.

Example:
If the experiment is ‘select a trading day at random and record the % change in the FTSE
100 index from the previous trading day’, then the outcome is the % change in the FTSE 100
index.

𝑆 = [−100, +∞) for the % change in the FTSE 100 index (in principle).

An event of interest might be 𝐴 = {𝑥|𝑥 > 0} – the event that the daily change is positive, i.e. the
FTSE 100 index gains value from the previous trading day. We would then denote the probability
of this event as:

P(A) = P(% daily change is positive)

What does probability mean? Probability theory tells us how to work with the probability
function and derive probabilities of events from it. However, it does not tell us what `probability'
really means.

We define probabilities to span the unit interval, i.e. [0, 1], such that for any event A we
have:
0 ≤ P(A) ≤1

At the extremes, an impossible event occurs with a probability of zero, and a certain event
occurs with a probability of one, hence P(S) = 1 by definition of the sample space. For any event
A, P(A) → 1 as the event becomes more likely, and P(A) → 0 as the event becomes less likely.
Therefore, the probability value is a quantified measure of how likely an event is to occur.

Figure 2

There are several alternative interpretations of the real-world meaning of “probability” in
this sense. One of them is outlined below. The mathematical theory of probability and calculations
on probabilities are the same whichever interpretation we assign to “probability”.

The frequency interpretation of probability states that the probability of an outcome A of an experiment is the proportion (relative frequency) of trials in which A would be the outcome if the experiment were repeated a very large number of times under similar conditions.

Example: How should we interpret the following, as statements about the real world of coins and
babies?
• ‘The probability that a tossed coin comes up heads is 0.5.' If we tossed a coin a large number of times, and the proportion of heads out of those tosses was 0.5, the ‘probability of heads' could be said to be 0.5, for that coin.
• ‘The probability is 0.51 that a child born in the Philippines today is a boy.' If the proportion of boys among a large number of live births was 0.51, the ‘probability of a boy' could be said to be 0.51.

How do we find probabilities? A key question is how to determine appropriate numerical values, P(A), for the probabilities of particular events. In practice we could determine probabilities using one of three methods:

• subjectively
• by experimentation (empirically)
• theoretically

Subjective estimates are employed when it is not feasible to conduct experimentation or use theoretical tools. For example, although:
0 ≤ P(World War III starts next year) ≤1
as it is a probability (so must be between 0 and 1, inclusive), what is the correct value which
should be attributed to this? Clearly, we must resort to subjective estimates taking into account
relevant geopolitical events etc. Of course, the probabilistic evaluation of such information is
highly subjective, hence different people would assess the chance of this event happening with
different probabilities. As such there is no ‘right’ answer! That said, you may wish to do some
research on the ‘Doomsday Clock’, which is an attempt to determine how close humanity is to a
global catastrophe.

Ignoring extreme events like a world war, the determination of probabilities is usually done
empirically, by observing actual realizations of the experiment and using them to estimate
probabilities. In the simplest cases, this basically applies the frequency definition to observed
data.

Example:
• If I toss a coin 10,000 times, and 5,050 of the tosses come up heads, it seems that, approximately, P(heads) = 0.5, for that coin.
• Of the 7,098,667 live births in England and Wales in the period 1999 - 2009, 51.26% were boys. So we could assign the value of about 0.51 to the probability of a boy in that population.

The estimation of probabilities of events from observed data is an important part of statistics!
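
As a minimal sketch of this empirical approach (assuming Python with NumPy; the seed is arbitrary and the 10,000 tosses mirror the coin example above), we can simulate tosses of a fair coin and use the observed relative frequency of heads as an estimate of P(heads):

import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate 10,000 tosses of a fair coin: 1 = heads, 0 = tails
tosses = rng.integers(0, 2, size=10_000)

# Relative frequency of heads as an estimate of P(heads)
print(tosses.mean())   # close to 0.5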

2.2 Simple Probability Distributions

One can view probability as a quantifiable measure of one's degree of belief in a particular
event, or set of interest. Let us consider two simple experiments.

Example
i. The toss of a (fair) coin: S = {H, T} where H and T denote “heads” and “tails”, respectively,
and are called the elements or members of the sample space.

ii. The score of a (fair) die: S = {1, 2, 3, 4, 5, 6}

So the coin toss sample space has two elementary outcomes, H and T, while the score
on a die has six elementary outcomes. These individual elementary outcomes are themselves
events, but we may wish to consider slightly more exciting events of interest. For example, for the
score on a die, we may be interested in the event of obtaining an even score, or a score greater
than 4 etc. Hence we proceed to define an event of interest.

Typically, we can denote events by letters for notational efficiency. For example, A = “an
even score” and B = “a score greater than 4”. Hence A = {2, 4, 6} and B = {5, 6}.

The universal convention is that we define probability to lie on a scale from 0 to 1 inclusive (multiplying by 100 expresses a probability as a percentage). Hence the probability of any event A, say,
is denoted by P(A) and is a real number somewhere in the unit interval, i.e. P(A) ∈ [0, 1], where
“∈ " means “is a member of”. Note the following:

• If A is an impossible event, then P(A) = 0.
• If A is a certain event, then P(A) = 1.
• For events A and B, if P(A) > P(B), then A is more likely to occur than B.

Therefore, we have a probability scale from 0 to 1 on which we are able to rank events,
as evident from the P(A) > P(B) result above. However, we need to consider how best to quantify
these probabilities theoretically (we have previously considered determining probabilities
subjectively and by experimentation). Let us begin with experiments where each elementary
outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfill this criterion
(conveniently).

Classical probability is a simple special case where values of probabilities can be found
by just counting outcomes. This requires that:
• the sample space contains only a finite number of outcomes, N
• all of the outcomes are equally probable (equally likely).

Standard illustrations of classical probability are devices used in games of chance:

• tossing a fair coin (heads or tails) one or more times
• rolling one or more fair dice (each scored 1, 2, 3, 4, 5 or 6)
• drawing one or more playing cards at random from a deck of 52 cards.

We will use these often, not because they are particularly important but because they provide
simple examples for illustrating various results in probability.

Suppose that the sample space S contains N equally likely outcomes, and that event A consists
of n ≤ N of these outcomes. We then have that:

P(A) = n / N = (number of outcomes in A) / (total number of outcomes in the sample space S)

That is, the probability of A is the proportion of outcomes which belong to A out of all possible
outcomes. In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible outcomes.

Example
i. For the coin toss, if A is the event “heads”, then N = 2 (H and T) and n = 1 (H). So, for a fair coin, P(A) = 1/2 = 0.5
ii. For the die score, if A is the event “an even score”, then N = 6 (1, 2, 3, 4, 5 and 6) and
n = 3 (2, 4 and 6). So, for a fair die, P(A) = 3/6 = 1/2 = 0.5
Finally, if B is the event “score greater than 4”, then N = 6 (as before) and n = 2 (5 and 6).
Hence P(B) = 2/6 = 1/3.

Example: Rolling two dice, what is the probability that the sum of the two scores is 5?
• Determine the sample space, which is the 36 equally likely ordered pairs (i, j), where i is the score on the first die and j the score on the second, for i, j = 1, 2, …, 6.
• Determine the outcomes in the event A = {(1, 4), (2, 3), (3, 2), (4, 1)}.
• Determine the probability to be P(A) = 4/36 = 1/9 (a counting sketch is given below).
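
To confirm the counting argument, here is a minimal sketch (assuming plain Python, with no external libraries) which enumerates all 36 equally likely ordered pairs and counts those summing to 5:

# Sample space: all 36 ordered pairs (score on first die, score on second die)
sample_space = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Event A: the sum of the two scores is 5
A = [pair for pair in sample_space if sum(pair) == 5]

print(A)                           # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(len(A) / len(sample_space))  # 4/36 = 0.111..., i.e. 1/9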

A random variable is a ‘mapping’ of the elementary outcomes in the sample space to real
numbers. This allows us to attach probabilities to the experimental outcomes. Hence the concept
of a random variable is that of a measurement which takes a particular value for each possible
trial (experiment). Frequently, this will be a numerical value.

Example:

1. Suppose we sample at random five people and measure their heights, hence ‘height’ is the
random variable and the five (observed) values of this random variable are the realized
measurements for the heights of these five people.

2. Suppose a fair die is thrown four times and we observe two 6’s, a 3 and a 1. The random
variable is the ‘score on the die’, and for these four trials it takes the values 6, 6, 3 and 1. (In this
case, since we do not know the true order in which the values occurred, we could also say that
the results were 1, 6, 3 and 6 or 1, 3, 6 and 6, or . . .)

An example of an experiment with non-numerical outcomes would be a coin toss, for which
recall S = {H, T}. We can use a random variable, X, to convert the sample space elements to real
numbers such as:
X = 1 if heads, and X = 0 if tails.

The value of any of the above variables will typically vary from sample to sample, hence
the name “random variable”. So each experimental random variable has a collection of possible
outcomes, and a numerical value associated with each outcome. We have already encountered
the term “sample space” which here is the set of all possible numerical values of the random
variable.

Examples of random variables include the number of heads in a sequence of coin tosses, the score on a rolled die, and the daily percentage change in the FTSE 100 index.

A natural question to ask is “what is the probability of any of these values?”. That is, we
are interested in the probability distribution of the experimental random variable. Be aware that
random variables come in two varieties – discrete and continuous.

• Discrete: Synonymous with ‘count data', that is, random variables which take non-negative integer values, such as 0, 1, 2, …. For example, the number of heads in n coin tosses.
• Continuous: Synonymous with ‘measured data' such as the real line, ℝ = (−∞, ∞), or some subset of ℝ, for example the unit interval [0, 1]. For example, the height of adults in centimeters.

The mathematical treatment of probability distributions depends on whether we are dealing with discrete or continuous random variables. We will tend to focus on discrete random
variables for much of this module. In most cases there will be a higher chance of the random
variable taking some sample space values relative to others. Our objective is to express these
chances using an associated probability distribution. In the discrete case, we can associate with
each ‘point’ in the sample space a probability which represents the chance of the random variable
being equal to that particular value. (The probability is typically non-zero, although sometimes we
need to use a probability of zero to identify impossible events.)

To summarize, a probability distribution is the complete set of sample space values with
their associated probabilities which must sum to 1 for discrete random variables. The probability
distribution can be represented diagrammatically by plotting the probabilities against sample
space values. Finally, before we proceed, let us spend a moment to briefly discuss some
important issues with regard to the notation associated with random variables. For notational
efficiency reasons, we often use a capital letter to represent the random variable. The letter X is

often adopted, but it is perfectly legitimate to use any other letter: Y, Z etc. In contrast, a lower
case letter denotes a particular value of the random variable.

Example: Let X = “the upper-faced score after rolling a fair die”. If the die results in a 3, then this is written as x = 3. The probability distribution of X is:

P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6 (and 0 otherwise).

This is an example of the (discrete) uniform distribution. For discrete random variables, we talk about a mass of probability at each respective sample space value. In the discrete uniform case this mass is the same, i.e. 1/6, and this can be plotted to show the probability distribution of X, as sketched below.
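
A minimal sketch of this discrete uniform distribution (assuming Python with matplotlib for the bar chart, which is optional and not prescribed by the text) tabulates and plots the equal mass of 1/6 at each score:

import matplotlib.pyplot as plt

scores = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6          # equal probability mass at each score

for x, p in zip(scores, probabilities):
    print(f"P(X = {x}) = {p:.4f}")

# Bar chart of the probability distribution of X
plt.bar(scores, probabilities)
plt.xlabel("Score x")
plt.ylabel("P(X = x)")
plt.title("Discrete uniform distribution for a fair die")
plt.show()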

2.3 Expectation of Random Variables

Certain important properties of distributions arise if we consider probability-weighted averages of random variables, and of functions of random variables. For example, we might want
to know the “average” value of a random variable. It would be foolish to simply take the arithmetic
average of all the values taken by the random variable, as this would mean that very unlikely
values (those with small probabilities of occurrence) would receive the same weighting as very
likely values (those with large probabilities of occurrence).

The obvious approach is to use the probability-weighted average of the sample space
values, known as the expected value of X. If x1, x2, …, xN are the possible values of the random variable X, with corresponding probabilities p(x1), p(x2), …, p(xN), then:

E(X) = μ = Σ_{i=1}^{N} xi p(xi) = x1 p(x1) + x2 p(x2) + ⋯ + xN p(xN)

Note that the expected value is also referred to as the population mean, which can be
written as E(X) (in words “the expectation of the random variable X”) or 𝜇 (in words “the
(population) mean of X”). So, for so-called ‘discrete’ random variables, E(X) is determined by
taking the product of each value of X and its corresponding probability, and then summing across
all values of X.

Example:

If the ‘random variable' X happens to be a constant, k, then x1 = k, and p(x1) = 1, so trivially E(X) = k × 1 = k. Of course, here X is not ‘random', but a constant, and hence its expectation is k as it can only ever take the value k!
Note: A function, f(X), of a random variable X is, of course, a new random variable, say Y = f(X).

Example: Let X represent the value shown when a fair die is thrown once.

Hence:

E(X) = 1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 21/6 = 3.5

We should view 3.5 as a long-run average since, clearly, the score from a single roll of a die can
never be 3.5, as it is not a member of the sample space. However, if we rolled the die a (very)
large number of times, then the average of all of these outcomes would be (approximately) 3.5.
For example, suppose we rolled the die 600 times and recorded the observed frequency of each score.

The average observed score is then the frequency-weighted average of the scores 1 to 6, which works out close to 3.5. So we see that in the long run the average score is approximately 3.5. Note a different 600 rolls
of the die might lead to a different set of frequencies. Although we might expect 100 occurrences
of each score of 1 to 6 (that is, taking a relative frequency interpretation of probability, as each
score occurs with a probability of 1/6 we would expect one sixth of the time to observe each score
of 1 to 6), it is unlikely we would observe exactly 100 occurrences of each score in practice.
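
Since the original frequency table is not reproduced here, the following is a minimal simulation sketch (assuming Python with NumPy; the seed is arbitrary) which rolls a fair die 600 times, tallies the frequencies, and computes the average observed score, which should be close to, but rarely exactly, 3.5:

import numpy as np

rng = np.random.default_rng(seed=2)

rolls = rng.integers(1, 7, size=600)      # 600 rolls of a fair die (scores 1 to 6)

scores, frequencies = np.unique(rolls, return_counts=True)
for s, f in zip(scores, frequencies):
    print(f"score {s}: {f} occurrences")  # roughly, but not exactly, 100 each

print(rolls.mean())                       # approximately 3.5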

Example: Recall the toss of a fair coin, where we define the random variable X such that:

X = 1 if heads, and X = 0 if tails.

Since the coin is fair, then P(X = 0) = P(X = 1) = 0.5, hence:

E(X) = 0 × 0.5 + 1 × 0.5 = 0.5

Here, viewed as a long-run average, E(X) = 0.5 can be interpreted as the proportion of heads in
the long run (and, of course, the proportion of tails too).

Example:

Let us consider the game of roulette, from the point of view of the casino (The House). Suppose
a player puts a bet of $1 on ‘red’. If the ball lands on any of the 18 red numbers, the player gets
that $1 back, plus another $1 from The House. If the result is one of the 18 black numbers or the
green 0, the player loses the $1 to The House. We assume that the roulette wheel is unbiased,
i.e. that all 37 numbers have equal probabilities. What can we say about the probabilities and
expected values of wins and losses? Define the random variable X = “money received by The
House”. Its possible values are – 1 (the player wins) and 1 (the player loses). The probability
function is:
P(X = x) = p(x) = 18/37 for x = −1, 19/37 for x = 1, and 0 otherwise,

where p(x) is a shortened version of P(X = x). Therefore, the expected value is:

E(X) = (−1 × 18/37) + (1 × 19/37) = 1/37 ≈ +0.027

On average, The House expects to win about 2.7 cents for every $1 which players bet on red. This expected
gain is known as the house edge. It is positive for all possible bets in roulette.
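
A minimal sketch of the house edge calculation, with a long-run check by simulation (assuming Python with NumPy; the probabilities are those of the single-zero wheel described above, and the seed is arbitrary):

import numpy as np

# Money received by The House on a $1 bet on red:
# -1 with probability 18/37 (player wins), +1 with probability 19/37 (player loses)
values = np.array([-1, 1])
probs = np.array([18 / 37, 19 / 37])

print(np.sum(values * probs))         # 1/37, approximately +0.027 per $1 bet

# Long-run check: simulate one million $1 bets on red
rng = np.random.default_rng(seed=3)
outcomes = rng.choice(values, size=1_000_000, p=probs)
print(outcomes.mean())                # close to +0.027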

The mean (expected value) E(X) of a probability distribution is analogous to the sample mean (average) X̄ of a sample distribution (introduced in Module 3). This is easiest to see when the sample space is finite.

Suppose the random variable X can have K different values x1, …, xK, and their frequencies in a random sample are f1, …, fK, respectively. Therefore, the sample mean of X is:

X̄ = (f1 x1 + ⋯ + fK xK) / (f1 + ⋯ + fK) = x1 p̂(x1) + ⋯ + xK p̂(xK) = Σ_{i=1}^{K} xi p̂(xi)

where:

p̂(xi) = fi / Σ_{i=1}^{K} fi

are the sample proportions of the values xi. The expected value of the random variable X is:

E(X) = x1 p(x1) + ⋯ + xK p(xK) = Σ_{i=1}^{K} xi p(xi)

So 𝑋̅ uses the sample proportions, 𝑝̂ (𝑥𝑖 ), whereas 𝐸(𝑋) uses the population probabilities, 𝑝(𝑥𝑖 ).
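
A minimal sketch of this correspondence for the score on a fair die (assuming Python with NumPy; the sample frequencies below are hypothetical illustrations, not taken from the text):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# Population probabilities p(x_i) give the expected value E(X)
p = np.full(6, 1 / 6)
print(np.sum(x * p))              # 3.5

# Hypothetical sample frequencies f_i give sample proportions and the sample mean
f = np.array([96, 108, 101, 95, 103, 97])   # illustrative counts from 600 rolls
p_hat = f / f.sum()
print(np.sum(x * p_hat))          # close to, but not exactly, 3.5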

2.4 Bayesian Updating

Bayesian updating is the act of updating your (probabilistic) beliefs in light of new
information. Formally named after Thomas Bayes (1701 – 1761), for two events A and B, the
simplest form of Bayes' theorem is:

P(A|B) = P(B|A) P(A) / P(B)

Example: Consider the probability distribution of the score on a fair die.

Suppose we define the event A to be “roll a 6”. Unconditionally, i.e. a priori (before we receive
any additional information), we have:

P(A) = P(X = 6) = 1/6

Now let us suppose we are told that the event:

B = even score = {2, 4, 6}

has occurred (where P(B) = 1/2), which means we can effectively revise our sample space, S*,
by eliminating 1, 3 and 5 (the odd scores), such that:

S* = {2, 4, 6}

So now the revised sample space contains three equally likely outcomes (instead of the original
six), so the Bayesian updated probability (known as a conditional probability or a posteriori
probability) is:
P(A|B) = 1/3

where "|" can be read as “given”, hence A|B means “A given B”. Deriving this result formally using
Bayes' theorem, we already have P(A) = 1/6 and also P(B) = 1/2, so we just need P(B|A), which
is the probability of an even score given a score of 6. Since 6 is an even score, P(B|A)= 1. Hence:

P(A|B) = P(B|A) P(A) / P(B) = (1 × 1/6) / (1/2) = 2/6 = 1/3

Suppose instead we consider the case where we are told that an odd score was obtained. Since
even scores and odd scores are mutually exclusive (they cannot occur simultaneously) and
collectively exhaustive (a die score must be even or odd), then we can view this as the
complementary event, denoted Bc, such that:

Bc = odd score = {1, 3, 5} and P(Bc) = 1 – P(B) = 1/2

So, given an odd score, what is the conditional probability of obtaining a 6? Intuitively, this is zero
(an impossible event), and we can verify this with Bayes' theorem:

P(A|Bᶜ) = P(Bᶜ|A) P(A) / P(Bᶜ) = (0 × 1/6) / (1/2) = 0

where, clearly, we have P(Bᶜ|A) = 0 (since 6 is an even, not odd, score, so it is impossible to
obtain an odd score given the score is 6).

Example:

Suppose that 1 in 10,000 people (0.01%) has a particular disease. A diagnostic test for the
disease has 99% sensitivity (if a person has the disease, the test will give a positive result with a
probability of 0.99). The test has 99% specificity (if a person does not have the disease, the test
will give a negative result with a probability of 0.99).

Solution:

Let B denote the presence of the disease, and Bc denote no disease. Let A denote a positive test
result. We want to calculate P(A).

The probabilities we need are P(B) = 0.0001, P(Bᶜ) = 0.9999, P(A|B) = 0.99 and also P(A|Bᶜ) = 0.01, and hence:

P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.010098

We want to calculate P(B|A), i.e. the probability that a person has the disease, given that the
person has received a positive test result.

The probabilities we need are:

P(B) = 0.0001
P(Bᶜ) = 0.9999
P(A|B) = 0.99
P(A|Bᶜ) = 0.01

and so

P(B|A) = P(A|B) P(B) / [P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)] = (0.99 × 0.0001) / 0.010098 ≈ 0.0098

Why is this so small? The reason is because most people do not have the disease and the test has a small, but non-zero, false positive rate of P(A|Bᶜ). Therefore, most positive test results are
actually false positives.
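
A minimal sketch of this calculation (assuming plain Python; the variable names are illustrative), using the prevalence, sensitivity and specificity stated above:

prevalence = 0.0001          # P(B): probability a person has the disease
sensitivity = 0.99           # P(A|B): positive test given disease
false_positive_rate = 0.01   # P(A|B^c): positive test given no disease

# Total probability of a positive test result
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
print(p_positive)            # 0.010098

# Bayes' theorem: probability of disease given a positive test result
print(sensitivity * prevalence / p_positive)   # approximately 0.0098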

In order to revisit the ‘Monty Hall’ problem, we require a more general form of Bayes' theorem,
which we note as follows. For a general partition (partition is the division of the sample space
into mutually exclusive and collectively exhaustive events) of the sample space S into B1,B2,…,Bn,
and for some event A, then:

P(Bk|A) = P(A|Bk) P(Bk) / Σ_{i=1}^{n} P(A|Bi) P(Bi)

Example:

You are taking part in a gameshow. The host of the show, who is known as Monty, shows you
three outwardly identical doors. Behind one of them is a prize (a sports car), and behind the other
two are goats. You are asked to select, but not open, one of the doors. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining doors. He always opens a
door he knows will reveal a goat, and randomly chooses which door to open when he has more
than one option (which happens when your initial choice contains the prize). After revealing a
goat, Monty gives you the choice of either switching to the other unopened door or sticking with
your original choice. You then receive whatever is behind the door you choose. What should you
do, assuming you want to win the prize?

Suppose the three doors are labelled A, B and C. Let us define the following events.
• DA, DB, DC: the prize is behind Door A, B and C, respectively.
• MA, MB, MC: Monty opens Door A, B and C, respectively.

Suppose you choose Door A first, and then Monty opens Door B (the answer works the same
way for all combinations of these). So Doors A and C remain unopened. What we want to know
now are the conditional probabilities P(DA |MB ) and P(DC |MB ).

You should switch doors if P(DC |MB ) > P(DA |MB ), and stick with your original choice otherwise.
(You would be indifferent about switching if it was the case that P(DC |MB ) = P(DA |MB ) .)

Suppose that you first choose Door A, and then Monty opens Door B. Bayes' theorem tells us
that:

P(DC|MB) = P(MB|DC) P(DC) / [P(MB|DA)P(DA) + P(MB|DB)P(DB) + P(MB|DC)P(DC)]

We can assign values to each of these.


• The prize is initially equally likely to be behind any of the doors. Therefore, we have P(DA) = P(DB) = P(DC) = 1/3.
• If the prize is behind Door A (which you choose), Monty chooses at random between the two remaining doors, i.e. Doors B and C. Hence, P(MB|DA) = 1/2.
• If the prize is behind one of the two doors you did not choose, Monty cannot open that door, and must open the other one. Hence P(MB|DC) = 1 and P(MB|DB) = 0.

Putting these probabilities into the formula gives:

P(DC|MB) = (1 × 1/3) / (1/2 × 1/3 + 0 × 1/3 + 1 × 1/3) = (1/3) / (1/2) = 2/3

And hence, P(DA |MB ) = 1 − P(DC |MB ) = 1/3 [because also P(MB |DB ) = 0 and so P(DB |MB ) = 0]

The same calculation applies to every combination of your first choice and Monty's choice.
Therefore, you will always double your probability of winning the prize if you switch from your
original choice to the door that Monty did not open. The Monty Hall problem has been called a
cognitive illusion, because something about it seems to mislead most people's intuition. In
experiments, around 85% of people tend to get the answer wrong at first. The most common
incorrect response is that the probabilities of the remaining doors after Monty's choice are both
1/2, so that you should not (or rather need not) switch.

This is typically based on ‘no new information’ reasoning. Since we know in advance that Monty
will open one door with a goat behind it, the fact that he does so appears to tell us nothing new
and should not cause us to favor either of the two remaining doors – hence a probability of 1/2 for
each (people see only two possible doors after Monty's action and implicitly apply classical
probability by assuming each door is equally likely to reveal the prize).

It is true that Monty's choice tells you nothing new about the probability of your original choice,
which remains at 1/3. However, it tells us a lot about the other two doors. First, it tells us everything
about the door he chose, namely that it does not contain the prize. Second, all of the probability
of that door gets ‘inherited’ by the door neither you nor Monty chose, which now has the probability
2/3.

So, the moral of the story is to switch! Note here we are using updated probabilities to form a
strategy – it is sensible to ‘play to the probabilities’ and choose as your course of action that which
gives you the greatest chance of success (in this case you double your chance of winning by
switching door). Of course, just because you pursue a course of action with the most likely chance
of success does not guarantee you success.

If you play the Monty Hall problem (and let us assume you switch to the unopened door), you can
expect to win with a probability of 2/3, i.e. you would win 2/3 of the time on average. In any single
play of the game, you are either lucky or unlucky in winning the prize. So you may switch and end
up losing (and then think you applied the wrong strategy – hindsight is a wonderful thing!) but in
the long run you can expect to win twice as often as you lose, such that in the long run you are
better off by switching!

If you feel like playing the Monty Hall game again, I recommend visiting:

http://www.math.ucsd.edu/~crypto/Monty/monty.html.

In particular, note how at the end of the game it shows the percentage of winners based on
multiple participants’ results. Taking the view that in the long run you should win approximately
2/3 of the time from switching door, and approximately 1/3 of the time by not switching, observe
how the percentages of winners tend to 66.7% and 33.3%, respectively, based on a large sample
size. Indeed, when we touch on statistical inference later in the module, it is emphasized that as
the sample size increases we tend to get a more representative (random) sample of the
population. Here, this equates to the sample proportions of wins converging to their theoretical
probabilities. Note also the site has an alternative version of the game where Monty does not
know where the sports car is.
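
A minimal simulation sketch of the game (assuming Python; the door labels and function name are illustrative) which estimates the winning probabilities of the ‘switch' and ‘stick' strategies over many plays:

import random

def play_monty_hall(switch):
    doors = ["A", "B", "C"]
    prize = random.choice(doors)
    first_choice = random.choice(doors)

    # Monty opens a door that is neither the player's choice nor the prize,
    # choosing at random when more than one such door is available
    opened = random.choice([d for d in doors if d not in (first_choice, prize)])

    # Either switch to the one remaining unopened door, or stick with the first choice
    if switch:
        final_choice = [d for d in doors if d not in (first_choice, opened)][0]
    else:
        final_choice = first_choice
    return final_choice == prize

n = 100_000
print(sum(play_monty_hall(True) for _ in range(n)) / n)    # approximately 2/3 (switching)
print(sum(play_monty_hall(False) for _ in range(n)) / n)   # approximately 1/3 (sticking)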

2.5 Parameters

Probability distributions may differ from each other in a broader or narrower sense. In the broader
sense, we have different families of distributions which may have quite different characteristics, for example:
• discrete distributions versus continuous distributions
• among discrete distributions: a finite versus an infinite number of possible values
• among continuous distributions: different sets of possible values (for example, all real numbers x, x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.

These ‘distributions’ are really families of distributions in this sense.

In the narrower sense, individual distributions within a family differ in having different values of
the parameters of the distribution. The parameters determine the mean and variance of the
distribution, values of probabilities from it etc.

In the statistical analysis of a random variable X we typically:


• select a family of distributions based on the basic characteristics of X
• use observed data to choose (estimate) values for the parameters of that distribution, and perform statistical inference on them.

Example:

An opinion poll on a referendum, where each Xi is an answer to the question “Will you vote ‘Yes' or ‘No' to joining/leaving the European Union?” has answers recorded as Xi = 0 if ‘No' and Xi = 1 if ‘Yes'. In a poll of 950 people, 513 answered ‘Yes'.

How do we choose a distribution to represent Xi?

• Here we need a family of discrete distributions with only two possible values (0 and 1). The Bernoulli distribution (discussed below), which has one parameter π (the probability that Xi = 1), is appropriate.
• Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54 (where π̂ denotes an estimate of the parameter π).

A Bernoulli trial is an experiment with only two possible outcomes. We will number these
outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively. Note these are
notional successes and failures – the success does not necessarily have to be a ‘good’ outcome,
nor a failure a ‘bad’ outcome!

Examples of outcomes of Bernoulli trials are:


• agree / disagree
• pass a test / fail a test
• employed / unemployed
• owns a car / does not own a car
• business goes bankrupt / business continues trading.

The Bernoulli distribution is the distribution of the outcome of a single Bernoulli trial, named
after Jacob Bernoulli (1654 – 1705). This is the distribution of a random variable X with the
following probability function (A probability function is simply a function which returns the
probability of a particular value of X.):

p(x) = P(X = x) = π^x (1 − π)^(1−x) for x = 0, 1.

Therefore:

P(X = 1) = π

and:

P(X = 0) = 1 − π

and no other values are possible. We could express this family of Bernoulli distributions in tabular form as follows:

x        0        1
p(x)     1 − π    π

where 0 ≤ 𝜋 ≤ 1 is the probability of ‘success’. Note that just as a sample space represents all
possible values of a random variable, a parameter space represents all possible values of a
parameter. Clearly, as a probability, we must have that 0 ≤ 𝜋 ≤ 1.

Such a random variable X has a Bernoulli distribution with (probability) parameter π. This is often written as:

X ~ Bernoulli(π)

If X ~ Bernoulli(π), then we can determine its expected value, i.e. its mean, as the usual probability-weighted average:

E(X) = 0 × (1 − π) + 1 × π = π

Hence we can view 𝜋 as the long-run average (proportion) of successes if we were to draw a
large random sample from this distribution.

Different members of this family of distributions differ in terms of the value of 𝜋.

Example:

Consider the toss of a fair coin, where X = 1 denotes ‘heads' and X = 0 denotes ‘tails'. As this is a fair coin, heads and tails are equally likely and hence π = 0.5, leading to the specific Bernoulli distribution with P(X = 1) = 0.5 and P(X = 0) = 0.5. Hence:

E(X) = 0 × 0.5 + 1 × 0.5 = 0.5

such that if we tossed a fair coin a large number of times, we would expect the proportion of heads
to be 0.5 (and in practice the long-run proportion of heads would be approximately 0.5).
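
A minimal sketch (assuming Python with NumPy; the seed and sample size are arbitrary) which draws a large random sample from this Bernoulli(0.5) distribution and checks that the sample proportion of 1s (heads) is close to π = 0.5:

import numpy as np

rng = np.random.default_rng(seed=4)
pi = 0.5                                   # probability of 'success' (heads)

# A Bernoulli(pi) draw is a binomial draw with n = 1
sample = rng.binomial(n=1, p=pi, size=100_000)

print(sample.mean())                       # approximately 0.5, the long-run proportion of heads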

2.6 The Distribution Zoo

Suppose we carry out n Bernoulli trials such that:


• at each trial, the probability of success is π
• different trials are statistically independent events.

Let X denote the total number of successes in these n trials, then X follows a binomial distribution
with parameters n and 𝜋, where n ≥ 1 is a known integer and 0 ≤ 𝜋 ≤ 1. This is often written as:

X ~ Bin(n, 𝜋)
If X ~ Bin(n, 𝜋), then:
E(X) = n 𝜋

Example:

A multiple choice test has 4 questions, each with 4 possible answers. James is taking the test,
but has no idea at all about the correct answers. So he guesses every answer and, therefore, has
the probability of 1/4 of getting any individual question correct.

Let X denote the number of correct answers in James’ test. X follows the binomial distribution with
n = 4 and 𝜋 = 0.25, i.e. we have:

X ~ Bin(4, 0.25)

For example, what is the probability that James gets 3 of the 4 questions correct?

Here it is assumed that the guesses are independent, and each has the probability 𝜋 = 0.25 of
being correct.

The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110, is π³(1 − π)¹, where ‘1' denotes a correct answer and ‘0' denotes an incorrect answer.

However, we do not care about the order of the 0s and 1s, only about the number of 1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these also has the probability of π³(1 − π)¹.

The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is C(4, 3) = 4 (see below). Therefore, the probability of obtaining three 1s is:

C(4, 3) × π³(1 − π)¹ = 4 × (0.25)³ × (0.75)¹ ≈ 0.0469
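
A minimal sketch checking this calculation (assuming Python with SciPy, whose scipy.stats.binom object implements the binomial distribution):

from scipy.stats import binom

n, pi = 4, 0.25                  # 4 questions, probability 1/4 of guessing each one correctly

# Probability that James gets exactly 3 of the 4 questions correct
print(binom.pmf(3, n, pi))       # approximately 0.0469

# The full probability distribution of the number of correct answers
for x in range(n + 1):
    print(x, binom.pmf(x, n, pi))

# Expected number of correct answers, E(X) = n * pi
print(binom.mean(n, pi))         # 1.0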

In general, the probability function of X ~ Bin(n, π) is:

p(x) = C(n, x) π^x (1 − π)^(n−x) for x = 0, 1, …, n

where C(n, x) is the binomial coefficient – in short, the number of ways of choosing x objects out of n when sampling without replacement when the order of the objects does not matter. C(n, x) can be calculated as:

C(n, x) = n! / (x! (n − x)!)

Poisson Distribution

The Poisson distribution is a discrete distribution used to model the number of events occurring over a fixed interval (of time or space) when events occur at a constant average rate λ and independently of one another. If X ~ Poisson(λ), its probability function is p(x) = e^(−λ) λ^x / x! for x = 0, 1, 2, …, and E(X) = λ.

Connections between Probability Distributions

Poisson Approximation of the Binomial Distribution

When n is large and π is small, the Bin(n, π) distribution is well approximated by the Poisson(nπ) distribution, which is often easier to work with.

Example: Bortkiewicz's horses

The classic illustration is Bortkiewicz's data on deaths by horse kick in Prussian cavalry corps, whose observed frequencies closely follow a Poisson distribution.
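
As a minimal numerical sketch of these ideas (assuming Python with SciPy; the rate λ = 0.61 roughly matches the horse-kick example, and n = 1,000 with π = 0.002 are illustrative values), we can evaluate the Poisson probability function and compare binomial probabilities with their Poisson approximation:

from scipy.stats import binom, poisson

# Poisson probability function: p(x) = exp(-lam) * lam**x / x!
lam = 0.61                       # illustrative rate, e.g. deaths per corps-year
for x in range(5):
    print(x, poisson.pmf(x, lam))

# Poisson approximation of the binomial: for large n and small pi,
# Bin(n, pi) is approximately Poisson(n * pi)
n, pi = 1000, 0.002
for x in range(6):
    print(x, binom.pmf(x, n, pi), poisson.pmf(x, n * pi))   # the two values are close for each x
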
ACTIVITIES/ASSESSMENT:

Direction: Choose the letter of the correct answer.

1. An event is:
a. any subset A of the sample space
b. a probability
c. a function which assigns probabilities to events

2. Probability…
a. is always some value in the unit interval [0, 1]
b. can exceed 1
c. can be negative

3. Probabilities of geopolitical events are estimated:


a. subjectively
b. by experimentation
c. theoretically

4. When there are N equally likely outcomes in a sample space, where n < N of them belong to some event A, then:
a. P(A) = n/N
b. P(A) = N/n
c. P(A) = 1

5. The expectation of a (discrete) random variable X:


a. is a probability-weighted average
b. is a non-probability-weighted average
c. must be equal to a possible value of X

6. If X represents the score on a fair die, then E(X) is:


a. 2.5
b. 3.5
c. 4.5

7. For the family of Bernoulli distributions, π is a:


a. variable
b. sample space
c. parameter

8. For any Bernoulli random variable, P(X=0) is:


a. 0
b. 𝜋
c. 1 − 𝜋

9. To apply the binomial distribution, there must be a constant probability of success.


a. True
b. False

MODULE 3 – DESCRIPTIVE STATISTICS

OVERVIEW:
Descriptive statistics are a simple, yet powerful, tool for data reduction and summarization.
Many of the questions for which people use statistics to help them understand and make decisions
involve types of variables which can be measured. Obvious examples include height, weight,
temperature, lifespan, rate of inflation and so on. When we are dealing with such a variable – for
which there is a generally recognized method of determining its value – we say that it is a
measurable variable. The numbers which we then obtain come ready-equipped with an order relation, i.e. we can always tell if two measurements are equal (to the available accuracy) or if
one is greater or less than the other.

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Explain the different levels of measurement of variables.
2. Explain the importance of data visualization and descriptive statistics.
3. Compute common descriptive statistics for measurable variables.

COURSE MATERIALS:
3.1 Classifying Variables

Data are obtained on any desired variable. For most of this module, we will be dealing with
variables which can be partitioned into two types.

1. Discrete data – things you can count. Examples include the number of passengers on a flight
and the number of telephone calls received each day in a call center. Observed values for these
will be 0, 1, 2, . . . (i.e. non-negative integers).

2. Continuous data – things you can measure. Examples include height, weight and time which
can be measured to several decimal places.

Of course, before we do any sort of statistical analysis, we need to collect data. Module 4 will
discuss a range of different techniques which can be employed to obtain a sample. For now, we
just consider some examples of situations where data might be collected.

 A pre-election opinion poll asks 1,000 people about their voting intentions.
 A market research survey asks how many hours of television people watch per week.
 A census interviewer asks each householder how many of their children are receiving full-
time education.

Categorical vs. Measurable Variables

A polling organization might be asked to determine whether, say, the political preferences of
voters were in some way linked to their job type – for example, do supporters of Party X tend to
be blue-collar workers? Other market research organizations might be employed to determine

whether or not users were satisfied with the service which they obtained from a commercial
organization (a restaurant, say) or a department of local or central government (housing
departments being one important instance).

This means that we are concerned, from time to time, with categorical variables in addition to
measurable variables. So we can count the frequencies with which an item belongs to particular
categories. Examples include:

(a) the total number of blue-collar workers (in a sample)


(b) the total number of Party X supporters (in a sample)
(c) the number of blue-collar workers who support Party X
(d) the number of Party X supporters who are blue-collar workers
(e) the number of diners at a restaurant who were dissatisfied/indifferent/satisfied with the service.

In cases (a) and (b) we are doing simple counts, within a sample, of a single category, while in
cases (c) and (d) we are looking at some kind of cross-tabulation between variables in two
categories: worker type vs. political preference in (c), and political preference vs. worker type in
(d) (they are not the same!).

There is no unambiguous and generally agreed way of putting worker types in order (in the way that we can certainly say that 1 < 2). It is similarly impossible to rank (as the technical term has it) many other categories of interest. For instance, in combating discrimination, organizations might want to look at the effects of gender, religion, nationality, sexual orientation, disability etc., but the whole point of combating discrimination is that the different ‘varieties’ within each category cannot be ranked.

In case (e), by contrast, there is a clear ranking – the restaurant would be pleased if there were
lots of people who expressed themselves satisfied rather than dissatisfied. Such considerations
lead us to distinguish two main types of variable, the second of which is itself subdivided.

 Measurable variables are those where there is a generally recognized method of measuring the value of the variable of interest.
 Categorical variables are those where no such method exists (or, often enough, is not even possible), but among which:
- some examples of categorical variables can be put in some sensible order (case (e)),
and hence are called ordinal (categorical) variables
- some examples of categorical variables cannot be put in any sensible order, but are
only known by their name, and hence are called nominal (categorical) variables.

Nominal Categorical Variables

For a nominal variable (like gender), the numbers (values) serve only as labels or tags for
identifying and classifying cases. When used for identification, there is a strict one-to-one
correspondence between the numbers and the cases. For example, your passport or driving
license number uniquely identifies you.

Any numerical values do not reflect the amount of the characteristic possessed by the cases.
Counting is the only arithmetic operation on values measured on a nominal scale, and hence only
a very limited number of statistics, all of which are based on frequency counts, can be determined.

Ordinal Categorical Variables

An ordinal variable has a ranking scale in which numbers are assigned to cases to indicate the
relative extent to which the cases possess some characteristic. It is possible to determine if a
case has more or less of a characteristic than some other case, but not how much more or less.

Any series of numbers can be assigned which preserves the ordered relationships between the
cases. In addition to the counting operation possible with nominal variables, ordinal variables
permit the use of statistics based on centiles such as percentiles, quartiles and the median.

Interval Measurable Variables

Interval-level variables have scales where numerically equal distances on the scale represent
equal value differences in the characteristic being measured. For example, if the temperatures on
three days were 0, 10 and 20 degrees, then there is a constant 10-degree differential between 0
and 10, and 10 and 20. This allows comparisons of differences between values. The location of
the zero point is not fixed – both the zero point and the units of measurement are arbitrary. For
example, temperature can be measured in different (arbitrary) units, such as degrees Celsius and
degrees Fahrenheit.

Any positive linear transformation of the form y = a + bx will preserve the properties of the scale,
hence it is not meaningful to take ratios of scale values. Statistical techniques which may be used
include all of those which can be applied to nominal and ordinal variables. In addition statistics
such as the mean and standard deviation are applicable.

Ratio Measurable Variables

Ratio-level variables possess all the properties of nominal, ordinal and interval variables. A ratio
variable has an absolute zero point and it is meaningful to compute ratios of scale values. Only
proportionate transformations of the form y = bx, where b is a positive constant, are allowed. All
statistical techniques can be applied to ratio data.

Example:
Consider the following three variables describing different characteristics of countries. Later, we
consider a sample of 155 countries in 2002 for these variables.
 Region of the country. This is a nominal variable which could be coded (in alphabetical
order) as follows: 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern
America, 6 = Oceania.
 The level of democracy, i.e. a democracy index, in the country. This could be an 11-
point ordinal scale from 0 (lowest level of democracy) to 10 (highest level of
democracy).
 Gross domestic product per capita (GDP per capita) (i.e. per person, in $000s) which
is a ratio scale.

Region and the level of democracy are discrete, with the possible values of 1, 2, …, 6, and 0, 1,
2, …, 10, respectively. GDP per capita is continuous, taking any non-negative value.

Many discrete variables have only a finite number of possible values. The region variable has 6
possible values, and the level of democracy has 11 possible values.

The simplest possibility is a binary, or dichotomous, variable, with just two possible values. For
example, a person's gender could be recorded as 1 = female and 2 = male. A discrete variable
can also have an unlimited number of possible values. For example, the number of visitors to a
website in a day: 0, 1, 2, 3, 4,…

The levels of democracy have a meaningful ordering, from less democratic to more democratic
countries. The numbers assigned to the different levels must also be in this order, i.e. a larger
number = more democratic.

In contrast, different regions (Africa, Asia, Europe, Latin America, Northern America and Oceania)
do not have such an ordering. The numbers used for the region variable are just labels for different
regions. A different numbering (such as 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 =
Northern America and 4 = Oceania) would be just as acceptable as the one we originally used.

3.2 Data Visualization

Statistical analysis may have two broad aims.

1. Descriptive statistics – summarize the data which were collected, in order to make them more
understandable.
2. Statistical inference – use the observed data to draw conclusions about some broader
population.

Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential first step.

Data do not just speak for themselves. There are usually simply too many numbers to make sense
of just by staring at them. Descriptive statistics attempt to summarize some key features of the
data to make them understandable and easy to communicate. These summaries may be
graphical or numerical (tables or individual summary statistics).

The statistical data in a sample are typically stored in a data matrix:

Rows of the data matrix correspond to different units (subjects/observations). Here, each unit is
a country.

The number of units in a dataset is the sample size, typically denoted by n. Here, n = 155
countries.

Columns of the data matrix correspond to variables, i.e. different characteristics of the units. Here,
region, the level of democracy, and GDP per capita are the variables.

Sample distribution

The sample distribution of a variable consists of:


 a list of the values of the variable which are observed in the sample
 the number of times each value occurs (the counts or frequencies of the observed values).

When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.

The observations of the region variable in the sample are:

We may construct a frequency table for the region variable as follows:

Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the sample. This
is a measure of proportion (that is, relative frequency).

Similarly, for the level of democracy, the frequency table is:

‘Cumulative %’ for a value of the variable is the sum of the percentages for that value and all
lower-numbered values.

A bar chart is the graphical equivalent of the table of frequencies. The next figure displays the
region variable data as a bar chart. The relative frequencies of each region are clearly visible.
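The frequency table and bar chart themselves are produced with statistical software. As a sketch of how this could be done in Python with pandas and matplotlib (the region codes below are hypothetical, not the actual 155-country data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical region codes (1 = Africa, ..., 6 = Oceania), for illustration only
region = pd.Series([1, 1, 2, 3, 3, 3, 4, 5, 6, 2, 1, 3])

freq = region.value_counts().sort_index()      # frequencies of each region code
pct = (100 * freq / freq.sum()).round(1)       # percentages (relative frequencies)
print(pd.DataFrame({"Frequency": freq, "%": pct}))

freq.plot(kind="bar")                          # bar chart of the frequencies
plt.xlabel("Region code")
plt.ylabel("Frequency")
plt.show()
```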

If a variable has many distinct values, listing frequencies of all of them is not very practical. A
solution is to group the values into non-overlapping intervals, and produce a table or graph of the
frequencies within the intervals.

The most common graph used for this is a histogram. A histogram is like a bar chart, but without
gaps between bars, and often uses more bars (intervals of values) than is sensible in a table.
Histograms are usually drawn using statistical software, such as Minitab, R or SPSS.

You can let the software choose the intervals and the number of bars. A table of frequencies for
GDP per capita where values have been grouped into non-overlapping intervals is shown below.

The next figure shows a histogram of GDP per capita with a greater number of intervals to better
display the sample distribution.

Associations Between Two Variables

So far, we have tried to summarize (some aspect of) the sample distribution of one variable at a
time. However, we can also look at two (or more) variables together. The key question is then
whether some values of one variable tend to occur frequently together with particular values of
another, for example high values with high values. This would be an example of an association

between the variables. Such associations are central to most interesting research questions, so
you will hear much more about them in the next topics.

Some common methods of descriptive statistics for two-variable associations are introduced here,
but only very briefly now and mainly through examples.

The best way to summarize two variables together depends on whether the variables have ‘few’
or ‘many’ possible values. We illustrate one method for each combination, as listed below.

 ‘Many’ versus ‘many’: scatterplots.


 ‘Few’ versus ‘many’: side-by-side boxplots.
 ‘Few’ versus ‘few’: two-way contingency tables (cross-tabulations).

A scatterplot shows the values of two measurable variables against each other, plotted as points
in a two-dimensional coordinate system.

Example: A plot of data for 164 countries is shown below, using the following variables.

 On the horizontal axis (the x-axis): a World Bank measure of ‘control of corruption’, where
high values indicate low levels of corruption.
 On the vertical axis (the y-axis): GDP per capita in $.

Interpretation: it appears that virtually all countries with high levels of corruption have relatively
low GDP per capita. At lower levels of corruption there is a positive association, where countries
with very low levels of corruption also tend to have high GDP per capita.

Boxplots are useful for comparisons of how the distribution of a measurable variable varies
across different groups, i.e. across different levels of a categorical variable.

The figure below shows side-by-side boxplots of GDP per capita for the different regions.

 GDP per capita in African countries tends to be very low. There is a handful of countries
with somewhat higher GDPs per capita (shown as outliers in the plot).

 The median for Asia is not much higher than for Africa. However, the distribution in Asia
is very much skewed to the right, with a tail of countries with very high GDPs per capita.

 The median in Europe is high, and the distribution is fairly symmetric.

 The boxplots for Northern America and Oceania are not very useful, because they are
based on very few countries (two and three countries, respectively).

A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each
possible combination of the values of two categorical variables. Such tables often show the
percentages within each row or column of the table.

Example: The table below reports the results from a survey of 972 private investors. The variables
are as follows.

 Row variable: age as a categorical, grouped variable (four categories).


 Column variable: how much importance the respondent places on short-term gains from
his/her investments (four levels).

Numbers in parentheses are percentages within the rows. For example, 25.3 = (37/146) × 100.

Interpretation: look at the row percentages. For example, 17.8% of those aged under 45, but only
5.2% of those aged 65 and over, think that short-term gains are ‘very important’. Among the
respondents, the older age groups seem to be less concerned with quick profits than the younger
age groups.
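As a sketch, row percentages of a contingency table can be computed in one line with pandas; the counts below are hypothetical, not the actual survey data:

```python
import pandas as pd

# Hypothetical counts (illustrative only, not the 972-investor survey)
counts = pd.DataFrame(
    {"Very important": [20, 15, 10, 5],
     "Fairly important": [30, 25, 20, 15],
     "Not very important": [25, 30, 35, 40],
     "Not at all important": [25, 30, 35, 40]},
    index=["Under 45", "45-54", "55-64", "65 and over"])

# Divide each cell by its row total and multiply by 100
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(1))
```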

3.3 Descriptive Statistics – Measures of Central Tendency

Frequency tables, bar charts and histograms aim to summarize the whole sample distribution of
a variable. Next we consider descriptive statistics which summarize (describe) one feature of the
sample distribution in a single number: summary (descriptive) statistics.

We begin with measures of central tendency. These answer the question: where is the ‘center’ or
‘average’ of the distribution?

We consider the following measures of central tendency:


 mean (i.e. the average, sample mean or arithmetic mean)
 median
 mode.

Notation for Variables

In formulae, a generic variable is denoted by a single letter – in this module, usually X. However,
any other letter (Y, W etc.) could also be used, as long as it is used consistently. A letter with a
subscript denotes a single observation of a variable.

We use Xi to denote the value of X for unit i, where i can take values 1, 2, 3, … , n, and n is the
sample size. Therefore, the n observations of X in the dataset (the sample) are X1, X2, X3, … ,Xn.
These can also be written as Xi, for i = 1, … , n.

The Sample Mean

The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common measure of
central tendency.

Why is the mean a good summary of the central tendency?

Consider the following small dataset:

The Sample Median

Let X(1), X(2), …, X(n) denote the sample values of X when ordered from the smallest to the largest,
known as the order statistics, such that:

 X(1) is the smallest observed value (the minimum) of X

 X(n) is the largest observed value (the maximum) of X.

The (sample) median, q50, of a variable X is the value which is ‘in the middle’ of the ordered
sample.

If n is odd, then q50 = X((n+1)/2). If n is even, then q50 is usually taken to be the average of the two middle values, X(n/2) and X(n/2+1).

For our country data n = 155, so q50 = X(78). From a table of frequencies, the median is the value
for which the cumulative percentage first reaches 50% (or, if a cumulative % is exactly 50%, the
average of the corresponding value of X and the next highest value).

The ordered values of the level of democracy are:

For the level of democracy, the median is 6.

The median can be determined from the frequency table of the level of democracy:

Sensitivity to Outliers

For the following small ordered dataset, the mean and median are both 4:

1, 2, 4, 5, 8

Suppose we add one observation to get the ordered sample:

1, 2, 4, 5, 8, 100

The median is now 4.5, and the mean is 20.
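These calculations can be checked directly, for example with Python's built-in statistics module:

```python
import statistics

data = [1, 2, 4, 5, 8]
print(statistics.mean(data), statistics.median(data))   # 4 and 4

data_with_outlier = [1, 2, 4, 5, 8, 100]
print(statistics.mean(data_with_outlier),               # mean jumps to 20
      statistics.median(data_with_outlier))             # median moves only to 4.5
```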

In general, the mean is affected much more than the median by outliers, i.e. unusually small or
large observations. Therefore, you should identify outliers early on and investigate them –
perhaps there has been a data entry error, which can simply be corrected. If deemed genuine
outliers, a decision has to be made about whether or not to remove them.

Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the longer tail
of the sample distribution.

 For a positively-skewed distribution, the mean is larger than the median.


 For a negatively-skewed distribution, the mean is smaller than the median.
 For an exactly symmetric distribution, the mean and median are equal.

The Sample Mode

The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most
often) in the data.

For our country data, the modal region is 1 (Africa) and the mode of the level of democracy is 0.

The mode is not very useful for continuous variables which have many different values, such as
GDP per capita.

A variable can have several modes (i.e. be multimodal). For example, GDP per capita has modes
0.8 and 1.9, both with 5 countries out of the 155. The mode is the only measure of central tendency
which can be used even when the values of a variable have no ordering, such as for the (nominal)
region variable.

3.4 Descriptive Statistics – Measures of Spread

Central tendency is not the whole story. The two sample distributions shown below have the same
mean, but they are clearly not the same. In one (red) the values have more dispersion (variation)
than in the other.

One might imagine these represent the sample distributions of the daily returns of two stocks,
both with a mean of 0%.

The black stock exhibits a smaller variation, hence we may view this as a safer stock – although
there is little chance of a large positive daily return, there is equally little chance of a large negative
daily return. In contrast, the red stock would be classified as a riskier stock – now there is a non-
negligible chance of a large positive daily return, however this coincides with an equally non-
negligible chance of a large negative daily return, i.e. a loss.

Example: A small example determining the sum of the squared deviations from the (sample)
mean, used to calculate common measures of dispersion.

Variance and Standard Deviation

Example: Consider the following simple data set:

Sample Quartiles

The median, q50, is basically the value which divides the sample into the smallest 50% of observations and the largest 50%. If we consider other percentage splits, we get other (sample) quantiles (percentiles), qc. Some special quantiles are given below.

 The first quartile, q25 or Q1, is the value which divides the sample into the smallest 25% of observations and the largest 75%.
 The third quartile, q75 or Q3, divides the sample into the smallest 75% of observations and the largest 25%.
 The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to speak), and the maximum, X(n) (the ‘100% quantile’).

These are no longer ‘in the middle’ of the sample, but they are more general measures of location
of the sample distribution. Two measures based on quartile-type statistics are the:
 range: X(n) – X(1) = maximum – minimum
 interquartile range (IQR): IQR = q75 – q25 = Q3 – Q1.

The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes
of the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle
50% of the distribution, so it is completely insensitive to outliers.
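As a sketch, these measures of spread can be computed with NumPy. Note that the sample variance and standard deviation use the n − 1 divisor (ddof=1), and that different software packages use slightly different interpolation rules for quartiles, so quartile values may differ marginally between packages:

```python
import numpy as np

x = np.array([1, 2, 4, 5, 8, 100])        # the small dataset used earlier

var_s = np.var(x, ddof=1)                  # sample variance (divides by n - 1)
sd_s = np.std(x, ddof=1)                   # sample standard deviation

q1, q3 = np.percentile(x, [25, 75])        # first and third quartiles
iqr = q3 - q1                              # interquartile range
rng = x.max() - x.min()                    # range

print(var_s, sd_s, q1, q3, iqr, rng)
```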

Boxplots

A boxplot (in full, a box-and-whiskers plot) summarizes some key features of a sample
distribution using quartiles. The plot is comprised of the following.
 The line inside the box, which is the median.
 The box, whose edges are the first and third quartiles (Q1 and Q3). Hence the box
captures the middle 50% of the data. Therefore, the length of the box is the interquartile
range.
 The bottom whisker extends either to the minimum or up to a length of 1.5 times the
interquartile range below the first quartile, whichever is closer to the first quartile.
 The top whisker extends either to the maximum or up to a length of 1.5 times the
interquartile range above the third quartile, whichever is closer to the third quartile.
 Points beyond 1.5 times the interquartile range below the first quartile or above the third
quartile are regarded as outliers, and plotted as individual points.

A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed
distribution, as does a median line not in the middle of the box. The boxplot below is of GDP per
capita using the sample of 155 countries.

3.5 The Normal Distribution

The normal distribution is by far the most important probability distribution in statistics.
This is for three broad reasons.

 Many variables have distributions which are approximately normal, for example heights of
humans or animals, and weights of various products.

 The normal distribution has extremely convenient mathematical properties, which make it
a useful default choice of distribution in many contexts.

 Even when a variable is not itself even approximately normally distributed, functions of
several observations of the variable (sampling distributions) are often approximately
normal, due to the central limit theorem (covered in Module 5.5). Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed later in
the course.

The figure below shows three normal distributions with different means and/or variances.

 N(0, 1) and N(5, 1) have the same dispersion but different location: the N(5, 1) curve is
identical to the N(0, 1) curve, but shifted 5 units to the right.
 N(0, 1) and N(0, 9) have the same location but different dispersion: the N(0, 9) curve is
centered at the same value, 0, as the N(0, 1) curve, but spread out more widely.

Linear Transformations of the Normal Distribution

We now consider one of the convenient properties of the normal distribution. Suppose X is a
random variable, and we consider the linear transformation Y = aX + b, where a and b are constants. If X ~ N(μ, σ²) and a ≠ 0, then Y = aX + b ~ N(aμ + b, a²σ²), i.e. a linear transformation of a normal random variable is itself normally distributed. In particular, Z = (X − μ)/σ ~ N(0, 1), so any normal random variable can be standardized.

3.6 Variance of Random Variables

One very important average associated with a distribution is the expected value of the square of
the deviation of the random variable from its mean, 𝜇. This can be seen to be a measure – not
the only one, but the most widely used by far – of the dispersion of the distribution and is known
as the variance of the random variable. We distinguish between two different types of variance:

 the sample variance, S2, which is a measure of the dispersion in a sample dataset
 the population variance, Var(X) = 𝜎2, which reflects the variance of the whole population,
i.e. the variance of a probability distribution.

We have previously defined the sample variance as S² = ∑(Xᵢ − X̄)² / (n − 1), where the sum runs over i = 1, …, n.

In essence, this is simply an average. Specifically, the average squared deviation of the data
about the sample mean (The division by n – 1, rather than by n, ensures that the sample variance
estimates the population variance correctly on average – known as an ‘unbiased estimator’). We
define the population variance in an analogous way, i.e. we define it to be the average squared
deviation about the population mean.

Recall that the population mean is a probability-weighted average, μ = E(X) = ∑ x P(X = x). The population variance is defined analogously as the probability-weighted average of the squared deviations:

σ² = Var(X) = E[(X − μ)²] = ∑ (x − μ)² P(X = x)

and this represents the dispersion of a (discrete) probability distribution.

Example:

Returning to the example of a fair die, we had the following probability distribution:
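The probability table is not reproduced here, but as a sketch: each score 1, …, 6 has probability 1/6, and both the population mean and the population variance are probability-weighted averages:

```python
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6                                   # fair die: equal probabilities

mu = sum(x * p for x, p in zip(values, probs))      # E(X) = 3.5
var = sum((x - mu) ** 2 * p
          for x, p in zip(values, probs))           # Var(X) = 35/12, approx. 2.92

print(mu, var)
```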

Probabilities for Any Normal Distribution

Example:

Some Probabilities Around the Mean

The following results hold for all normal distributions:

 P(μ − σ < X < μ + σ) ≈ 0.683
 P(μ − 2σ < X < μ + 2σ) ≈ 0.954
 P(μ − 3σ < X < μ + 3σ) ≈ 0.997

The first two of these are illustrated graphically in the figure below.
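These probabilities can be verified numerically for the standard normal distribution (and hence, by standardizing, for any normal distribution), for example with SciPy:

```python
from scipy.stats import norm

# P(mu - k*sigma < X < mu + k*sigma) equals P(-k < Z < k) for standard normal Z
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 3))   # 0.683, 0.954, 0.997
```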

ACTIVITIES/ASSESSMENT:

Directions: Solve each problem and write the complete solution.

Given: Four observations are obtained: 7, 9, 10 and 11. For these four values, derive the following:

1. The sample mean

2. The sample median

3. Show that ∑(xᵢ − x̄) = 0, where the sum runs over i = 1, 2, 3, 4.

4. Compute the sample variance

5. Compute the sample standard deviation.

6. State how many of the four observations lie in the interval [X̄ − 2S, X̄ + 2S].

Watch:

Scales of Measurement - Nominal, Ordinal, Interval, Ratio (Part 1) - Introductory Statistics


https://www.youtube.com/watch?v=KIBZUk39ncI

Finding mean, median, and mode | Descriptive statistics | Probability and Statistics | Khan
Academy
https://www.youtube.com/watch?v=k3aKKasOmIw

Normal Distribution - Explained Simply (part 1)


https://www.youtube.com/watch?v=xgQhefFOXrM

Expected Value and Variance of Discrete Random Variables


https://www.youtube.com/watch?v=OvTEhNL96v0

MODULE 4 – INFERENTIAL STATISTICS

OVERVIEW:
Statistical inference involves inferring unknown characteristics of a population based on observed sample data. Inference can be subdivided into two main branches: estimation, the focus of this module (Module 4), and hypothesis testing, the focus of Module 5. Conceptually, what are we trying to achieve? The word ‘inference’ means to infer something about a wider population based on an observed sample of data.
When we carry out statistical analysis, we tend to view the data we observe as a sample drawn from some wider population. In everyday use, the word ‘population’ may refer to the population of a country or a city, and we may indeed study such populations, but we are not confined to that kind of definition. A population does not even have to refer to human beings: it may be the population of companies whose shares are listed on some stock exchange, the population of fish in the sea, or the planets in the universe. At the heart of statistical inference is the assumption that our sample is fairly representative of that wider population, and our goal when selecting a sample is to achieve this representativeness. That may sound straightforward, but it is often easier said than done.

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Summarize common data collection methods.
2. Explain what a sampling distribution is.
3. Discuss the principles of point and interval estimation.

COURSE MATERIALS:
4.1 Introduction to Sampling

Sampling is a key component of any research design. The key to the use of statistics in research
is being able to take data from a sample and make inferences about a large population. This idea
is depicted below.

Sampling design involves several basic questions.
 Should a sample be taken?
 If so, what process should be followed?
 What kind of sample should be taken?
 How large should it be?

We now consider how to answer these questions.

Sample or Census?

We introduce some important terminology.


 Population – The aggregate of all the elements, sharing some common set of
characteristics, which comprise the universe for the purpose of the problem being
investigated.
 Census – A complete enumeration of the elements of a population or study objects.
 Sample – A subgroup of the elements of the population selected for participation in the
study.

To determine whether a sample or a census should be conducted, various factors need to be considered.
 A census is very costly, so a large budget would be required, whereas a small budget
favors a sample because fewer population elements are observed.
 The length of time available for the study is important – a sample is far quicker to collect.
 How large is the population? If it is ‘small’, then it is feasible to conduct a census (it would
not be too costly nor too time-consuming). However, it might not be practical to enumerate
a ‘large’ population.
 We will be interested in some particular characteristic, such as the heights of a group of
adults. If there is a small variance of the characteristic of interest, then population elements
are ‘similar’, so we only need to observe a few elements to have a clear idea about the
characteristic. If the variance is large, then a sample may fail to capture the large
dispersion in the population, hence a census would be more appropriate.
 Sampling errors occur when the sample fails to adequately represent the population. If the
consequences of making sampling errors are extreme (i.e. the ‘cost’ is high), then a
census would appeal more since it eliminates sampling errors completely.
 If non-sampling errors are costly (for example, an interviewer incorrectly questioning
respondents) then a sample is better because fewer resources would have been spent on
collecting the data.
 Measuring sampled elements may result in the destruction of the object, such as testing
the road-life of a tire. Clearly, in such cases a census is not feasible as there would be no
tires left to sell!
 Sometimes we may wish to perform an in-depth interview to study elements in great detail.
If we want to focus on detail, then time and budget constraints would favor a sample.

The conditions which favor the use of a sample or census are summarized in the table below. Of
course, in practice, some of our factors may favor a sample while others favor a census, in which
case a balanced judgment is required.

Classification of Sampling Techniques

We draw a sample from the target population, which is the collection of elements or objects
which possess the information sought by the researcher and about which inferences are to be
made. We now consider the different types of sampling techniques which can be used in practice,
which can be decomposed into non-probability sampling techniques and probability
sampling techniques.

Non-probability sampling techniques are characterized by the fact that some units in the
population do not have a chance of selection in the sample. Other individual units in the population
have an unknown probability of being selected. There is also an inability to measure sampling
error. Examples of such techniques are:

 convenience sampling
 judgmental sampling
 quota sampling
 snowball sampling.

We now consider each of the listed techniques, explaining their strengths and weaknesses. To
illustrate each, we will use the example of 25 students (labelled ‘1’ to ‘25’) spread across five classes (labelled ‘A’ to ‘E’) as follows:

Convenience Sampling

Convenience sampling attempts to obtain a sample of convenient elements (hence the name).
Often, respondents are selected because they happen to be in the right place at the right time.
Examples include using students and members of social organizations; also ‘people-in-the-street’
interviews.

Suppose class D happens to assemble at a convenient time and place, so all elements (students)
in this class are selected. The resulting sample consists of students 16, 17, 18, 19 and 20. Note
in this case there are no students selected from classes A, B, C and E.

Strengths of convenience sampling include being the cheapest, quickest and most convenient
form of sampling. Weaknesses include selection bias and lack of a representative sample.

Judgmental Sampling

Judgmental sampling is a form of convenience sampling in which the population elements are
selected based on the judgment of the researcher. Examples include purchase engineers being
selected in industrial market research; also expert witnesses used in court. Suppose a researcher
believes classes B, C and E to be ‘typical’ and ‘convenient’. Within each of these classes one or
two students are selected based on typicality and convenience. The resulting sample here
consists of students 8, 10, 11, 13 and 24. Note in this case there are no students selected from
classes A and D.

Judgmental sampling is achieved at low cost, is convenient, not particularly time-consuming and
good for ‘exploratory’ research designs. However, it does not allow generalizations and is
subjective due to the judgment of the researcher.

Quota Sampling

Quota sampling may be viewed as two-stage restricted judgmental sampling. The first stage
consists of developing control categories, or quota controls, of population elements. In the second
stage, sample elements are selected based on convenience or judgment. Suppose a quota of
one student from each class is imposed. Within each class, one student is selected based on
judgment or convenience. The resulting sample consists of students 3, 6, 13, 20 and 22.

Quota sampling is advantageous in that a sample can be controlled for certain characteristics.
However, it suffers from selection bias and there is no guarantee of representativeness of the
sample.

Snowball Sampling

In snowball sampling an initial group of respondents is selected, usually at random. After being
interviewed, these respondents are asked to identify others who belong to the target population
of interest. Subsequent respondents are selected based on these referrals. Suppose students 2
and 9 are selected randomly from classes A and B. Student 2 refers students 12 and 13, while
student 9 refers student 18. The resulting sample consists of students 2, 9, 12, 13 and 18. Note
in this case there are no students from class E included in the sample.

Snowball sampling has the major advantage of being able to increase the chance of locating the
desired characteristic in the population and is also fairly cheap. However, it can be time-
consuming.

4.2 Random Sampling

We have previously seen that the term target population represents the collection of units
(people, objects etc.) in which we are interested. In the absence of time and budgetary constraints
we conduct a census, that is a total enumeration of the population. Its advantage is that there is
no sampling error because all population units are observed and so there is no estimation of
population parameters. Due to the large size, N, of most populations, an obvious disadvantage
with a census is cost, so it is often not feasible in practice. Even with a census non-sampling error
may occur, for example if we have to resort to using cheaper (hence less reliable) interviewers
who may erroneously record data, misunderstand a respondent etc.

So we select a sample, that is a certain number of population members are selected and studied.
The selected members are known as elementary sampling units. Sample surveys (hereafter
‘surveys’) are how new data are collected on a population and tend to be based on samples rather
than a census. Selected respondents may be contacted in a variety of methods such as face-to-
face interviews, telephone, mail or email questionnaires.

Sampling error will occur (since not all population units are observed). However, non-sampling
error should be less since resources can be used to ensure high quality interviews or to check
completed questionnaires.

Types of Error

Several potential sources of error can affect a research design which we do our utmost to control.
The ‘total error’ represents the variation between the true value of a parameter in the population
of the variable of interest (such as a population mean) and the observed value obtained from the
sample. Total error is composed of two distinct types of error in sampling design.
 Sampling error occurs as a result of us selecting a sample, rather than performing a
census (where a total enumeration of the population is undertaken).
- It is attributable to random variation due to the sampling scheme used.
- For probability sampling, we can estimate the statistical properties of the
sampling error, i.e. we can compute (estimated) standard errors which
facilitate the use of hypothesis testing and the construction of confidence
intervals.
 Non-sampling error is a result of (inevitable) failures of the sampling scheme. In practice
it is very difficult to quantify this sort of error, typically through separate investigation. We
distinguish between two sorts of non-sampling error:
- Selection bias – this may be due to (1) the sampling frame not being equal
to the target population, or (2) in cases where the sampling scheme is not
strictly adhered to, or (3) non-response bias.
- Response bias – the actual measurements might be wrong, for example
ambiguous question wording, misunderstanding of a word in a
questionnaire, or sensitivity of information which is sought. Interviewer bias
is another aspect of this, where the interaction between the interviewer and
interviewee influences the response given in some way, either intentionally
or unintentionally, such as through leading questions, the dislike of a
particular social group by the interviewer, the interviewer's manner or lack
of training, or perhaps the loss of a batch of questionnaires from one local
post office. These could all occur in an unplanned way and bias your survey
badly.

Both kinds of error can be controlled or allowed for more effectively by a pilot survey. A pilot
survey is used:
 to find the standard error which can be attached to different kinds of questions and hence
to underpin the sampling design chosen
 to sort out non-sampling questions, such as:
- do people understand the questionnaires?
- are our interviewers working well?
- are there particular organizational problems associated with this enquiry?

Probability Sampling

Probability sampling techniques mean every population element has a known, non-zero
probability of being selected in the sample. Probability sampling makes it possible to estimate the
margins of sampling error, therefore all statistical techniques (such as confidence intervals and
hypothesis testing – covered later in the module) can be applied.

In order to perform probability sampling, we need a sampling frame which is a list of all population
elements. However, we need to consider whether the sampling frame is:
i. Adequate (does it represent the target population?)
ii. Complete (are there any missing units, or duplications?)
iii. Accurate (are we researching dynamic populations?)
iv. Convenient (is the sampling frame readily accessible?).

Examples of probability sampling techniques are:


 simple random sampling
 systematic sampling
 stratified sampling
 cluster sampling
 multistage sampling.

In this and Section 4.3 we illustrate each of these techniques using the same example from
Section 4.1, i.e. we consider 25 students, numbered 1 to 25, spread across five classes,
A to E.

Simple Random Sampling (SRS)

In a simple random sample each element in the population has a known and equal probability of
selection. Each possible sample of a given size, n, has a known and equal probability of being
the sample which is actually selected. This implies that every element is selected independently
of every other element.

Suppose we select five random numbers (using a random number generator) from 1 to 25.
Suppose the random number generator returns 3, 7, 9, 16 and 24. The resulting sample therefore
consists of students 3, 7, 9, 16 and 24. Note in this case there are no students from class C.
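A minimal sketch of drawing such a simple random sample in Python (the seed is fixed only so the output is reproducible; the resulting student numbers are illustrative):

```python
import random

population = list(range(1, 26))                  # students numbered 1 to 25

random.seed(1)                                   # illustrative seed for reproducibility
sample = sorted(random.sample(population, 5))    # SRS of n = 5 without replacement
print(sample)
```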

SRS is simple to understand and results are readily projectable. However, there may be difficulty
constructing the sampling frame, lower precision (relative to other probability sampling methods)
and there is no guarantee of sample representativeness.

4.3 Further Random Sampling

Systematic Sampling

In systematic sampling, the sample is chosen by selecting a random starting point and then
picking every ith element in succession from the sampling frame. The sampling interval, i, is
determined by dividing the population size, N, by the sample size, n, and rounding to the nearest
integer. When the ordering of the elements is related to the characteristic of interest, systematic
sampling increases the representativeness of the sample. If the ordering of the elements
produces a cyclical pattern, systematic sampling may actually decrease the representativeness
of the sample.

Example:

Systematic sampling may or may not increase representativeness – it depends on whether there
is any ‘ordering’ in the sampling frame. It is easier to implement relative to SRS.
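A sketch of systematic sampling for the same 25 students: with N = 25 and n = 5 the sampling interval is i = 25/5 = 5, so we pick a random starting point between 1 and 5 and then take every 5th student thereafter:

```python
import random

N, n = 25, 5
i = round(N / n)                         # sampling interval

start = random.randint(1, i)             # random starting point between 1 and i
sample = list(range(start, N + 1, i))    # every i-th element thereafter
print(sample)                            # e.g. a start of 3 gives [3, 8, 13, 18, 23]
```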

Stratified Sampling

Stratified sampling is a two-step process in which the population is partitioned (divided up) into
subpopulations known as strata. The strata should be mutually exclusive and collectively
exhaustive in that every population element should be assigned to one and only one stratum and
no population elements should be omitted. Next, elements are selected from each stratum by a
random procedure, usually SRS. A major objective of stratified sampling is to increase the
precision of statistical inference without increasing cost.

The elements within a stratum should be as homogeneous as possible (i.e. as similar as possible),
but the elements between strata should be as heterogeneous as possible (i.e. as different as
possible). The stratification factors should also be closely related to the characteristic of
interest. Finally, the factors (variables) should decrease the cost of the stratification process by
being easy to measure and apply.

In proportionate stratified sampling, the size of the sample drawn from each stratum is
proportional to the relative size of that stratum in the total population. In disproportionate
(optimal) stratified sampling, the size of the sample from each stratum is proportional to the
relative size of that stratum and to the standard deviation of the distribution of the characteristic
of interest among all the elements in that stratum. Suppose we randomly select a number from 1
to 5 for each class (stratum) A to E. This might result, say, in the stratified sample consisting of
students 4, 7, 13, 19 and 21. Note in this case one student is selected from each class.

Stratified sampling includes all important subpopulations and ensures a high level of precision.
However, sometimes it might be difficult to select relevant stratification factors and the
stratification process itself might not be feasible in practice if it was not known to which stratum
each population element belonged.

Cluster Sampling

In cluster sampling the target population is first divided into mutually exclusive and collectively
exhaustive subpopulations known as clusters. A random sample of clusters is then selected,
based on a probability sampling technique such as SRS. For each selected cluster, either all the
elements are included in the sample (one-stage cluster sampling), or a sample of elements is
drawn probabilistically (two-stage cluster sampling).

Elements within a cluster should be as heterogeneous as possible, but clusters themselves should be as homogeneous as possible. Ideally, each cluster should be a small-scale
representation of the population. In probability proportionate to size sampling, the clusters are
sampled with probability proportional to size. In the second stage, the probability of selecting a
sampling unit in a selected cluster varies inversely with the size of the cluster.

Suppose we randomly select three clusters: B, D and E. Within each cluster, we randomly select
one or two elements. The resulting sample here consists of students 7, 18, 20, 21 and 23. Note
in this case there are no students selected from clusters A and C.

Cluster sampling is easy to implement and cost effective. However, the technique suffers from a
lack of precision and it can be difficult to compute and interpret results.

Multistage Sampling

In multistage sampling selection is performed at two or more successive stages. This technique
is often adopted in large surveys. At the first stage, large ‘compound’ units are sampled (primary
units), and several sampling stages of this type may be performed until we at last sample the
basic units.

The technique is commonly used in cluster sampling so that we are at first sampling the main
clusters, and then clusters within clusters etc. We can also use multistage sampling with mixed
techniques, i.e. cluster sampling at Stage 1 and stratified sampling at Stage 2 etc.

An example might be a national survey of salespeople in a company. Sales areas could be identified, from which a random selection of areas is taken. Instead of interviewing every
person in the chosen clusters (which would be a one-stage cluster sample), only randomly
selected salespeople within the chosen clusters will be interviewed.

4.4 Sampling Distributions

A simple random sample is a sample selected by a process where every possible sample (of
the same size, n) has the same probability of selection. The selection process is left to chance,
therefore eliminating the effect of selection bias. Due to the random selection mechanism, we
do not know (in advance) which sample will occur. Every population element has a known, non-
zero probability of selection in the sample, but no element is certain to appear.

Example:

Consider a population of size N = 6 elements: A, B, C, D, E and F.

We consider all possible samples of size n = 2 (without replacement, i.e. once an object has been
chosen it cannot be selected again).

There are 15 different, but equally likely, such samples:

AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF.

Since this is a simple random sample, each sample has a probability of selection of 1/15.
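The actual population values behind this example are not reproduced here, but the machinery can be sketched with hypothetical values chosen so that, as in the example, the population mean is 6 and the population variance is 4. Enumerating all 15 samples shows that the average of the sample means equals the population mean, and that the sampling variance agrees with the finite-population formula given later in this module:

```python
from itertools import combinations
from statistics import mean, pvariance

population = [3, 4, 6, 7, 7, 9]    # hypothetical values for A-F (illustrative only)

samples = list(combinations(population, 2))   # all 15 possible samples of size n = 2
sample_means = [mean(s) for s in samples]

print(len(samples))                           # 15 equally likely samples
print(mean(sample_means), mean(population))   # E(X-bar) equals mu (unbiasedness)
print(pvariance(sample_means))                # sampling variance of X-bar
print(pvariance(population) / 2 * (6 - 2) / (6 - 1))   # (sigma^2/n)(N-n)/(N-1) = 1.6
```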

A population has particular characteristics of interest such as the mean, 𝜇, and variance, 𝜎2.
Collectively, we refer to these characteristics as parameters. If we do not have population data,
the parameter values will be unknown.

‘Statistical inference’ is the process of estimating the (unknown) parameter values using the
(known) sample data.

We use a statistic (called an estimator) calculated from sample observations to provide a point
estimate of a parameter.

4.5 Sampling distribution of the sample mean

Like any distribution, we care about a sampling distribution's mean and variance. Together, we
can assess how ‘good’ an estimator is. First, consider the mean. We seek an estimator which
does not mislead us systematically. So the ‘average’ (mean) value of an estimator, over all
possible samples, should be equal to the population parameter itself.

Returning to our example:

An important difference between a sampling distribution and other distributions is that the values
in a sampling distribution are summary measures of whole samples (i.e. statistics, or estimators)
rather than individual observations.

Formally, the mean of a sampling distribution is called the expected value of the estimator,
denoted by E(∙).

Hence the expected value of the sample mean is E(𝑋̅).

An unbiased estimator has its expected value equal to the parameter being estimated. For our
example, E(𝑋̅) = 6 = 𝜇.

Fortunately, the sample mean 𝑋̅ is always an unbiased estimator of 𝜇 in simple random sampling,
regardless of the:
 sample size, n
 distribution of the (parent) population.

This is a good illustration of a population parameter (here, 𝜇) being estimated by its sample
counterpart (here, 𝑋̅).

The unbiasedness of an estimator is clearly desirable. However, we also need to take into account
the dispersion of the estimator's sampling distribution. Ideally, the possible values of the estimator
should not vary much around the true parameter value. So, we seek an estimator with a small
variance.

Recall the variance is defined to be the mean of the squared deviations about the mean of the
distribution. In the case of sampling distributions, it is referred to as the sampling variance.

Returning to our example:

Hence the sampling variance is 24/15 = 1.6

The population itself has a variance, the population variance, 𝜎2.

Hence the population variance is 𝜎2 = 24/6 = 4

We now consider the relationship between 𝜎2 and the sampling variance. Intuitively, a larger 𝜎2
should lead to a larger sampling variance. For population size N and sample size n, we note the
following result when sampling without replacement:

Var(X̄) = (σ²/n) × (N − n)/(N − 1)

So for our example, we get Var(X̄) = (4/2) × (6 − 2)/(6 − 1) = 2 × 0.8 = 1.6.

We use the term standard error to refer to the standard deviation of the sampling distribution, so S.E.(X̄) = √Var(X̄) = (σ/√n) × √((N − n)/(N − 1)).

Some implications are the following:
 As the sample size n increases, the sampling variance decreases, i.e. the precision
increases.
 Provided the sampling fraction, n/N, is small, the term (N − n)/(N − 1) ≈ 1, so it can be ignored. Therefore, the precision depends effectively on n only.

Returning to our example, the larger the sample, the less variability there will be between samples.

We can see that there is a striking improvement in the precision of the estimator, because the
variability has decreased considerably.

The range of possible x̄ values narrows from [3.5, 8.0] (with n = 2) to [5.0, 7.25] (with n = 4). The sampling variance is reduced from 1.6 to 0.4.

The factor (N – n)/(N – 1) decreases steadily as n→N. When n = 1 the factor equals 1, and when
n = N it equals 0.

When sampling without replacement, increasing n must increase precision since less of the
population is left out. In much practical sampling N is very large (for example, several million),
while n is comparably small (at most 1,000, say).

Therefore, in such cases the factor (N – n)/(N – 1) is close to 1, hence Var(X̄) ≈ σ²/n and the standard error is approximately σ/√n.
Example:

Sampling from the Normal Distribution

Example:

4.6 Confidence Intervals

A point estimate (such as a sample mean, 𝑥̅ ) is our ‘best guess’ of an unknown population
parameter (such as a population mean, 𝜇) based on sample data. Although:
𝐸(𝑋̅) = 𝜇
meaning that on average the sample mean is equal to the population mean, as it is based on a
sample there is some uncertainty (imprecision) in the accuracy of the estimate. Different random
samples would tend to lead to different observed sample means. Confidence intervals
communicate the level of imprecision by converting a point estimate into an interval estimate.

Formally, an x% confidence interval covers the unknown parameter with x% probability over repeated samples. The shorter the confidence interval, the more precise the interval estimate.

As we shall see, this is achievable by:


 reducing the level of confidence (undesirable)
 increasing the sample size (costly).

If we assume we have either (i) known 𝜎, or (ii) unknown 𝜎 but a large sample size, say n ≥ 50,
then the endpoints of a confidence interval for a single mean are:

x̄ ± z × σ/√n   (known σ)   or   x̄ ± z × s/√n   (unknown σ, large n)

Here 𝑥̅ is the sample mean, 𝜎 is the population standard deviation, s is the sample standard
deviation, n is the sample size and z is the confidence coefficient, reflecting the confidence
level.

Influences on the Margin of Error

More simply, we can view the confidence interval for a mean as:
best guess ± margin of error

where x̄ is the best guess, and the margin of error is z × σ/√n (or z × s/√n when σ is unknown).

Therefore, we see that there are three influences on the size of the margin of error (and hence on the width of the confidence interval). Specifically:

 the confidence level (through the confidence coefficient z) – a higher confidence level gives a larger margin of error
 the standard deviation, σ or s – more variable data give a larger margin of error
 the sample size, n – a larger sample gives a smaller margin of error.

Confidence Coefficients

For a 95% confidence interval, z = 1.96, leading to the interval x̄ ± 1.96 × σ/√n (or x̄ ± 1.96 × s/√n).

Other levels of confidence pose no problem, but require a different confidence coefficient. For
large n, we obtain this coefficient from the standard normal distribution.

 For 90% confidence, use the confidence coefficient z = 1.645


 For 95% confidence, use the confidence coefficient z = 1.960
 For 99% confidence, use the confidence coefficient z = 2.576
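As a sketch of computing such an interval, assuming hypothetical sample results (x̄ = 50, s = 10, n = 100) and 95% confidence:

```python
from math import sqrt

x_bar, s, n = 50, 10, 100       # hypothetical sample mean, std. deviation, sample size
z = 1.96                        # confidence coefficient for 95% confidence

margin = z * s / sqrt(n)        # margin of error
print((x_bar - margin, x_bar + margin))   # (48.04, 51.96)
```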

Example:

ACTIVITIES/ASSESSMENT:

Direction: Choose the letter of the best answer.

1. Quota sampling requires quota controls.


a. True
b. False

2. Selection bias is an example of:


a. Sampling Error
b. Non-sampling Error

3. A sampling frame is required for:


a. Probability Sampling
b. Non-probability Sampling

4. In systematic sampling, N/n is referred to as:


a. the sampling interval
b. the sampling fraction

5. In stratified random sampling, elements within a stratum should be:


a. as heterogeneous as possible
b. as homogeneous as possible

6. A sampling distribution is the:


a. probability (frequency) distribution of the parameter
b. probability (frequency) distribution of statistics

7. An unbiased estimator has its expected value equal to the:


a. parameter being estimated
b. statistic used as estimator

8. Confidence intervals convert:


a. interval estimates into point estimates
b. point estimates into interval estimates

9. Which of the following does not influence the size of the margin of error when considering the
confidence interval for a population mean?
a. the sample size, n
b. the sample mean, 𝑥̅
c. the standard deviation, 𝜎 or s

10. Other things equal, as the confidence level increases the margin of error:
a. decreases
b. stays the same
c. increases

11. Non-probability sampling techniques, such as convenience and quota sampling, suffer from
selection bias.
a. True
b. False

12. In probability sampling:
a. there is a known, non-zero probability of a population element being selected.
b. some population elements have a zero probability of being selected, while others have
an unknown probability of being selected.

13. In stratified random sampling, elements between strata should be:


a. as heterogeneous as possible
b. as homogeneous as possible

14. We use a statistic (called an estimator) calculated from sample observations to provide a point
estimate of a:
a. parameter
b. random variable

Watch:

Types of Sampling Methods


https://www.youtube.com/watch?v=pTuj57uXWlk

Sampling: Simple Random, Convenience, systematic, cluster, stratified - Statistics Help


https://www.youtube.com/watch?v=be9e-Q-jC-0

Confidence Interval Explained (Calculation and Interpretation)


https://www.youtube.com/watch?v=w3tM-PMThXk

Introduction to Confidence Interval


https://www.youtube.com/watch?v=27iSnzss2wM

MODULE 5 – HYPOTHESIS TESTING

OVERVIEW:
We continue statistical inference with an examination of the fundamentals of hypothesis testing – testing a claim or theory about a population parameter. Can we find evidence to support or refute a claim or theory? Think of hypothesis testing as simple decision theory: we choose between two competing statements, and it is a binary choice, such that based on some data – some evidence – we conclude in favor of one statement (hypothesis) or the other. A legal, or judicial, analogy is where we will begin, because many, if not all, of you are familiar with the concept of a courtroom and a jury.

So let's imagine the following scenario. Suppose I have been a naughty boy – so naughty, in fact, that the police have arrested me on suspicion of committing some crime, let's say murder. Imagine that the police have completed their investigation and we are now in the courtroom. I am the defendant in this trial for murder and you are a member of the jury, because a jury is essentially conducting a hypothesis test: it is choosing between two competing statements, trying to determine whether the defendant is guilty or not guilty of the criminal offense. We will first relate the statistical form of testing to this legal analogy; once we have done that, we will be in a position to consider more statistical versions of hypothesis testing. In our statistical world of hypothesis testing we have two competing statements known as hypotheses, a so-called null hypothesis H0 and an alternative hypothesis H1. The jury would set the following hypotheses: H0, that the defendant is not guilty of the alleged crime, and H1, that the defendant is guilty of said crime. With the presumption of innocence, the jury must assume that the defendant is innocent, i.e. not guilty of the crime, until the evidence becomes sufficiently overwhelming that innocence is an unlikely scenario, at which point the jury would return a verdict of guilty.

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Explain the underlying philosophy of hypothesis testing.
2. Distinguish the different inferential errors in testing.
3. Conduct simple tests of common parameters.

COURSE MATERIALS:

5.1 Statistical Juries

Module 5 considers hypothesis testing, i.e. decision theory whereby we make a binary decision
between two competing hypotheses:

H0 = the null hypothesis and H1 = the alternative hypothesis

The binary decision is whether to ‘reject H0’ or ‘fail to reject H0’. Before we consider statistical
tests, we begin with a legal analogy – the decision of a jury in a court trial.

Example:

In a criminal court, defendants are put on trial because the police suspect they are guilty of a
crime. Of course, the police are biased due to their suspicion of guilt so determination of whether
a defendant is guilty or not guilty is undertaken by an independent (and hopefully objective) jury.

The jury has to decide between the two competing hypotheses:

H0: not guilty and H1: guilty

In most jury-based legal systems around the world there is the ‘presumption of innocence until
proven guilty’. This equates to the jury initially believing H0, which is the working hypothesis. A
jury must continue to believe in the null hypothesis until they feel the evidence presented to the
court proves guilt ‘beyond a reasonable doubt’, which represents the burden of proof required to
establish guilt. In our statistical world of hypothesis testing, this will be known as the significance
level, i.e. the amount of evidence needed to reject H0.

The jury uses the following decision rule to make a judgment. If the evidence is:
 sufficiently inconsistent with the defendant being not guilty, then reject the null hypothesis
(i.e. convict)
 not indicating guilt beyond a reasonable doubt, then fail to reject the null hypothesis – note
that failing to prove guilt does not prove that the defendant is innocent!

Statistical hypothesis testing follows this same logical path.

Miscarriages of Justice

In a perfect world juries would always convict the guilty and acquit the innocent. Sadly, it is not a
perfect world and so sometimes juries reach incorrect decisions, i.e. convict the innocent and
acquit the guilty. One hopes juries get it right far more often than they get it wrong, but this is an
important reminder that miscarriages of justice do occur from time to time, demonstrating that the
jury system is not infallible!

Statistical hypothesis testing also risks making mistakes which we will formally define as Type I
errors and Type II errors in Module 5.2.

Note: The jury is not testing whether the defendant is guilty, rather the jury is testing the hypothesis
of not guilty. Failure to reject H0 does not prove innocence, rather the jury concludes the evidence
is not sufficiently inconsistent with H0 to indicate guilt beyond a reasonable doubt. Admittedly,
what constitutes a ‘reasonable doubt’ is subjective which is why juries do not always reach a
unanimous verdict.

5.2 Type I and Type II Errors

In any hypothesis test there are two types of inferential decision error which could be committed.

Clearly, we would like to reduce the probabilities of these errors as much as possible. These two
types of error are called a Type I error and a Type II error.

 Type I error: rejecting H0 when it is true. This can be thought of as a ‘false positive’.
Denote the probability of this type of error by 𝛼.
 Type II error: failing to reject H0 when it is false. This can be thought of as a ‘false
negative’. Denote the probability of this type of error by 𝛽.

Both errors are undesirable and, depending on the context of the hypothesis test, it could be
argued that either one is worse than the other.

However, on balance, a Type I error is usually considered to be more problematic. (Thinking back
to trials by jury, conventional wisdom is that it is better to let 100 guilty people walk free than to
convict a single innocent person. While you are welcome to disagree, this view is consistent with
Type I errors being more problematic.)

The possible decision space can be presented as:

                        H0 is true           H0 is false
Reject H0               Type I error         Correct decision
Fail to reject H0       Correct decision     Type II error

For example, if H0 is ‘not guilty’ and H1 is ‘guilty’, a Type I error would be finding
an innocent person guilty (bad for him/her), while a Type II error would be finding a guilty person
innocent (bad for the victim/society, but admittedly good for him/her).

The complement of a Type II error, that is 1 − 𝛽, is called the power of the test – the probability
that the test will reject a false null hypothesis. Hence power measures the ability of the test to
reject a false H0, and so we seek the most powerful test for any testing situation.

Unlike 𝛼, we do not directly control the power of the test. However, we can increase it by increasing
the sample size, n (a larger sample size will generally improve the accuracy of our statistical inference).

These concepts can be summarized as conditional probabilities.

We have:

𝛼 = P(reject H0 | H0 is true)   and   𝛽 = P(fail to reject H0 | H0 is false)
Other things equal, if you decrease 𝛼 you increase 𝛽 and vice-versa. Hence there is a trade-off.

Significance Level

Since we control for the probability of a Type I error, 𝛼, what value should this be?

Well, in general we test at the 100𝛼% significance level, for 𝛼 ∈ [0, 1]. The default choice is 𝛼 =
0.05, i.e. we test at the 5% significance level. Of course, this value of 𝛼 is subjective, and a
different significance level may be chosen. The severity of a Type I error in the context of a specific
hypothesis test might for example justify a more conservative or liberal choice for 𝛼.

In fact, noting our look at confidence intervals in Module 4.6, we could view the significance level
as the complement of the confidence level. (Strictly speaking, this would apply to so-called ‘two-
tailed’ hypothesis tests.)

For example:
 a 90% confidence level equates to a 10% significance level
 a 95% confidence level equates to a 5% significance level
 a 99% confidence level equates to a 1% significance level.

5.3 P-values, Effect Size and Sample Size Influences

We introduce p-values, which are our principal tool for deciding whether or not to reject H0.

A p-value is the probability of the event that the ‘test statistic’ takes the observed value or more
extreme (i.e. more unlikely) values under H0. It is a measure of the discrepancy between the
hypothesis H0 and the data evidence.

 A ‘small’ p-value indicates that H0 is not supported by the data.


 A ‘large’ p-value indicates that H0 is not inconsistent with the data.

So p-values may be seen as a risk measure of rejecting H0.

Example:

Suppose one is interested in evaluating the mean income (in $000s) of a community. Suppose
income in the population is modelled as N(𝜇, 25) and a random sample of n = 25 observations is
taken, yielding the sample mean 𝑥̅ = 17.

Independently of the data, three expert economists give their own opinions as follows.
 Dr. A claims the mean income is 𝜇 = 16.
 Ms. B claims the mean income is 𝜇 = 15.
 Mr. C claims the mean income is 𝜇 = 14.

How would you assess these experts' statements?

Here, 𝑋̅~ 𝑁 (𝜇, 𝜎 2 /𝑛) = 𝑁(𝜇, 1). We assess the statements based on this distribution.

If Dr. A's claim is correct, 𝑋̅~ 𝑁 (16, 1). The observed value 𝑥̅ = 17 is one standard deviation away
from 𝜇, and may be regarded as a typical observation from the distribution. Hence there is little
inconsistency between the claim and the data evidence. This is shown below:

If Ms. B's claim is correct, 𝑋̅~ 𝑁 (15, 1). The observed value 𝑥̅ = 17 begins to look a bit ‘extreme’,
as it is two standard deviations away from 𝜇. Hence there is some inconsistency between the
claim and the data evidence. This is shown below:

If Mr. C's claim is correct, 𝑋̅~ 𝑁 (14, 1). The observed value 𝑥̅ = 17 is very extreme, as it is three
standard deviations away from 𝜇. Hence there is strong inconsistency between the claim and the
data evidence. This is shown below:

It follows that:

 Under H0: 𝜇 = 16, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 15) = 𝑃(|𝑋̅ − 16| ≥ 1) = 0.3173
 Under H0: 𝜇 = 15, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 13) = 𝑃(|𝑋̅ − 15| ≥ 2) = 0.0455
 Under H0: 𝜇 = 14, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 11) = 𝑃(|𝑋̅ − 14| ≥ 3) = 0.0027

In summary, we reject the hypothesis 𝜇 = 15 or 𝜇 = 14, as, for example, if the hypothesis 𝜇 = 14
is true, the probability of observing 𝑥̅ = 17, or more extreme values, would be as small as 0.003.
We are comfortable with this decision, as a small probability event would be very unlikely to occur
in a single experiment.

On the other hand, we cannot reject the hypothesis 𝜇 = 16. However, this does not imply that this
hypothesis is necessarily true, as, for example, 𝜇 = 17 or 18 are at least as likely as 𝜇 = 16.

Remember:
not reject ≠ accept.

A statistical test is incapable of ‘accepting’ a hypothesis.
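
These tail probabilities are easy to check numerically. The following is a minimal Python sketch (scipy is assumed to be available; it is not the tool used elsewhere in these materials), based on the N(𝜇, 1) sampling distribution of the sample mean derived above:

from scipy.stats import norm

xbar = 17              # observed sample mean
se = 5 / 25 ** 0.5     # standard error: sigma/sqrt(n) = 5/sqrt(25) = 1

for mu0 in (16, 15, 14):
    z = abs(xbar - mu0) / se
    p_value = 2 * norm.sf(z)        # P(|Xbar - mu0| >= |xbar - mu0|) under H0
    print(mu0, round(p_value, 4))   # prints 0.3173, 0.0455 and 0.0027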

Interpretation of p-values

In practice the statistical analysis of data is performed by computers using statistical or
econometric software packages. Regardless of the specific hypothesis being tested, the
execution of a hypothesis test by a computer returns a p-value. Fortunately, there is a universal
decision rule for p-values.

Module 5.2 explained that we control for the probability of a Type I error through our choice of
significance level, 𝛼, where 𝛼 ∈ [0, 1]. Since p-values are also probabilities, as defined above, we
simply compare p-values with our chosen benchmark significance level, 𝛼.

The p-value decision rule is shown below for 𝛼 = 0.05

Our decision is to reject H0 if the p-value is ≤ 𝜶. Otherwise, H0 is not rejected.

Clearly, the magnitude of the p-value (compared with 𝛼) determines whether or not H0 is rejected.
Therefore, it is important to consider two key influences on the magnitude of the p-value: the
effect size and the sample size.

Effect Size Influence

The effect size reflects the difference between what you would expect to observe if the null
hypothesis is true and what is actually observed in a random experiment. Equality between our
expectation and observation would equate to a zero effect size, which (while not proof that H0 is
true) provides the most convincing evidence in favor of H0. As the difference between our
expectation and observation increases, the data evidence becomes increasingly inconsistent with
H0 making us more likely to reject H0. Hence as the effect size gets larger, the p-value gets
smaller (and so is more likely to be below 𝛼).

To illustrate this idea, consider the experiment of tossing a coin 100 times and observing the
number of heads. Quite rightly, you would not doubt the coin is fair (i.e. unbiased) if you observed
exactly 50 heads as this is what you would expect from a fair coin (50% of tosses would be
expected to be heads, and the other 50% tails). However, it is possible that you are:
 somewhat skeptical that the coin is fair if you observe 40 or 60 heads, say
 even more skeptical that the coin is fair if you observe 35 or 65 heads, say
 highly skeptical that the coin is fair if you observe 30 or 70 heads, say.

In this situation, the greater the difference between the number of heads and tails, the more
evidence you have that the coin is not fair.

In fact, if we test:
H0 : 𝜋 = 0.5 vs. H1 : 𝜋 ≠ 0.5

where 𝜋 = P(heads), for n = 100 tosses of the coin we would expect 50 heads and 50 tails. It can
be shown that for this fixed sample size the p-value is sensitive to the effect size (the difference
between the observed sample proportion of heads and the expected proportion of 0.5) as follows:
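
As an illustration, here is a minimal Python sketch (scipy assumed available, and the normal approximation to the sample proportion is used in place of an exact binomial calculation) which computes the two-sided p-value for several observed numbers of heads out of 100:

import math
from scipy.stats import norm

n, pi0 = 100, 0.5                          # tosses and hypothesized P(heads)
se = math.sqrt(pi0 * (1 - pi0) / n)        # standard error under H0

for heads in (50, 60, 65, 70):
    p_hat = heads / n
    z = (p_hat - pi0) / se
    p_value = 2 * norm.sf(abs(z))          # two-sided p-value
    print(heads, round(abs(p_hat - pi0), 2), round(p_value, 4))

Under this approximation the printed p-values shrink rapidly, from 1 at 50 heads to roughly 0.0455, 0.0027 and 0.0001, as the effect size grows.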

So we clearly see the inverse relationship between the effect size and the p-value. The above is
an example of a sensitivity analysis where we consider the pure influence of the effect size on the
p-value while controlling for (fixing) the sample size. We now proceed to control the effect size to
examine the sample size influence.

Sample Size Influence

Other things equal, a larger sample size should lead to a more representative random sample
and the characteristics of the sample should more closely resemble those of the population
distribution from which the sample is drawn.

In the context of the coin toss, this would mean the observed sample proportion of heads should
converge to the true probability of heads, 𝜋, as 𝑛 → ∞.

As such, we consider the sample size influence on the p-value. For a non-zero effect size (a zero
effect size would result in non-rejection of H0, regardless of n), the p-value decreases as the
sample size increases.

Continuing the coin toss example, let us fix the (absolute) effect size at 0.1, i.e. in each of the
following examples the observed sample proportion of heads differs by a fixed proportion of 0.1
(= 10%).
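
Holding the observed proportion at 0.6 (an effect size of 0.1) and letting n grow, a sketch along the same lines (same assumptions as the previous snippet) is:

import math
from scipy.stats import norm

pi0, effect = 0.5, 0.1                     # hypothesized P(heads) and fixed effect size

for n in (50, 100, 250, 1000):
    z = effect / math.sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * norm.sf(z)               # two-sided p-value
    print(n, round(p_value, 4))            # the p-value falls as n rises, for the same effect size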

So we clearly see the inverse relationship between the sample size and the p-value.

In Module 5.2, we defined the power of the test as the probability that the test will reject a false
null hypothesis. In order to reject the null hypothesis it is necessary to have a sufficiently small
p-value (no greater than 𝛼), hence we can increase the power of a test simply by increasing the
sample size. Of course, the trade-off is an increase in data collection costs.

5.4 Testing a Population Mean Claim

We consider the hypothesis test of a population mean in the context of a claim made by a
manufacturer.

As an example, the amount of water in mineral water bottles exhibits slight variations attributable
to the bottle-filling machine at the factory not putting in identical quantities of water in each bottle.
The labels on each bottle may state ‘500 ml’ but this equates to a claim about the average
contents of all bottles produced (in the population of bottles).

Let X denote the quantity of water in a bottle. It would seem reasonable to assume a normal
distribution for X such that:
𝑋 ~ 𝑁 (𝜇, 𝜎 2 )

And we wish to test:

H0 : 𝜇 = 500ml vs. H1 : 𝜇 ≠ 500ml

Suppose a random sample of n = 100 bottles is to be taken, and let us assume that 𝜎 = 10 ml.
From our work in Module 4.5 we know that:

𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜇, (10)²/100) = 𝑁(𝜇, 1)

Further suppose that the sample mean in our random sample of 100 is 𝑥̅ = 503 ml. Clearly, we
see that:
𝑥̅ = 503 ≠ 500 = 𝜇

where 500 is the claimed value of 𝜇 being tested in H0.

The question is whether the difference between 𝑥̅ = 503 and the claim 𝜇 = 500 is:

(a) due to sampling error (and hence H0 is true)?


(b) statistically significant (and hence H1 is true)?

Determination of the p-value will allow us to choose between explanations (a) and (b). We proceed
by standardizing 𝑋̅ such that:

𝑍 = (𝑋̅ − 𝜇) / (𝜎/√𝑛) ~ 𝑁(0, 1)

acts as our test statistic. Note the test statistic includes the effect size, 𝑋̅ − 𝜇, as well as the
sample size, n.

Using our sample data, we now obtain the test statistic value (noting the influence of both the
effect size and the sample size, and hence ultimately the influence on the p-value):

𝑧 = (503 − 500) / (10/√100) = 3

The p-value is the probability of our test statistic value or a more extreme value conditional on
H0. Noting that H1: 𝜇 ≠ 500, ‘more extreme’ here means a z-score > 3 or < −3. Due to the
symmetry of the standard normal distribution about zero, this can be expressed as:

p-value = 𝑃(|𝑍| ≥ 3) = 0.0027

Note this value can easily be obtained using Microsoft Excel, say, as:

=NORM.S.DIST(-3)*2 or =(1-NORM.S.DIST(3))*2

where the function NORM.S.DIST(z) returns 𝑃(𝑍 ≤ 𝑧) for 𝑍~ 𝑁(0, 1).
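
The same calculation can also be scripted outside Excel; a minimal Python sketch (scipy assumed available) is:

import math
from scipy.stats import norm

xbar, mu0, sigma, n = 503, 500, 10, 100

z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic value = 3.0
p_value = 2 * norm.sf(abs(z))               # two-sided p-value = 0.0027
print(z, round(p_value, 4))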

Recall the p-value decision rule: reject H0 if the p-value is ≤ 𝛼, here with 𝛼 = 0.05.

Therefore, since 0.0027 < 0.05 we reject H0 and conclude that the result is “statistically significant”
at the 5% significance level (and also at the 1% significance level). Hence there is (strong)
evidence that 𝜇 ≠ 500. Since 𝑥̅ > 500 we might go further and suppose that 𝜇 > 500.

Finally, recall the possible decision space:

As we have rejected H0 this means one of two things:


 we have correctly rejected H0
 we have committed a Type I error.

Although the p-value is very small, indicating it is highly unlikely that this is a Type I error,
unfortunately we cannot be certain which outcome has actually occurred.

5.5 The Central Limit Theorem

We have discussed (in Module 4.5) the very convenient result that if a random sample comes
from a normally-distributed population, the sampling distribution of 𝑋̅ is also normal. How about
sampling distributions of 𝑋̅ from other populations?

For this, we can use a remarkable mathematical result, the central limit theorem (CLT). In
essence, the CLT states that the normal sampling distribution of 𝑋̅ which holds exactly for random
samples from a normal distribution, also holds approximately for random samples from nearly any
distribution.

The CLT applies to ‘nearly any’ distribution because it requires that the variance of the population
distribution is finite. If it is not, the CLT does not hold. However, such distributions are not
common.

Suppose that {X1, X2, …, Xn} is a random sample from a population distribution which has mean
𝐸(𝑋𝑖 ) = 𝜇 < ∞ and variance 𝑉𝑎𝑟(𝑋𝑖 ) = 𝜎 2 < ∞, that is with a finite mean and finite variance. Let
𝑋̅𝑛 denote the sample mean calculated from a random sample of size n, then:

lim(𝑛→∞) 𝑃( (𝑋̅𝑛 − 𝜇) / (𝜎/√𝑛) ≤ 𝑧 ) = Φ(𝑧)

for any 𝑧, where Φ(𝑧) = 𝑃(𝑍 ≤ 𝑧) denotes a cumulative probability of the standard normal
distribution.

The " lim " indicates that this is an asymptotic result, i.e. one which holds increasingly well as n
𝑛→∞
increases, and exactly when the sample size is infinite.

In less formal language, the CLT says that for a random sample from nearly any distribution with
mean 𝜇 and variance 𝜎 2 then:

𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛)

approximately, when 𝑛 is sufficiently large. We can then say that 𝑋̅ is asymptotically normally
distributed with mean 𝜇 and variance 𝜎²/𝑛.

The Wide Reach of the CLT

It may appear that the CLT is still somewhat limited, in that it applies only to sample means
calculated from random samples. However, this is not really true, for two main reasons.
 There are more general versions of the CLT which do not require the observations 𝑋𝑖 to
be independent and identically distributed (IID).
 Even the basic version applies very widely, when we realise that the “X” can also be a
function of the original variables in the data. For example, if X and Y are random variables
in the sample, we can also apply the CLT to:
(1/𝑛) ∑ log(𝑋𝑖)   or   (1/𝑛) ∑ 𝑋𝑖 𝑌𝑖   (with both sums over 𝑖 = 1, …, 𝑛)

Therefore, the CLT can also be used to derive sampling distributions for many statistics which do
not initially look at all like 𝑋̅ for a single random variable in a random sample. You may get to do
this in the next topics.

How large is “large n”?

The larger the sample size n, the better the normal approximation provided by the CLT is. In
practice, we have various rules-of-thumb for what is “large enough” for the approximation to be
“accurate enough”. This also depends on the population distribution of 𝑋𝑖. For example:
 for symmetric distributions, even small n is enough
 for very skewed distributions, larger n is required.

For many distributions, n > 50 is sufficient for the approximation to be reasonably accurate.

Example:

In the first case, we simulate random samples of sizes: n = 1, 5, 10, 30, 100 and 1000 from the
Exponential(0.25) distribution (for which 𝜇 = 4 and 𝜎² = 16). This is clearly a skewed distribution,
as shown by the histogram for n = 1 in the figure below.

10,000 independent random samples of each size were generated. Histograms of the values of
𝑋̅ in these random samples are shown. Each plot also shows the approximating normal
distribution, N(4,16/n). The normal approximation is reasonably good already for n = 30, very
good for n = 100, and practically perfect for n = 1000.
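
A simulation of this kind can be sketched in Python with numpy (illustrative only; the seed, the number of replications and the summary printed here are choices made for this sketch, not taken from the original figures):

import numpy as np

rng = np.random.default_rng(2023)
reps = 10_000                               # independent random samples of each size

for n in (1, 5, 10, 30, 100, 1000):
    # Exponential(0.25) has mean 1/0.25 = 4, so numpy's scale parameter is 4
    xbars = rng.exponential(scale=4, size=(reps, n)).mean(axis=1)
    # the simulated mean is close to 4 and the simulated variance close to 16/n, as the CLT predicts
    print(n, round(float(xbars.mean()), 2), round(float(xbars.var()), 3), 16 / n)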

Example:

In the second case, we simulate 10,000 independent random samples of sizes: n = 1, 10, 30, 50,
100 and 1000 from the Bernoulli (0.2) distribution (for which 𝜇 = 0.2 and 𝜎 2 = 0.16).

Here the distribution of 𝑋𝑖 itself is not even continuous, and has only two possible values, 0 and
1. Nevertheless, the sampling distribution of 𝑋̅ can be very well-approximated by the normal
distribution, when n is large enough.

Note that since here 𝑋𝑖 = 1 or 𝑋𝑖 = 0 for all 𝑖, we have 𝑋̅ = (1/𝑛) ∑ 𝑋𝑖 = 𝑚/𝑛, where m is the number of
observations for which 𝑋𝑖 = 1. In other words, 𝑋̅ is the sample proportion of the value X = 1.

The normal approximation is clearly very bad for small n, but reasonably good already for n = 50,
as shown by the histograms below.

Note that as n increases:

 there is convergence to 𝑁(𝜇, 𝜎²/𝑛), i.e. the normal approximation improves
 the sampling variance decreases (although the histograms might at first seem to show
the same variation, look closely at the scale on the x-axes).

Sampling Distribution of the Sample Proportion

The above example considered Bernoulli sampling where we noted that the sample mean was
the sample proportion of successes, which we now denote as P.

Since from the CLT:

𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛)

approximately, when n is sufficiently large, and noting that when X ~ Bernoulli(𝜋) then:

E(𝑋) = 0 × (1 − 𝜋) + 1 × 𝜋 = 𝜋 = 𝜇

and

Var (𝑋) = 𝜋(1 − 𝜋) = 𝜎 2

we have:

𝑋̅ = 𝑃 → 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜋, 𝜋(1 − 𝜋)/𝑛)

as 𝑛 → ∞

We see that:

E(𝑋̅) = E(𝑃) = π

hence the sample proportion is equal to the population proportion, on average. Also:

Var(𝑃) → 0 as 𝑛 → ∞

so the sampling variance tends to zero as the sample size tends to infinity, as we see in the
histograms in the previous examples.

5.6 Proportions: Confidence Intervals and Hypothesis Testing

Recall the (approximate) sampling distribution of the sample proportion:

𝑋̅ = 𝑃 → 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜋, 𝜋(1 − 𝜋)/𝑛)

as 𝑛 → ∞

We will now use this result to conduct statistical inference for proportions.

Confidence Intervals

In Module 4.6 we viewed a confidence interval for a mean as:

best guess ± margin of error

As the sample proportion is a special case of the sample mean, this construct continues to hold.
Here, the:
point estimate = p

where p is the observed sample proportion, and the:
margin of error = confidence coefficient × standard error

The confidence coefficient continues to be a z-value such that:


 for 90% confidence, use the confidence coefficient z = 1.645
 for 95% confidence, use the confidence coefficient z = 1.960
 for 99% confidence, use the confidence coefficient z = 2.576

while the (estimated) standard error is:

√(𝑝(1 − 𝑝)/𝑛)

Therefore, a confidence interval for a proportion is given by:

𝑝 ± 𝑧 × √(𝑝(1 − 𝑝)/𝑛)

Example:

In opinion polling, sample sizes of about 1000 are used as this leads to a margin of error of
approximately three percentage points – deemed an acceptable tolerance on the estimation error
by most political scientists. Suppose 630 out of 1000 voters in a random sample said they would
vote ‘Yes’ in a binary referendum. The sample proportion is:

p = 630/1000 = 0.63

and a 95% confidence interval for 𝜋, the true proportion who would vote ‘Yes’ in the electoral
population, is:

0.63 ± 1.96 × √(0.63 × 0.37/1000) = 0.63 ± 0.03 ⟹ (0.60, 0.66) or (60%, 66%)

demonstrating the three percentage-point margin of error.
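
The interval can be reproduced in a couple of lines; a minimal Python sketch (standard library only):

import math

p, n, z = 0.63, 1000, 1.96                   # sample proportion, sample size, 95% coefficient
moe = z * math.sqrt(p * (1 - p) / n)         # margin of error, approximately 0.03
print(round(p - moe, 3), round(p + moe, 3))  # approximately (0.600, 0.660)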

Hypothesis Testing

Suppose we wish to test:


H0: 𝜋 = 0.4 vs. H1: 𝜋 ≠ 0.4

and a random sample of n = 1000 returned a sample proportion of p = 0.44. To undertake this
test, we follow a similar approach to that outlined in Module 5.4.

We proceed by standardizing P such that:

𝑍 = (𝑃 − 𝜋) / √(𝜋(1 − 𝜋)/𝑛) ~ 𝑁(0, 1)

approximately, for large 𝑛, which is satisfied here since 𝑛 = 1000. Note the test statistic includes
the effect size, 𝑃 − 𝜋, as well as the sample size, 𝑛.

Note: The true standard error of the sample proportion is √(𝜋(1 − 𝜋)/𝑛), which depends on the
unknown 𝜋. In the confidence interval earlier we therefore used the estimated standard error,
replacing 𝜋 with 𝑝. In a hypothesis test, however, the test is conducted conditional on H0 being
true, so we use the hypothesized value of 𝜋 (here 0.4) in the standard error.

Using our sample data, we now obtain the test statistic value (noting the influence of both the
effect size and the sample size, and hence ultimately the influence on the p-value):

𝑧 = (0.44 − 0.4) / √(0.4 × (1 − 0.4)/1000) ≈ 2.58

The p-value is the probability of our test statistic value or a more extreme value conditional on
H0. Noting that H1: 𝜋 ≠ 0.4, ‘more extreme’ here means a z-score > 2.58 or < −2.58. Due to the
symmetry of the standard normal distribution about zero, this can be expressed as:

p-value = 𝑃(|𝑍| ≥ 2.58) = 0.0099

Note this value can easily be obtained using Microsoft Excel, say, as:

=NORM.S.DIST(-2.58)*2 or =(1-NORM.S.DIST(2.58))*2

where the function NORM.S.DIST(z) returns 𝑃(𝑍 ≤ 𝑧) for 𝑍 ~ 𝑁(0, 1).
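
As in Module 5.4, the test can also be scripted; a minimal Python sketch (scipy assumed available):

import math
from scipy.stats import norm

p, pi0, n = 0.44, 0.40, 1000

z = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)   # test statistic value, approximately 2.58
p_value = 2 * norm.sf(abs(z))                    # two-sided p-value, approximately 0.0098
print(round(z, 2), round(p_value, 4))

The tiny difference from the 0.0099 quoted above comes only from rounding the test statistic to 2.58 before computing the p-value.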

Recall the p-value decision rule: reject H0 if the p-value is ≤ 𝛼, here with 𝛼 = 0.05.

Therefore, since 0.0099 < 0.05 we reject H0 and conclude that the result is ‘statistically significant’
at the 5% significance level (and also, just, at the 1% significance level). Hence there is (strong)
evidence that 𝜋 ≠ 0.4. Since 𝑝 > 0.4 we might go further and suppose that 𝜋 > 0.4.

Finally, recall the possible decision space:

As we have rejected H0 this means one of two things:


 we have correctly rejected H0
 we have committed a Type I error.

Although the p-value is very small, indicating it is highly unlikely that this is a Type I error,
unfortunately we cannot be certain which outcome has actually occurred.

ACTIVITIES/ASSESSMENT:

Hypothesis Testing:

You are to test the claim by a mineral water bottle manufacturer that its bottles contain an average
of 1000 ml (1 liter). A random sample of n=12 bottles resulted in the measurements (in ml):

992, 1002, 1000, 1001, 998, 999, 1000, 995, 1003, 1001, 997 and 997.

It is assumed that the true variance of water in all bottles is 𝜎 2 = 1.5, and that the amount of water
in bottles is normally distributed. Test the manufacturer's claim at the 1% significance level (you
may use Excel to calculate the p-value). Also, briefly comment on what the hypothesis test result
means about the manufacturer's claim, and if an error might have occurred which type of error it
would be.

In summary, this activity/assessment requires:

1. The calculation of the sample mean from the raw observations.


2. The formulation of the hypotheses, H0 and H1
3. Calculation of the test statistic value
4. Calculation of the p-value
5. A decision of whether or not to reject H0
6. An inferential conclusion about what the test result means
7. Indication of which type of error might have occurred.

Watch:
Statistics: Introduction to Hypothesis Testing in Filipino
https://www.youtube.com/watch?v=plAiYXYaqY0

Statistics: Type I and Type II Errors in Filipino


https://www.youtube.com/watch?v=Sdw2E7Xi0Q0

MODULE 6 – APPLICATIONS

OVERVIEW:
We conclude the course with a cross-section of applications of the content covered in previous
modules to more advanced, real-world modelling. We begin with decision tree analysis. Arguably,
this brings us full circle to our initial look, in Module 1, at decision making under uncertainty.
Remember, we need to make decisions in the present without knowing exactly what will happen in
the future, so we face unknown, or uncertain, future outcomes. How do we model this kind of
situation? Decision trees help us along the way.

For example, when you're playing a game of chess, you're deciding which piece to move.
But if you're a good chess player, you're not only deciding on your next move; you're also trying
to anticipate the next move of your opponent. If you're a really good chess player, you'll start to
think not just those two moves ahead but further moves ahead as well. Of course, you cannot know
exactly what your opponent is going to do (there is some uncertainty there), but your decision
making will be based on your expectations of what your opponent may do.

MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Use simple decision tree analysis to model decision-making under uncertainty.
2. Interpret the beta of a stock as a common risk measure used in finance.
3. Describe the principles of linear programming and Monte Carlo simulation.

COURSE MATERIALS:
6.1 Decision Tree Analysis

Module 1 introduced the concept of decision making under uncertainty whereby decisions are
taken in the present with uncertain future outcomes.

Decision tree analysis is an interesting modelling technique which allows us to incorporate


probabilities in the decision-making process to model and quantify the uncertainty.

Of course, Module 2.1 explained that we could determine probabilities using one of three
methods:
 subjectively
 by experimentation (empirically)
 theoretically.

Some examples of managerial decisions under uncertainty include:


 selection of suppliers (which? how many?)
 research and development (R&D) and investment decisions (which project? how many
resources?)
 hiring/promotion decisions (to hire, or not to hire?).

In what follows we will be concerned with decision analysis, i.e. where there is only one rational
decision-maker making non-strategic decisions.

(Game theory involves two or more rational decision-makers making strategic decisions.)

Example:

Imagine you are an ice-cream manufacturer, and for simplicity suppose your level of sales can be
either high or low. Hence high and low sales are mutually exclusive and collectively exhaustive in
this example. (Note: Clearly, in practice a continuum of sales levels could be expected. Modelling
that here would needlessly complicate the analysis – the binary set of outcomes of high and low sales
is sufficient to demonstrate the principle and use of decision tree analysis.)

If sales are high you earn a profit of $300,000 (excluding advertising costs), but if sales are low
you experience a loss of $100,000 (excluding advertising costs).

You have the choice of whether to advertise your product or not. Advertising costs would be fixed
at $100,000.

If you advertise sales are high with a probability of 0.9, but if you do not advertise sales are high
with a probability of just 0.6. Note advertising does not guarantee success (not all advertising
campaigns are successful!) so here we model advertising as increasing the probability of the
‘good’ outcome (i.e. high sales). Viewed as conditional probabilities, these are:

P(high sales | no advertising) = 0.6


P(low sales | no advertising) = 0.4
and:
P(high sales | advertising) = 0.9
P(low sales | advertising) = 0.1

A standard decision tree consists of the following components:


 Decision nodes indicate that the decision-maker has to make a choice, denoted by a
square.
 Chance nodes indicate the resolution of uncertainty, denoted by a circle.
 Branches represent the choices available to the decision-maker (if leading from decision
nodes) or the possible outcomes if uncertainty is resolved (leading from chance nodes).
 Probabilities are written at the branches leading from chance nodes.
 Payoffs are written at the end of the final branches.

A decision tree has the following properties:


 No loops.
 One initial node.
 At most one branch between any two nodes.
 Connected paths.
 At a decision node the decision-maker has information on all preceding events, in
particular on the resolution of uncertainty.

We are now in a position to draw the decision tree for the ice-cream manufacturer problem. Note
that decision trees are read from left to right, representing the time order of events in a logical
manner.

The decision tree is:

On the far left the tree begins with a decision node where we have to decide whether to advertise
or not without knowledge of whether sales turn out to be high or low. After the decision is made,
chance takes over and resolves the realized level of sales according to the respective probability
distribution, ultimately resulting in our payoff. Note that the payoffs in the top half of the decision
tree ($200,000 and – $200,000) are simply the corresponding payoffs of $300,000 and
– $100,000, less the fixed advertising costs of $100,000.

In order to solve the decision tree we calculate the expected monetary value (EMV) of each
option (advertise and not advertise) and proceed whereby the decision-maker maximizes
expected profits.

The EMV is simply an expected value, and so in this discrete setting we apply our usual
probability-weighted average approach. We have:

E(advertise) = 0.9 × $200,000 + 0.1 × −$200,000


= $160,000
and:
E(not advertise) = 0.6 × $300,000 + 0.4 × −$100,000
= $140,000

Hence the optimal (recommended) strategy is to advertise, since this results in a higher expected
payoff ($160,000 > $140,000).
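
The two expected monetary values can be reproduced with a short Python sketch (payoffs and probabilities taken from the decision tree described above):

def emv(outcomes):
    # probability-weighted average payoff over a discrete distribution
    return sum(payoff * prob for payoff, prob in outcomes)

advertise = [(200_000, 0.9), (-200_000, 0.1)]        # payoffs net of the advertising cost
not_advertise = [(300_000, 0.6), (-100_000, 0.4)]

print(emv(advertise), emv(not_advertise))            # 160000.0 versus 140000.0, so advertise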

Remember that an expected value should be viewed as a long-run average. Clearly, by deciding
to advertise we will not make $160,000 – the possible outcomes are either a profit of $200,000
(with probability 0.9) or a loss of $200,000 (with probability 0.1).

However, rather than a “one-shot” game, imagine the game was played annually over 10 years.
By choosing “advertise” each time, then we would expect high sales in 9 years (each time with a
profit of $200,000) and low sales in 1 year (with a loss of $200,000), reflecting the probability
distribution used. Hence in the long run this would average to be $160,000.

Of course, this example has failed to account for risk – the subject of Module 6.2.

6.2 Risk

The decision tree analysis in Module 6.1 was solved by choosing the option with the maximum
expected profit. As such, we only considered the mean (average) outcome.

In practice people care about risk and tend to factor it into their decision making. Of course,
different people have different attitudes to risk so we can profile people's risk appetite as follows.

Degrees of Risk Aversion

A decision-maker is risk-averse if s/he prefers the certain outcome of $x over a risky project with
a mean (EMV) of $x.

A decision-maker is risk-loving (also known as risk-seeking) if s/he prefers a risky project with a
mean (EMV) of $x over the certain outcome of $x.

A decision-maker is risk-neutral if s/he is indifferent between a sure payoff and an uncertain


outcome with the same expected monetary value.

The certainty equivalent (CE) of a risky project is the amount of money which makes the
decision-maker indifferent between receiving this amount for sure and the risky project.

How might we model risk?

Example:

A risky project pays out $100 with a probability of 0.5 and $0 with a probability of 0.5. The certainty
equivalent, X, makes the decision-maker indifferent between the certain outcome X and the risky
project.

Consider the following decision tree:

Let X be the value which makes you indifferent between the safe and risky assets. Immediately
we see that:
E(risky) = 0.5 × $100 + 0.5 × $0 = $50.

Consider a risk-neutral individual, for whom X = 50:

Hence:
E(safe) = 1 × $50 = $50

so this individual does not care about risk and only focuses on the expected return. Therefore, for
X = 50, such an individual sees no difference between the safe and risky assets, even though the
safe asset is risk-free while the risky asset is risky – hence the name.

The decision tree for a risk-averse individual would be:

and for a risk-loving individual:

Example:

In Module 3.4 you saw the following diagram used to illustrate the returns of two stocks:

At the time it was noted that the black stock was the safer stock due to its smaller variation. Hence
depending on an investor's risk appetite, they would invest in the black or red stock accordingly.

Risk Premium

The risk premium (of a risky project) is defined as:

EMV – CE

Interpretation: The amount of money the decision-maker is willing to pay to receive a safe payoff
of X rather than face the risky project with an expected payoff of X.

Risk profiles can be determined using the following:

CE < EMV ⇒ risk-averse


CE = EMV ⇒ risk-neutral
CE > EMV ⇒ risk-loving

If risk-neutral, the decision-maker uses the EMV criterion as seen in Module 6.1.

6.3 Linear Regression

Linear regression analysis is one of the most frequently-used statistical techniques. It aims to
model an explicit relationship between one dependent variable, denoted as y, and one or more
regressors (also called covariates, or independent variables), denoted as x1, …, xp.

The goal of regression analysis is to understand how y depends on x1, …, xp and to predict or
control the unobserved y based on the observed x1, …, xp. We only consider simple examples
with p = 1.

Example:

In a university town, the sales, y, of 10 pizza parlor restaurants are closely related to the student
population, x, in their neighborhoods.

The scatterplot above shows the sales (in thousands of pesos) in a period of three months
together with the numbers of students (in thousands) in their neighborhoods.

We plot y against x, and draw a straight line through the middle of the data points:

𝑦 = 𝛼 + 𝛽𝑥 + 𝜀

where 𝜀 stands for a random error term, 𝛼 is the intercept and 𝛽 is the slope of the straight line.

For a given student population, x, the predicted sales are:

𝑦̂ = 𝛼 + 𝛽𝑥

Some other possible examples of y and x are shown in the following table:

𝑦                             𝑥
Sales                         Price
Weight Gains                  Protein in Diet
Present PSE 100 index         Past PSE 100 index
Consumption                   Income
Salary                        Tenure
Daughter’s Height             Mother’s Height

The Simple Linear Regression Model

We now present the simple linear regression model. Let the paired observations (x1, y1), …, (xn,
yn) be drawn from the model:

𝑦𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜀𝑖

Where:

E(𝜀𝑖 ) = 0 and Var(𝜀𝑖 ) = 𝜎 2 > 0

So the model has three parameters: 𝛼, 𝛽 and 𝜎². In a formal topic on regression you would
consider the following questions:

 How to draw a line through data clouds, i.e. how to estimate 𝛼 and 𝛽?


 How accurate is the fitted line?
 What is the error in predicting a future 𝑦?

Example:

We can apply the simple linear regression model to study the relationship between two series of
financial returns – a regression of a stock's returns, 𝑦, on the returns of an underlying market
index, 𝑥. This regression model is an example of the capital asset pricing model (CAPM).

Stock returns are defined as:

return = (current price − previous price) / previous price ≈ log(current price / previous price)

when the difference between the two prices is small. Daily prices are definitely not independent.
However, daily returns may be seen as a sequence of uncorrelated random variables.

The capital asset pricing model (CAPM) is a simple asset pricing model in finance given by:

𝑦𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜀𝑖

where 𝑦𝑖 is a stock return and 𝑥𝑖 is a market return at time 𝑖.

Note:

Some remarks are the following:

i. 𝛽 measures the market-related (or systematic) risk of the stock.


ii. Market-related risk is unavoidable, while firm-specific risk may be “diversified away”
through hedging.
iii. Variance is a simple measure (and one of the most frequently-used) of risk in finance.

So the ‘beta’ of a stock is a simple measure of the riskiness of that stock with respect to the market
index. By definition, the market index has 𝛽 = 1.

If a stock has a beta of 1, then:


if the market index ↑ by 1%, then the stock ↑ by 1%
and:
if the market index ↓ by 1%, then the stock ↓ by 1%:

If a stock has a beta of 2, then:


if the market index ↑ by 1%, then the stock ↑ by 2%
and:
if the market index ↓ by 1%, then the stock ↓ by 2%:

If a stock has a beta of 0.5, then:


if the market index ↑ by 1%, then the stock ↑ by 0.5%
and:
if the market index ↓ by 1%, then the stock ↓ by 0.5%:

In summary:
if 𝛽 > 1 ⇒ risky stocks
as market movements are amplified in the stock's returns, and:
if 𝛽 < 1 ⇒ defensive stocks
as market movements are muted in the stock's returns.
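
In practice a stock's beta is estimated by least squares from observed returns. The following Python sketch (numpy assumed available) does this on purely hypothetical simulated returns generated with a 'true' beta of 1.5:

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 0.010, size=250)                   # hypothetical daily market returns
y = 0.0002 + 1.5 * x + rng.normal(0.0, 0.005, 250)     # hypothetical stock returns, true beta = 1.5

beta_hat, alpha_hat = np.polyfit(x, y, 1)              # least-squares slope and intercept
print(round(float(beta_hat), 2), round(float(alpha_hat), 5))   # beta_hat should be close to 1.5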

6.4 Linear Programming

Linear programming is probably one of the most widely used types of quantitative business model. It can
be applied in any environment where finite resources must be allocated to competing activities or
processes for maximum benefit, for example:

 selecting an investment portfolio of stocks to maximize return


 allocating a fixed budget between competing departments
 allocating lorries to routes to minimize the transportation costs incurred by a distribution
company.

Optimization Models

All optimization models have several common elements.


 Decision variables, or the variables whose values the decision-maker is allowed to
choose. These are the variables which a company must know to function properly – they
determine everything else.
 Objective function to be optimized – either maximized or minimized.
 Constraints which must be satisfied – physical, logical or economic restrictions,
depending on the nature of the problem.

Microsoft Excel can be used to solve linear programming problems using Solver. Excel has its
own terminology for optimization.

 The changing cells contain values of the decision variables.


 The objective cell contains the objective function to be minimized or maximized.
 The constraints impose restrictions on the values in the changing cells.
 Non-negativity constraints imply that the changing cells must contain non-negative
numbers.

Solving Optimization Problems

The first step is the model development step. You must decide:

 the decision variables, the objective and the constraints


 how everything fits together, i.e. develop correct algebraic expressions and relate all
variables with appropriate formula.

The second step is to optimize.

 A feasible solution is a solution which satisfies all of the constraints.


 The feasible region is the set of all feasible solutions.
 An infeasible solution violates at least one of the constraints.
 The optimal solution is the feasible solution which optimizes the objective.

The third step is to perform a sensitivity analysis – to what extent is the final solution sensitive to
parameter values used in the model. We omit this stage in our simple example below.

Example:

A frequent problem in business is the product mix problem. Suppose a company must decide on
a product mix (how much of each product to introduce) to maximize profit.

Suppose a firm produces two types of chocolate bar – type A bars and type B bars. A type A bar
requires 10 grams of cocoa and 1 minute of machine time. A type B bar requires 5 grams of cocoa
and 4 minutes of machine time. So type A bars are more cocoa-intensive, while type B bars are
more intricate requiring a longer production time.

Altogether 2,000 grams of cocoa and 480 minutes of machine time are available each day.
Assume no other resources are required.

The manufacturer makes 10 dollars profit from each type A bar and 20 dollars profit from each
type B bar. Assume all chocolate bars produced are sold.

Define the decision variables as follows:

 x = quantity of type A bars


 y = quantity of type B bars.

The objective function is:

10x + 20y

which should be maximized subject to the constraints:

10x + 5y ≤ 2000 (cocoa)
x + 4y ≤ 480 (machine time)

We also require non-negativity for the solution to be economically meaningful, so:

x ≥ 0 and y ≥ 0

The maximum (and minimum) values of the objective function lie at a corner of the feasible region.
The two lines intersect when both constraint equations are true, hence:

10x + 5y = 2000 and x + 4y = 480

This happens at the point (160, 80). We can find the values of the objective function at each
corner by substitution into 10x + 20y. The four corners of the feasible region are:

 (0, 0), with profit 10x + 20y = 0


 (0, 120), with profit 10x + 20y = 2400
 (200, 0), with profit 10x + 20y = 2000
 (160, 80), with profit 10x + 20y = 3200.

The optimal solution is a profit of $3,200, which occurs by making 160 type A bars and 80 type B
bars.
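
Besides Excel's Solver (described next), the same product mix problem can be solved programmatically. A minimal Python sketch uses scipy's linprog, which minimizes, so the profit coefficients are negated:

from scipy.optimize import linprog

c = [-10, -20]                      # maximize 10x + 20y by minimizing its negative
A_ub = [[10, 5],                    # cocoa:        10x + 5y <= 2000
        [1, 4]]                     # machine time:  x + 4y  <= 480
b_ub = [2000, 480]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)              # approximately [160. 80.] and a profit of 3200.0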

In the accompanying Excel file, the problem is solved using Solver. Cells A1 and A2 are the
changing cells (where the solution x = 160 and y = 80 is returned). Cell B1 is the objective cell,
with formula =10*A1+20*A2 representing the objective function 10x+20y. Cells C1 and C2 contain
the constraints, with formula =10*A1+5*A2 and =A1+4*A2, respectively, representing the left-
hand sides of the constraints 10x + 5y ≤ 2000 and x + 4y ≤ 480.

With the Solver Add-in loaded, opening it shows that:

 ‘Set Target Cell’ is set to $B$1


 ‘Equal To’ is set to ‘Max’ (since we want to maximize the objective function)
 ‘By Changing Cells’ is set to $A$1:$A$2 identifying the cells where the solution should be
returned
 ‘Subject to the Constraints’ lists $C$1 <= 2000 and $C$2 <= 480 as the full constraints
 ‘Options’ allows for non-negativity to be stipulated by checking ‘Assume Non-Negative’.

6.5 Monte Carlo Simulation

We are forever bound by uncertainty. To study, or not to study? To invest, or not to invest? To
marry, or not to marry? Other examples are:

 what will be the profits of our new product?


 what are the best logistics for our new supply chain?

Often, several different system configurations are available. For example, from our R&D portfolio,
there are several products we could release, but we have to select just one due to our marketing
budget constraint.

Trying different solutions is expensive and/or risky. For example:

 testing different products in the market


 designing a new schedule of arrivals/departures for an airport.

Improving the system is difficult. For example:

 which is the most profitable product to be released in the market?


 what is the optimal flight schedule for an airport?

Finding a robust system design is a challenge. For example:


 a product which sells well under tough market conditions
 an airport schedule which can deal with delays and disruptions.

Monte Carlo simulation works by simulating multiple hypothetical future worlds reflecting that
the outcome variable of interest is a random variable with a probability distribution. We can then
determine the expected outcome, as well as a quantification of risk, using the usual statistics
of mean and variance. Decision-makers can then make an informed judgment about the best
course of action based on their risk appetite.

To perform a Monte Carlo simulation we proceed as follows.

1. Associate a (pseudo-)random number generator for each input variable.


2. Assign a range of random numbers for each input variable (according to some assumed
probability distribution).
3. For each input variable:
 generate a random number
 from the random number, select the respective variable value.
4. Calculate the outcome, x, and record it.
5. Repeat “3” and “4” until the desired number of iterations, N, is reached.
6. Draw a histogram of the outcomes and determine the mean and variance of the simulated
outputs:

𝑥̅ = (1/𝑁) ∑ 𝑥𝑖

and

𝑠² = (1/(𝑁 − 1)) ∑ (𝑥𝑖 − 𝑥̅)²

where both sums run over 𝑖 = 1, …, 𝑁.

The mean gives our estimate of the expected outcome, while the variance is our measure of risk.

Example:

A company wants to analyze the following investment with the following uncertain revenues and
costs given by the probability distributions below. Assume, for simplicity, that the revenues and
costs are uncorrelated.

Revenues        $50,000    $80,000    $100,000
P(Revenues)     0.2        0.4        0.4

Costs           $30,000    $60,000    $80,000
P(Costs)        0.2        0.6        0.2

By using Microsoft Excel’s ‘=RAND()’ function, we can randomly select a number equally likely to
be between 0 and 1:

Given the probability distribution for revenues, we could simulate revenues depending on the
value returned by ‘=RAND()’ according to: a random number below 0.2 gives revenues of $50,000,
from 0.2 up to 0.6 gives $80,000, and from 0.6 up to 1 gives $100,000. This ensures each revenue
value occurs with the required probabilities. Similarly, for costs: a random number below 0.2 gives
costs of $30,000, from 0.2 up to 0.8 gives $60,000, and from 0.8 up to 1 gives $80,000.

Suppose for the first simulation we had random numbers for revenues and costs of 0.7467 and
0.1672, respectively. This corresponds to:

Revenue = $100,000
Costs = $30,000
Profit = $70,000

Suppose for the second simulation we had random numbers for revenues and costs of 0.0384
and 0.9056, respectively. This corresponds to:

Revenue = $50,000
Costs = $80,000
Profit = −$30,000 (a loss)

This process would then be replicated ideally thousands of times (N = the number of simulations),
resulting in the calculation of the mean and variance of profit to estimate the expected outcome
and risk.
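
The whole simulation is straightforward to script; a minimal Python sketch with numpy (the seed and N are arbitrary choices for this illustration):

import numpy as np

rng = np.random.default_rng(42)
N = 10_000                                                          # number of simulations

revenues = rng.choice([50_000, 80_000, 100_000], size=N, p=[0.2, 0.4, 0.4])
costs = rng.choice([30_000, 60_000, 80_000], size=N, p=[0.2, 0.6, 0.2])
profit = revenues - costs

# the sample mean estimates the expected outcome (theoretically 82,000 - 58,000 = 24,000),
# and the sample variance is the measure of risk
print(profit.mean(), profit.var(ddof=1))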

Of course, the quality of these expected outcome and risk estimates is only as good as the model
used. What if the true probability distributions were different? What if revenues and costs were
correlated? What if other factors affected profit as well?

Clearly, in practice we would wish to conduct a sensitivity analysis to see how sensitive (or
robust) the distribution of the outcome variable is to such issues.

ACTIVITIES/ASSESSMENT:

Direction: Choose the letter of the correct answer.

1. In a decision tree, chance nodes are represented by squares.


a. True
b. False

2. Ignoring risk, in decision tree analysis we choose the option which will maximize the expected
profit.
a. True
b. False

3. A decision-maker who prefers the certain outcome of $x over a risky project with a mean (EMV)
of $x is:
a. risk-averse
b. risk-neutral
c. risk-loving

4. In linear regression analysis, the independent variable is typically denoted by:


a. x
b. y
c. 𝜀

5. Stock returns are defined as:


a. (current price − previous price) / previous price
b. (previous price − current price) / current price
c. (current price − previous price) / current price

6. If a stock has beta of 1.5, then:


a. if the market index ↑ by 1.5%, then the stock ↑ by 1.5%
b. if the market index ↑ by 1.5%, then the stock ↑ by 1%
c. if the market index ↑ by 1%, then the stock ↑ by 1.5%

7. Constrained optimization problems are best modelled using:


a. decision tree analysis
b. linear regression
c. linear programming

8. In a standard product mix problem, non-negativity constraints are required for the decision
variables.
a. True
b. False

9. The results of a Monte Carlo simulation are never sensitive to the choice of input probability
distribution(s).
a. True
b. False

10. In Monte Carlo simulation, the expected outcome can be calculated using the:
a. sample mean of the simulated values of the outcome variable
b. sample variance of the simulated values of the outcome variable

11. In a decision tree, decision nodes are represented by circles.


a. True
b. False

12. Ignoring risk, in decision tree analysis we choose the option which will minimize the expected
payoff.
a. True
b. False

13. When the difference between prices is small, stock returns are approximately equal to:

a. log(current price / previous price)
b. log(previous price / current price)
c. log((current price − previous price) / previous price)

14. Linear programming models allow us to analyze:


a. constrained optimization problems
b. unconstrained optimization problems

15. In Monte Carlo simulation, the quantification of risks can be calculated using the:
a. sample mean of the simulated values of the outcome variable.
b. sample variance of the simulated values of the outcome variable.

Module References:

Abdey, James (2018). Probability and Statistics. https://www.coursera.org/

Cowan, Glen (1998). Statistical Data Analysis. http://www.sherrytowers.com/cowan_statistical_data_analysis.pdf

De Smith, Michael (2018). Statistical Analysis Handbook. https://www.statsref.com/StatsRefSample.pdf

Skuza, Pawel (2013). Introduction to Statistical Analysis. https://ienrol.flinders.edu.au/uploads/Introduction_Statistical_Analysis.pdf
