
TESTING OF HYPOTHESES

Kaustav Banerjee
Decision Sciences Area, IIM Lucknow

1 Decision under uncertainty


Application 1: Studies show that 40% or more of email is spam, and other reports claim that
nearly 95% of all email is spam, causing businesses billions of dollars in lost productivity. To
hold back the flood, companies buy software that filters out junk before it reaches in-boxes.
An office currently uses free software that reduces the amount of spam to 24% of the incoming
messages. To remove more, the office manager is evaluating a commercial product. The vendor
claims it works better than free software and will not lose any valid messages. To demonstrate,
the vendor has offered to apply filtering software to email arriving at the office. The office
manager plans to use this trial to judge whether the commercial software is cost effective. The
vendor licenses this software for $15,000 a year. At this price, the accounting department says
the software will pay for itself by improving productivity if it reduces the level of spam to less
than 20%. How well must the commercial software perform in the trial to convince the office
manager to pay for the license?

All the emails that would pass through the filtering software, if installed, constitute the
population of interest here. Some fraction (p) of these emails will be spam. Purchasing the
software will be profitable if p < 0.20. If p ≥ 0.20, the software will not remove enough spam
to compensate for the fee. The two conjectures regarding p are complementary, and only one
of them can be true: purchasing the software is either cost effective or not.
Now assess the risks involved in this decision problem. During the trial, the manager could
check the sample proportion of spams say p̄, a natural estimator of the true proportion of spams
p. The sample proportion p̄ could be close to or far from p. So, looking at p̄, the manager
could decide to buy (p̄ < 0.20) the software, while actually it is ineffective (p ≥ 0.20): that’s
incurring a loss. Alternatively, the manager may fail to realize (p̄ ≥ 0.20) that the software is
quite effective (p < 0.20): that’s a lost opportunity. Can she avoid both the risks?
It’s easy to avoid the risk of buying ineffective software: she could simply stop buying
any software, however effective. But that inflates the risk of not buying effective software.
To avoid the risk of not buying effective software, she could buy any filtering software under
the sun. But that inflates the risk of buying ineffective software! Clearly, the risks work against
each other and one should prioritize them. Which one of these could be more serious for the
company?
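
To see how large these two risks can be from sampling fluctuation alone, here is a minimal sketch for the naive rule "buy if p̄ < 0.20". The trial size n = 100, the true proportions tried, and the function name `prob_buy_naive` are assumptions chosen for illustration; the number of spams in the trial is modelled as binomial.

```python
from math import comb

def prob_buy_naive(p_true: float, n: int = 100, cutoff: float = 0.20) -> float:
    """P(sample proportion < cutoff) when the spam count is Binomial(n, p_true)."""
    # The naive rule buys whenever the observed proportion k/n falls below the cutoff.
    buying_counts = [k for k in range(n + 1) if k / n < cutoff]
    return sum(comb(n, k) * p_true**k * (1 - p_true) ** (n - k) for k in buying_counts)

# Risk of buying ineffective software (true p at or above 20%)
for p in (0.20, 0.22, 0.25):
    print(f"true p = {p:.2f}: P(buy) = {prob_buy_naive(p):.3f}")

# Risk of missing effective software (true p below 20%)
for p in (0.15, 0.18):
    print(f"true p = {p:.2f}: P(do not buy) = {1 - prob_buy_naive(p):.3f}")
```

Under these assumptions, neither risk is negligible when the true p sits close to the break-even value, which is why the two risks have to be prioritized rather than eliminated.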
Next we formulate this problem statistically.

2 Statistical formulation
Notice that our parameter of interest p (proportion of spams not filtered by the software) is a
feature of the population (all the emails the office would receive via the commercial software,
if purchased). Regarding p there are two complementary views/hypotheses/conjectures, p ≥
0.20 and p < 0.20, formulated as the null (H0 ) and alternative (Ha ) hypotheses. The null
hypothesis usually represents the view that is content with the existing state of affairs and
thus discourages any change. The alternative hypothesis represents the view that challenges the
existing state of affairs and calls for a change.

What do you think? In the software example, which one should be regarded
as the null hypothesis?

Alternative hypotheses can be classified as one-sided or two-sided. To fix the
idea, consider the following process control exercise.
Application 2: A manufacturing process is supposed to produce capsules having 400 mg
of a chemical. Since variation in a manufacturing process is inherent, the contents of
different capsules would vary. The regulatory authority makes it mandatory that the content of
every capsule should be between 398 and 402 mg. To ensure it, the mean and standard deviation
of the contents produced by the manufacturing process are set at 400 mg (µ) and 0.5 mg (σ).
The production supervisor knows from experience that the standard deviation of the process
rarely changes. However, he feels that continuous monitoring of the process is necessary
for checking the stability of the mean of the process. A consultant suggested that he implement
the following procedure: in every hour during a shift a sample of 100 capsules is to be
selected, and if the average content of the sample falls below 399.90 or above 400.10, stop
the process and hunt for the trouble. So we need to assess if µ is in-control or out-of-control.

[Figure: plot of samples 1 to 10 (x-axis: Sample) against limits from 398 to 402 (y-axis: Limits).]

What do you think? In the process control example, what are the risks
involved? What should be the null hypothesis? What type of alternative
hypothesis is that? What type of alternative was set in the software example?

In general, how do we prioritize the risks involved in a decision problem? Consider the following
analogy from the courtroom. If you are charged with a crime, the court believes you are innocent
until proven guilty. Treating innocence as the null hypothesis, rejecting a true null hypothesis
is equivalent to convicting an innocent person. Similarly, failing to reject a false null
hypothesis is equivalent to letting a guilty person go free. Lawmakers hold that convicting an
innocent person is more serious than letting a guilty person go free.

                                 True State of Affairs
Sample-based Decision      H0 True             H0 False
Reject H0                  Type I Error        Correct Decision
Fail to Reject H0          Correct Decision    Type II Error

Analogously, we regard rejecting a true null hypothesis (Type I error) as much more severe
than not rejecting a false null hypothesis (Type II error). So we don’t want the probability of
committing a type I error to exceed a pre-assigned (prior to sampling/data collection) threshold
value α, known as the level of significance and usually set at 1%, 5% or 10%. This gives
a very important criterion to evaluate any decision rule: whether it keeps the probability of
type I error at or below the nominated level of significance, i.e. P (Type I error) ≤ α.

Do it yourself : Identify the risks involved in the following decision problems
and formulate a suitable null and alternative hypothesis in each case.
1. A person meets an interview board for a job opening. The board has to
decide whether to hire the person.
2. Research & development wing of a pharmaceutical company develops a
new drug. The company has to decide whether to file for a patent.

3 Statistical decision rule


Recalling the software problem, we need to test H0 : p ≥ 0.20 against Ha : p < 0.20. The
manager has to take a call on the basis of p̄, sample fraction of spams received via the software,
during the trial period. The catch is, though p̄ is an unbiased estimator of p, the observed
value of p̄, say p̄obs , obtained on the basis of a sample of emails, could be far off from p, due
to sampling fluctuation. Therefore it is imperative that while looking at an estimate p̄obs , one
should assess how likely the estimate is, with reference to the sampling distribution of p̄.
Just as the jury in a courtroom believes an accused is innocent until proven guilty, we believe
that the null hypothesis is true until proven otherwise. Hence we consider the sampling
distribution of p̄, assuming that H0 is true. By the CLT, we could write:

\[ \bar{p} \;\overset{H_0}{\sim}\; N\!\left(p,\ \sqrt{p(1-p)/n}\right)\Big|_{p=0.20} \]

Notice that, we choose to consider the sampling distribution of p̄ at the break-even value 20%
and not at some value larger than 20% implied by H0 . This is because if p̄ is around 20%, we
feel most uncertain in our decision. However, as p̄ deviates from 20% in either direction, it is
much easier to decide accordingly.
Having adopted the modified null hypothesis H0 : p = 0.20, how likely is an estimate p̄obs
with reference to the sampling distribution of p̄ as above? This is assessed by checking if

P (p̄ < p̄obs |H0 : p = 0.20) = p-value < α.

Notice, for an estimate p̄obs , if the associated p-value falls short of α, it is very unlikely
that such an estimate would be observed under the null hypothesis H0 : p = 0.20. In short,
such an estimate is strong evidence against H0 . Consequently, we reject the null hypothesis.

However, if the p-value exceeds α, the evidence against the null is not quite strong, and we
fail to reject the null hypothesis, as per the data. Observe that this p-value criterion can
also be formulated as follows: to test H0 : p = 0.20 against Ha : p < 0.20,

Reject H0 if

\[ \frac{\bar{p}_{obs} - 0.20}{\sqrt{(0.20 \times 0.80)/n}} < -Z_{\alpha} \]

Do not reject H0 otherwise.

[Figure: probability density function of the sampling distribution of p̄ under H0 : p = 0.20 (x-axis: p̄), with the cut-off 0.2 − Zα × {(0.2 × 0.8)/n}^0.5 and the centre 0.2 marked, and area (1 − α) to the right of the cut-off.]

Considering the graph of the sampling distribution of p̄, can you figure out the reason behind
the p-value criterion, or the equivalent criterion as above? Do you see how the decision would
vary according to the location of p̄obs on the x-axis of the graph?
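
The two equivalent criteria above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `one_sided_proportion_test` and the trial outcome p̄obs = 0.13 are hypothetical, and the normal approximation from the CLT is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_test(p_obs, n, p0=0.20, alpha=0.05):
    """Lower-tailed test of H0: p = p0 against Ha: p < p0 (normal approximation)."""
    se = sqrt(p0 * (1 - p0) / n)               # standard error of p_bar under H0
    z = (p_obs - p0) / se                      # standardized test statistic
    p_value = NormalDist().cdf(z)              # P(p_bar < p_obs | H0: p = p0)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha point of N(0, 1)
    return {
        "z": z,
        "p_value": p_value,
        "reject_by_p_value": p_value < alpha,  # p-value criterion
        "reject_by_z": z < -z_alpha,           # critical-value criterion
    }

# Hypothetical trial outcome: 13% spam among n = 100 emails
print(one_sided_proportion_test(p_obs=0.13, n=100))
```

Both criteria always agree because the normal cumulative distribution function is increasing: p-value < α holds exactly when z < −Zα.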

Do it yourself : Answer the following questions.


1. Taking α = 5% and n = 100, what is the threshold value for p̄obs , which
should convince the manager to buy the software?
2. The set {p̄ : p̄ < 0.2 − Zα √((0.2 × 0.8)/n)} represents the critical region
for the above testing problem. Do you see why it is called ‘critical’ ?
3. Keeping the sample size n fixed, as α is increased from 1% to 10%, how
would the critical region change? What would be its impact on the
performance of the decision rule?
4. Keeping α fixed at some value, if the sample size n is increased indefinitely,
how would the critical region change? What would be its impact
on the performance of the decision rule?
5. Taking α = 5% and n = 100, P (p̄ < 0.12|H0 : p = 0.20) = 0.023. Under
the same set-up, P (p̄ < 0.12|H0 : p = 0.23) = 0.003. Do you see why?

Application 2: The problem is to assess if the process average µ is in-control or out-of-control.


So we need to test H0 : µ = 400 against Ha : µ ≠ 400. The decision depends on X̄: the sample
average weight of the capsules. As before, we start with believing in H0 : µ = 400 and try to
assess how likely the observed value x̄obs is, with reference to the following distribution

\[ \bar{X} \;\overset{H_0}{\sim}\; N\!\left(\mu,\ \sigma/\sqrt{n}\right)\Big|_{\mu=400;\ \sigma=0.5} \]

and the likeliness is assessed by checking if

\[ 2 \times \min\left\{ P(\bar{X} > \bar{x}_{obs} \mid H_0 : \mu = 400),\ P(\bar{X} < \bar{x}_{obs} \mid H_0 : \mu = 400) \right\} = p\text{-value} < \alpha. \]

Therefore to test H0 : µ = 400 against Ha : µ ≠ 400, we reject H0 if this p-value falls short
of α, and fail to reject H0 otherwise. Alternatively, we could say

Reject H0 if

\[ \frac{|\bar{x}_{obs} - 400|}{0.5/\sqrt{n}} > Z_{\alpha/2} \]

Do not reject H0 otherwise.

[Figure: probability density function of the sampling distribution of X̄ under H0 : µ = 400 (x-axis: X̄), with cut-offs 400 − Zα/2 × (σ/√n) and 400 + Zα/2 × (σ/√n) on either side of 400, and area (1 − α) between them.]

Considering the graph of the sampling distribution of X̄, can you figure out the reason behind
the p-value criterion, or the equivalent criterion as above? Do you see how the decision would
vary according to the location of x̄obs on the x-axis of the graph? Do you see why there are
two cut-off points at the extremes, unlike before?
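
The two-sided rule can be sketched the same way. Again this is only an illustration under stated assumptions: the function name `two_sided_mean_test` and the observed sample means 400.12 and 400.05 are hypothetical, σ = 0.5 is treated as known, and X̄ is taken to be normal as in the display above.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_mean_test(x_obs, n, mu0=400.0, sigma=0.5, alpha=0.05):
    """Two-tailed test of H0: mu = mu0 against Ha: mu != mu0 with known sigma."""
    se = sigma / sqrt(n)                       # standard error of the sample mean
    sampling = NormalDist(mu0, se)             # sampling distribution of X_bar under H0
    lower_tail = sampling.cdf(x_obs)           # P(X_bar < x_obs | H0)
    upper_tail = 1 - lower_tail                # P(X_bar > x_obs | H0)
    p_value = 2 * min(lower_tail, upper_tail)  # two-sided p-value
    z = (x_obs - mu0) / se
    z_half = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    return {"z": z, "p_value": p_value,
            "reject_by_p_value": p_value < alpha,
            "reject_by_z": abs(z) > z_half}

# Hypothetical hourly sample means from samples of 100 capsules
for x_obs in (400.12, 400.05):
    print(x_obs, two_sided_mean_test(x_obs, n=100))
```

Doubling the smaller tail probability is what makes the rule sensitive to drift of the process mean in either direction.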

Do it yourself : Answer the following questions.


1. For α = 5% and n = 100, what are the thresholds for x̄obs , to convince
the supervisor to stop the process and hunt for the trouble?
2. Obtain the critical region for the above testing problem.
3. How do you interpret the probability of type I error in this case?
4. Keeping the sample size n fixed, as α is increased from 1% to 10%, how
would the critical region change? What would be its impact on the
performance of the decision rule?
5. Keeping α fixed at some value, if the sample size n is increased indefinitely,
how would the critical region change? What would be its impact
on the performance of the decision rule?

Next, continuing with the software example, what is the probability the manager will purchase
a good (p = 0.15) software? Alternatively, what is the probability of rejecting the null H0 :
p = 0.20, when it is indeed false? To see this, taking α = 5% and n = 100, we obtain

\[ P(\bar{p} < 0.1342 \mid p = 0.15) = P\!\left[ Z < \frac{0.1342 - p}{\sqrt{p(1-p)/n}} \;\Big|\; p = 0.15 \right] \approx 0.33 \]

So compared to the chance of purchasing bad software [P (type I error) ≤ 5%], the probability
of purchasing good (p = 0.15) software is 33%. So our decision rule is designed to take the
correct decision more often than it takes the wrong decision. This is the power of the decision
rule, expressed as P (Rejecting H0 | H0 is false) = 1 − β, where β represents the probability of
type II error: the probability of not rejecting a null hypothesis which is indeed false.
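
The cut-off 0.1342 and the power at p = 0.15 can be reproduced with a short calculation. The sketch below assumes the normal approximation used throughout; the function names `critical_value` and `power` are illustrative, not from the notes.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(p0=0.20, n=100, alpha=0.05):
    """Largest p_bar (in the limit) that still leads to rejecting H0: p = p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return p0 - z_alpha * sqrt(p0 * (1 - p0) / n)

def power(p_true, p0=0.20, n=100, alpha=0.05):
    """P(reject H0) when the true spam proportion is p_true."""
    cutoff = critical_value(p0, n, alpha)
    se_true = sqrt(p_true * (1 - p_true) / n)  # standard error at the true p
    return NormalDist().cdf((cutoff - p_true) / se_true)

print(f"cut-off: {critical_value():.4f}")
print(f"power at p = 0.15: {power(0.15):.3f}")
print(f"type II error at p = 0.15: {1 - power(0.15):.3f}")
```

Note that the standard error inside `power` is evaluated at the true p, not at 0.20, because the question is how p̄ behaves when the null hypothesis is false.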

[Figure: purchase probability (y-axis, 0 to 1.0) against the true proportion of spam (x-axis, 0.00 to 0.20); the curve is drawn as a dashed line.]

This graph illustrates the probability of purchasing a software against the true proportion of
spams it allows to enter the mailbox. These probabilities are obtained using the decision rule
formulated before, for α = 5% and n = 100. What do you learn from the graph? If α is changed
to 1% or 10%, how does it impact the graph? What would be the consequence of this change in
terms of the decision problem?

The function represented by the dashed line is called the power function of the decision rule.
Check that, in the present context, the power function is represented by

\[ F\!\left( \frac{0.1342 - p}{\sqrt{p(1-p)/n}} \right) \]

where F denotes lower-tail probability (cumulative distribution function) of a standard normal
variate and p stands for the true proportion of spam allowed by the software.
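
One way to check the power function is to tabulate it on a grid of true proportions and compare it with the exact binomial probability of landing below the cut-off. This is a hedged sketch: the grid of p values and the names `power_normal` and `power_exact` are assumptions, and the exact version treats the spam count among n = 100 emails as binomial.

```python
from math import comb, sqrt
from statistics import NormalDist

N = 100          # trial size used in the text
CUTOFF = 0.1342  # critical value from the text (alpha = 5%, n = 100)

def power_normal(p):
    """Power function F((0.1342 - p) / sqrt(p(1 - p)/n)) from the text."""
    return NormalDist().cdf((CUTOFF - p) / sqrt(p * (1 - p) / N))

def power_exact(p):
    """Exact P(p_bar < cutoff) when the spam count is Binomial(N, p)."""
    return sum(comb(N, k) * p**k * (1 - p) ** (N - k)
               for k in range(N + 1) if k / N < CUTOFF)

for p in (0.05, 0.10, 0.15, 0.20):
    print(f"p = {p:.2f}  normal approx = {power_normal(p):.3f}  "
          f"exact binomial = {power_exact(p):.3f}")
```

The two columns need not match exactly, since p̄ can only take values in steps of 1/n while the normal curve is continuous.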
Think about it
1. A chemical firm has been accused of polluting the local river system. State laws require
the accuser to prove the polluting by a statistical analysis of water samples. Is the
chemical firm worried about a type I or a type II error?
2. The research labs of a corporation occasionally produce breakthroughs that can lead
to multibillion-dollar blockbuster products. Should the managers of the labs be more
worried about type I or type II errors?
3. To demonstrate that a planned commercial will be cost effective, at least 60% of those
watching the programming need to see the commercial (rather than switching stations
or using a digital recorder to skip the commercial). Typically, half of viewers watch the
commercials. It is possible to measure the behavior of viewers in 1,000,000 households,
but is such a large sample needed?
4. A jury of 12 begins with the premise that the accused is innocent. Assume that these
12 jurors were chosen from a large population, such as voters. Unless the jury votes
unanimously for conviction, the accused is set free.
(a) Evidence in the trial of an innocent suspect is enough to convince half of all jurors
in the population that the suspect is guilty. What is the probability that a jury
convicts an innocent suspect?
(b) What type of error (type I or type II) is committed by the jury in part (a)?
(c) Evidence in the trial of a guilty suspect is enough to convince 95% of all jurors in
the population that the suspect is guilty. What is the probability that a jury fails
to convict the guilty suspect?
(d) What type of error (type I or type II) is committed by the jury in part (c)?
