
Green Belt Six Sigma Training

Statistical Testing: Part 1


• Introduction
• Hypothesis Testing
• P-values vs. Significance Testing
• One Sample t-Test
• Equivalence Testing
• Skills Practice
Introduction
Recall that two popular statistical methods used to draw inferences about population characteristics from sample data are: Statistical Intervals and Hypothesis Tests.
• Statistical Intervals are used to bound the uncertainty we have in the true value of population characteristics such as the true population mean µ.
• In Hypothesis Testing, we construct hypotheses about the true value of unknown population parameters. Then, we collect data and, using statistical analysis, quantify the risks of incorrect decisions relative to the hypotheses.
In this section we will discuss hypothesis testing and the related topic of equivalence testing.
Hypothesis Testing
Recall that a confidence interval is constructed from sample data and
provides a set of plausible values for the true (and unknown) mean
of the population from which the sample is drawn.
In hypothesis testing, we postulate values for the parameter, and then
use data to see if these values are plausible.
In other words, we construct hypotheses about the unknown parameter, collect data, and quantify the risks of incorrect decisions relative to the hypotheses.
In this section we will see how to compare a sample mean with a
hypothesized population mean. The hypothesized mean is often a
standard of performance, such as a target, for a process.
Hypothesis testing can also be used in a broad range of other
situations.
Hypothesis Testing
There are actually two competing and incoherent approaches to hypothesis testing that regrettably have been conflated for decades.
The end result is that most textbooks and training in hypothesis testing present an admixture of the two incompatible approaches, resulting in statements that are simply wrong.
Method 1 was developed around a century ago by Sir R. A. Fisher, the founder of modern statistics; read “The Lady Tasting Tea” by David Salsburg for details.
Method 2 was developed in the 1930s by Jerzy Neyman.
The two men vehemently disagreed on their different approaches to hypothesis testing, and their differences were never resolved.
Unfortunately, textbooks have simply conflated both approaches into one.
Hypothesis Testing
The concept of hypothesis testing predates Fisher by decades, but he was the first to formalize an approach to testing.
Fisher’s method was developed for the analysis of designed
experiments, which he pioneered in the early 1920s.
The method begins by stating a null hypothesis, H0, which is a statement that null (zero) experimental effects occurred in the study.
The null hypothesis is really a strawman hypothesis that the
researcher hopes to knock down with the experimental data.
To estimate the evidence against H0 in the experimental data, Fisher developed what is known as a p-value.
The p-value is the probability of the observed data occurring if H0 were actually true (this is a slight simplification for now).
Hypothesis Testing
Neyman’s approach required two hypotheses: a hypothesis under test (often mistakenly called the null in textbooks) and an alternative that is accepted if the hypothesis under test is rejected as false after data have been collected.
The hypothesis under test (we simply use the term null hypothesis) is denoted H0, and the alternative hypothesis is denoted HA.
Fisher strongly opposed the idea of an alternative hypothesis.
The null hypothesis is still a strawman hypothesis to hopefully be knocked down with the data; the alternative is accepted if the null is rejected.
In both approaches, the hypotheses are almost always stated in
terms of population parameters such as the theoretical mean µ.
Hypothesis Testing
Whereas Fisher computed a probability called a p-value to judge the validity of the null hypothesis, Neyman rejected such probabilities as “silly.”
Instead, in advance of the test, Neyman stated a threshold value for a sample statistic used in the test.
If the test statistic exceeded the threshold, then the null hypothesis
was rejected; we show the details later in the notes.
The threshold value was based upon what Neyman called the
significance of the test, which is the long run frequency with
which one might incorrectly reject a true null hypothesis.
No probability statement is made and one can make no statement
of validity for a single instance of the hypothesis test.
Hypothesis Testing
Fisher rejected the idea of significance as useless, because it only made sense if one were to make a large number of hypothesis tests on exactly the same physical system.
This rarely is the case in actual practice.
Unfortunately, textbooks refer to p-values as observed significance levels, and they are no such thing.
The p-value is a probability statement for a single application of a hypothesis test; it does not apply to a long sequence of tests.
Significance is the long run frequency of incorrectly rejecting a
true null hypothesis over the performance of many tests.
The two methods are incoherent and cannot be easily reconciled.
Hypothesis Testing
Common practice and teachings in hypothesis testing also promulgate a number of fallacies.
1. If you fail to reject the null hypothesis then it is true and the observed results are due to random chance.
False: The null hypothesis is never proved true, and failure to reject it does not imply the results are random or due to chance.
2. Rejecting the null hypothesis implies the alternative is true.
False: Rejecting a null hypothesis says nothing about the validity of an alternative; the tester jumps to that conclusion.


Hypothesis Testing
In any case, the general procedure for hypothesis testing is:
1. The null hypothesis is assumed true.
2. Data is collected.
3. A summary measure, called a test statistic, is computed.
4. Based on the value of the test statistic, a decision is made
either to reject H0 or not to reject it. This decision is based
on whether the data provides evidence against H0.
5. Rejection of H0 is either based on a p-value or a significance
level but not both.
6. If H0 is rejected, HA is automatically accepted (remember
Fisher opposed use of an alternative).
Hypothesis Testing
State H0 and HA.
Obtain data.
Is this data likely if H0 is true?
• No: Reject H0. Conclude that it is false. Accept HA.
• Yes: Don’t reject H0. It may be false, but we don’t have evidence to reject it.


One Sample t-Test
In this section, we will discuss hypothesis testing in the context of
testing for the theoretical mean of a single normal population.
This is called a One Sample t-Test. The term “t-Test” comes from the fact that the test is based on Student’s t probability distribution, which we introduced in the Statistical Intervals section.
As we did with confidence intervals, we will skip most of the mathematical details, given that these tests are almost always performed by a computer and hand computation is more or less only a homework exercise.



One Sample t-Test
We will use fluorescent lighting data to illustrate hypothesis testing
(file Lumens.jmp).
We are interested in determining if the mean lumen value for a lot
of fluorescent lamps is close to the specification target of 3500
lumens.
A technician has measured the intensities for each of 7 suitably
aged lamps randomly sampled from the specified lot:
3200, 3400, 4000, 3700, 2500, 3400, and 3700.
For this data, the calculated sample statistics are:
X̄ = 3414.3, S = 481.1

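As a quick check, these summary statistics are easy to reproduce in a few lines of code; the sketch below (Python, purely illustrative of the arithmetic) recovers both values:

    import math

    # Measured lumen values for the 7 sampled lamps
    lumens = [3200, 3400, 4000, 3700, 2500, 3400, 3700]

    n = len(lumens)
    xbar = sum(lumens) / n                                         # sample mean
    s = math.sqrt(sum((x - xbar) ** 2 for x in lumens) / (n - 1))  # sample std dev, n-1 df
    print(f"X-bar = {xbar:.1f}, S = {s:.1f}")                      # X-bar = 3414.3, S = 481.1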


One Sample t-Test
We are attempting to address whether or not the distance between our
sample average and 3500 is sufficient to indicate that the
theoretical mean of the lot is different from the target of 3500.
The test can be set up as a one-tailed or a two-tailed test.
The null and alternative hypotheses can be set up as follows:
For a two-tailed test: H0: µ = 3500  HA: µ ≠ 3500
For a one-tailed test: H0: µ ≤ 3500  HA: µ > 3500, or
H0: µ ≥ 3500  HA: µ < 3500
For this example, we will be conducting a two-tailed test.
Note that values of the average that are either much larger or smaller
than 3500 would cause us to doubt the validity of Ho.
One Sample t-Test
Before proceeding we need to discuss a recent statement (3/2/2016)
by the American Statistical Association on the use of p-values as a
basis for evaluating the results of hypothesis tests.
“Practices that reduce data analysis or scientific inference to
mechanical “bright-line” rules (such as “p < 0.05”) for
justifying scientific claims or conclusions can lead to erroneous
beliefs and poor decision-making. A conclusion does not
immediately become “true” on one side of the divide and “false”
on the other. Researchers should bring many contextual factors
into play to derive scientific inferences, including the design of a
study, the quality of the measurements, the external evidence for
the phenomenon under study, and the validity of assumptions that
underlie the data analysis.”



One Sample t-Test
Statistical software usually provides a p-value for hypothesis tests,
as a measure of the strength of evidence in the data against Ho.
Technically, a p-value is the probability of obtaining a test statistic value at least as extreme as the one observed, under the assumption that the null hypothesis is true.
It is a subtlety, but the p-value measures the probability of data that
did not occur, but might have if more data were collected.
A small p-value means that the data (not observed) are unlikely for
the given null hypothesis, and represents evidence against Ho.
Small p-values indicate that Ho should be rejected; however, rejecting Ho does not prove that the alternative is true!
JMP often indicates a p-value in terms of the distribution of the test
statistic (for example, Prob > |t|).
One Sample t-Test
In a One-Sample t-Test, for example, a p-value is the probability of observing a sample mean as extreme as the one we observed, if the null hypothesis is true.
Small p-values are considered evidence against the null hypothesis and would lead us to reject the null hypothesis.
The smaller the p-value, the stronger the evidence.


Background Details: One Sample t-Test
The one sample t-Test is based on Student’s t distribution. The ratio

    t = (X̄ − µ) / (S / √n)

has a Student’s t distribution with n − 1 degrees of freedom, where S is the estimate of σ, and n − 1 are the degrees of freedom for S.
So, under our null hypothesis H0: µ = 3500, the quantity (X̄ − 3500) / (S / √n) follows a Student’s t distribution with 6 df.
This quantity, calculated for our sample, is called the test statistic:

    t = (X̄ − µ0) / (S / √n) = (3414.3 − 3500) / (481.1 / √7) = −0.47
Background Details: One Sample t-Test
Notice that the numerator of the ratio is the difference between the sample average and the hypothesized mean; this is a measure of signal that the null hypothesis may be invalid.
Large differences in the numerator would indicate evidence that the null hypothesis is invalid.
The denominator is the standard error of the mean, the error in our estimate of the true mean; this is a measure of noise in the data.
The ratio is therefore a type of signal-to-noise ratio based on the assumed null hypothesis.
The associated p-value is the probability of getting an even larger ratio in absolute value if the null hypothesis is true:

    p-value = P(t ≤ −0.47) + P(t ≥ 0.47) = 0.65
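For readers who want to verify these numbers outside of JMP, any statistics package with a one-sample t-Test will reproduce them; a minimal sketch using Python’s scipy.stats (an illustrative cross-check, not part of the JMP workflow):

    import numpy as np
    from scipy import stats

    lumens = np.array([3200, 3400, 4000, 3700, 2500, 3400, 3700])

    # Two-tailed one sample t-Test of H0: mu = 3500
    t_stat, p_value = stats.ttest_1samp(lumens, popmean=3500)
    print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")  # t = -0.47, p-value = 0.6540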
Background Details: One Sample t-Test
For Neyman significance testing, we fix an error rate in advance; that is, a rate we will accept for falsely rejecting a true null hypothesis.
For the present case n = 7, so the degrees of freedom are n − 1 = 6; with error rate α = 0.05, a Student’s t curve with 6 dfs has its lower 2.5th percentile at −2.447 and its upper 97.5th percentile at 2.447.
Therefore, if our test statistic falls beyond either of these percentiles, that is, if its absolute value exceeds 2.447, then we reject the null hypothesis H0.
No further inference is made; you either reject the null or fail to reject the null, and there is no statement of probability whatsoever.
With α = 0.05 we accept a risk that 1 out of 20 times we use this method it will falsely reject a true null.
The inference only applies in the long run, not for a single test.
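The ±2.447 cutoffs are simply percentiles of the t distribution with 6 df; a short sketch (again assuming scipy is available):

    from scipy import stats

    alpha, df = 0.05, 6
    t_crit = stats.t.ppf(1 - alpha / 2, df)    # upper 97.5th percentile
    print(f"Reject H0 if |t| > {t_crit:.3f}")  # Reject H0 if |t| > 2.447
    print(abs(-0.47) > t_crit)                 # False, so we do not reject H0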
Background Details: One Sample t-Test
[Plot: Student’s t density with 6 df. The rejection regions are the two tails beyond ±2.447 (“Reject H0”), with “Do Not Reject H0” in between; the observed t value of −0.47 is marked near the center.]
The two tails shown include the 5% most extreme values.
The t value of −0.47 is not within these rejection regions (< −2.447 or > 2.447), so we do not reject H0: µ = 3500.
Background Details: One Sample t-Test
We can illustrate the concept of the significance of the t-Test with a simple simulation. Suppose the theoretical mean of a process is µ = 175 and we perform 100 one sample t-Tests with H0: µ = 175.
We expect about 5% of the tests to reject at random despite H0 being true.
This illustrates the long run nature of a significance test.
[Plot: the 100 test statistics, with the two rejection regions marked.]

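A minimal version of this simulation in Python (the normal standard deviation and per-test sample size below are arbitrary choices for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    mu, sigma, n, alpha = 175, 10, 7, 0.05

    rejections = 0
    for _ in range(100):
        sample = rng.normal(mu, sigma, size=n)        # H0 is true by construction
        _, p = stats.ttest_1samp(sample, popmean=mu)
        rejections += p < alpha

    print(f"{rejections} of 100 tests falsely rejected a true H0")  # typically about 5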


One Sample t-Test Using JMP
A one sample t-Test can be performed in JMP using the Distribution platform.
In Distribution, select Test Mean under the red arrow.
Input the hypothesized value in the dialog box, and click OK.



One Sample t-Test Using JMP
Recall our null and alternative hypotheses for the lumens data:
H0: µ = 3500  HA: µ ≠ 3500
Since the p-value Prob > |t| = 0.6540 is large, we fail to reject H0.
The true mean may in fact be 3500, since the sample average is sufficiently close to the hypothesized value of 3500.



One Sample t-Test Using JMP
The Prob > |t| value is the
probability of seeing a value of
the sample average more distant
from 3500 than 3414.29.
The plot at the bottom of the panel
shows the distribution of sample
averages, assuming that H0 is true.
The plot shows the probability of
seeing a sample average farther
from 3500 than 3414.29 is large
(0.6540); the data is in agreement
with the null hypothesis.



One Sample t-Test Using JMP
We will conduct a one sample t-Test for the Aluminum Lithium
data (Al_Li.jmp).
Recall that the data are compressive strengths for a new alloy,
measured in ksi.
Let’s say that the target is 175 ksi.
Our null and alternative hypotheses are:
H0: µ = 175  HA: µ ≠ 175

Again, we will use the Distribution platform in JMP.



One Sample t-Test Using JMP
Since the p-value Prob > |t| =
0.0016 is small, we reject
H0, and conclude that the true
mean is not 175 ksi.
Note the shaded area under the t
distribution.
The total area is 0.0016.
It is not very likely we’d
observe a sample mean of
162.66 if the true mean were
175 – we have strong
evidence against the null.
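The raw Al_Li data are not reproduced in these notes, but the p-value calculation itself needs only summary statistics. In the sketch below the standard deviation s is a hypothetical placeholder (the mean, n, and hypothesized value are from the notes):

    import math
    from scipy import stats

    xbar, n, mu0 = 162.66, 80, 175.0
    s = 33.8                               # hypothetical; the actual value is not given here

    t = (xbar - mu0) / (s / math.sqrt(n))  # one sample t statistic
    p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-tailed p-value, Prob > |t|
    print(f"t = {t:.2f}, p = {p:.4f}")     # with the real data JMP reports p = 0.0016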
One Sample t-Test Using JMP
Under the red arrow in the Test Mean report is an option called
Pvalue Animation, which can be used to help us understand the
physical meaning of a p-value.
[Plot: distribution of sample averages under H0 (true mean 175); Sample Size = 80.]


One Sample t-Test Using JMP
The curve represents the distribution of all sample averages, if the true mean were at the hypothesized value (175 in this case).
The p-value, and the corresponding shaded area under the curve, represent the probability of getting a sample mean smaller than 162.6 or greater than 187.4 if the true mean were really 175.
This probability is only 0.0016.
Since this is less than 0.05, we can conclude that the true mean is not 175.
[Plot: the two shaded tail areas under the sampling distribution; Sample Size = 80.]


One Sample t-Test Using JMP
Suppose our null and
alternative hypotheses had
been:
H0: µ = 170  HA: µ ≠ 170
Do we have evidence to reject
the null hypothesis?
The results are indeterminate
in this case since the p-value
is small but not too small.



One-Tailed Tests
By default, JMP conducts the two-tailed test and the two one-tailed tests.
To test the hypothesis that the true mean is greater than 170, we would state the hypotheses as:
H0: µ = 170  HA: µ > 170
This can also be written as:
H0: µ ≤ 170  HA: µ > 170



One-Tailed Tests
The p-value corresponds to the alternative hypothesis:
HA: µ > 170
Here, we would reject Ho for a large
sample mean value relative to 170.
So, we use the p-value corresponding
to Prob > t.
Here, we would fail to reject the null
hypothesis due to the large p-value.
The null hypothesis is quite
probable given the sample mean of
162.66.



One-Tailed Tests
To test the hypothesis that the true
mean is less than 170, we would
state the hypotheses as:
H0: µ = 170  HA: µ < 170
(or, H0: µ ≥ 170  HA: µ < 170)
We reject if we have a small sample
mean value relative to 170.
So, we use the p-value
corresponding to Prob < t.
Here, we might reject Ho and conclude
that the true mean is < 170. The null
hypothesis is improbable given the
sample mean of 162.663.
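In code, the one-tailed p-values come from the same test with a direction argument; a sketch using scipy (illustrated on the lumens data, since the raw Al_Li values are not listed here; the alternative keyword assumes scipy 1.6 or later):

    import numpy as np
    from scipy import stats

    lumens = np.array([3200, 3400, 4000, 3700, 2500, 3400, 3700])

    # One-tailed tests of H0: mu = 3500
    _, p_gt = stats.ttest_1samp(lumens, 3500, alternative='greater')  # HA: mu > 3500
    _, p_lt = stats.ttest_1samp(lumens, 3500, alternative='less')     # HA: mu < 3500
    print(f"Prob > t: {p_gt:.4f}   Prob < t: {p_lt:.4f}")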
Further Comments on Hypothesis Tests
Neither Fisher nor Neyman advocated that a threshold (like 0.05) be used to evaluate test results, and both insisted that subject matter experts must weigh all of the evidence (see the ASA statement).
Notice that neither method explicitly considers the effect size
observed in the test.
Subject matter experts must ultimately decide if the observed effect
size is sufficient to warrant rejecting the null.
Significance levels and p-values are not measures of scientific
importance and should not be interpreted in that manner.
Finally, hypothesis tests are sensitive to the sample size n. For small n there is little power to detect significant effects, so H0 is rarely rejected; however, for large n the tests always reject, regardless of the size of the observed effect. This is a major problem!
Equivalence Testing
Sometimes the question of interest to an engineer or scientist is not whether the population parameter differs from some specified standard, but rather whether the population mean is close enough to a specified standard value to be considered equivalent.
Hypothesis testing is used to test for differences in the population
parameter from a specified standard value.
Equivalence testing is used to try to infer that the population parameter, the mean in our case, is within some specified neighborhood of the standard value.
If the population mean is inferred to be within that neighborhood of
the standard, then we say they are equivalent; any difference
between standard value and the theoretical mean is unimportant.
Equivalence Testing
The basic idea of equivalence testing is to reverse the null and alternative hypotheses.
Under the null we assume that the population mean is outside of the
specified neighborhood of the standard value and if we reject the
null, then we assume the mean is within that neighborhood.
A rejection of the null hypothesis results in the acceptance of an
alternative hypothesis of equivalence.
The neighborhood about the standard is selected to define a range of
values for a difference between the population mean and the
standard value that is of no practical importance – this is a subject
matter expert decision.
Note, the FDA is considering the use of equivalence testing to
determine biosimilarity between biologics and equivalent small
molecule drugs.
Equivalence Testing
Recall, when we fail to reject a null hypothesis it in no way implies
that we can accept the null hypothesis as true – we simply fail to
reject it.
The test only supplies evidence against the null in terms of the p-
values.
If our goal is to provide evidence for the null hypothesis, then we
should switch from hypothesis testing to equivalence testing.
Equivalence tests and hypothesis tests are mechanically similar ideas, using two one-sided t-Tests in our case, but with profoundly different inferences to be drawn.



Equivalence Testing
To illustrate equivalence testing, we use the lamp CRI example
discussed in the confidence interval section.
Recall the standard or target CRI value for the lamp manufacturing
process is 63.
Engineers have decided that a difference of 6 from the target CRI is
not of any practical significance. In other words if the process is
off target by no more than 6, the engineers are unconcerned.
We will use equivalence testing to determine if the manufacturing
process mean can safely be assumed equivalent to 63.
The corresponding two one-sided null hypotheses are:
H0: µ ≤ 57  HA: µ > 57
H0: µ ≥ 69  HA: µ < 69
Equivalence Testing
JMP does not perform a one-sample equivalence test.
However, one can construct such a test by doing two separate one-
sided t-Tests in the Distribution platform.
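Outside of JMP, the same construction, two one-sided t-Tests (TOST), takes only a few lines in Python; the CRI measurements are not reproduced in these notes, so the data array below is a hypothetical stand-in:

    import numpy as np
    from scipy import stats

    cri = np.array([61.8, 63.4, 62.1, 64.0, 62.7, 63.9, 61.5, 62.9])  # hypothetical data
    target, margin, alpha = 63, 6, 0.05

    # Two one-sided t-Tests against the equivalence bounds 57 and 69
    _, p_low = stats.ttest_1samp(cri, target - margin, alternative='greater')  # H0: mu <= 57
    _, p_high = stats.ttest_1samp(cri, target + margin, alternative='less')    # H0: mu >= 69

    # Equivalence is claimed only if BOTH nulls are rejected
    equivalent = max(p_low, p_high) < alpha
    print(f"p_low = {p_low:.4g}, p_high = {p_high:.4g}, equivalent: {equivalent}")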



Equivalence Testing
Below are the test results from the script.
If both null hypotheses are rejected we can safely assume that the
theoretical process mean is equivalent to 63.



Skills Practice
Let’s reconsider the file Pilot Biomolecule Titer.JMP.
Recall that we are interested in the yields for 25 randomly selected
batches.
We would like to test the hypothesis that the true average yield or
titer is 350 mg/L.
Steps:
• Write the null and alternative hypotheses.
• Perform a one sample t-Test in JMP. What can we conclude?
• Does the confidence interval lead us to draw the same conclusions
about true mean yield?
• Perform an equivalence test where a difference of ±32 is of no practical interest.
