
Test Values and p-Values

Lecture Notes #8B


OSU Statistics 2013
Summer 2017
Adam Molnar
Steps in Hypothesis Testing
① State the hypotheses.
② Design the study – appropriate test, with conditions,
including the criteria to reject or not reject.
③ Conduct study, compute test value using the data.
④ Evaluate data – make decision to reject or not reject
the null hypothesis.
⑤ Summarize the results.

 This appears on page 412. I’m using these five steps because they work in general, no matter the approach to evaluating the data.
Test Value
test value = (observed – null) / SE
 A test value is a measure of how extreme the
observed value is, when compared to the expected
value under the null hypothesis.
 The numerator is the observed value minus the null value.
 Observed: x̅ or p̂ Null: μ or p
 The denominator is a measure of error. For us, it’s the
standard error under the null hypothesis.
SE = σ / √n, substituting s for σ if needed
 Test values are labeled Z or t in Bluman’s book,
sometimes Z* or t*.
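The formula above can be sketched in a few lines of Python. This is only an illustration: the function names `standard_error` and `compute_test_value` and the numbers (SD 15, n = 25, observed 106, null 100) are made up for the example and do not come from the book.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean: SD divided by sqrt(n)."""
    return sd / math.sqrt(n)


def compute_test_value(observed: float, null: float, se: float) -> float:
    """Test value = (observed - null) / SE."""
    return (observed - null) / se


# Hypothetical numbers, chosen only so the arithmetic is easy to follow.
se = standard_error(sd=15.0, n=25)                        # 15 / 5 = 3.0
z = compute_test_value(observed=106.0, null=100.0, se=se)  # 6 / 3 = 2.0
print(f"SE = {se:.4f}, test value = {z:.4f}")
```

A positive test value means the observed value is above the null value; a negative one means it is below.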
When do we reject the null hypothesis?
 We reject the null hypothesis
when test value is more
extreme than critical value.
 For two-sided alternatives,
reject when test value is
outside in either direction.
Z ≤ –Zα/2 or Z ≥ +Zα/2
t ≤ –tα/2 or t ≥ +tα/2
 For one-sided alternatives,
reject when test value is
outside in correct direction.
Less than: Z ≤ –Zα, t ≤ –tα
Greater than: Z ≥ +Zα, t ≥ +tα
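One way to write these rejection rules as code is sketched below. The function name `reject_null` and the labels "two-sided", "less", and "greater" are my own choices for the illustration; the critical value passed in is the positive number from the table.

```python
def reject_null(test_stat: float, critical: float, alternative: str) -> bool:
    """Return True when the test value falls in the rejection region.

    `critical` is the positive table value (Z or t); `alternative` is one of
    "two-sided", "less", or "greater" (labels assumed for this sketch).
    """
    if alternative == "two-sided":
        return abs(test_stat) >= critical   # outside in either direction
    if alternative == "less":
        return test_stat <= -critical       # far enough below the null
    if alternative == "greater":
        return test_stat >= critical        # far enough above the null
    raise ValueError(f"unknown alternative: {alternative!r}")


# Illustrative values only: 1.96 and 1.65 are the usual Z critical values
# for a two-sided and a one-sided test at the 0.05 level.
print(reject_null(-2.0, 1.96, "two-sided"))  # True
print(reject_null(-2.0, 1.65, "greater"))    # False: extreme, but wrong direction
```

Note the second call: a test value can be far from zero and still not reject if it is in the wrong direction for a one-sided alternative.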
Solar Flares
 Solar flares on the sun release enormous amounts of energy,
which can harm spacecraft and satellites. Forty years ago solar
flares occurred on 20% of days. A more recent study reported
that solar flares occurred on 86 of 350 randomly selected days.
Has the proportion of days with solar flares significantly
changed from 40 years ago?
 What are the null and alternative hypotheses?

H0: H1:
Solar Flares, Steps 1 and 2
 What are the null and alternative hypotheses?

H0: p = 0.20 H1: p ≠ 0.20

 Physical scientists tend to use small α values; let’s use α = 0.01.
On any exam question, I will give you the α value.
 With a two-sided alternative, and a proportion Z test,
the critical value is Z0.01/2 = 2.58. Test values more extreme than
+2.58 or –2.58 will cause us to reject the null hypothesis.
Solar Flares Steps 3 through 5
 Observed value was p̂ = 86 / 350 = 0.2457
 Remember, we believe the null hypothesis is true. Assuming the
null hypothesis is true, the substitution for σ is √(p(1 – p)).
We use p, not p̂, in hypothesis testing. This is different than
confidence intervals, but it rarely changes the outcome.
 SE = √(p(1 – p) / n) = √(0.20 × 0.80 / 350) = 0.0214
 Test value Z = (0.2457 – 0.20) / 0.0214 = 2.1355
 Step 4: Do not reject the null hypothesis because –2.58 < 2.1355
and 2.1355 < 2.58, thus the test value is not in the rejection
region.
 Step 5: Based on the data, the proportion of days with solar
flares has not significantly changed from 40 years ago.
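For anyone who wants to check steps 2 through 4 on a computer, here is a minimal sketch using only Python's standard library. The variable names are my own. It carries full precision, so it gets Z ≈ 2.138 rather than the 2.1355 obtained from rounded intermediate values, and a critical value of about 2.576 rather than the table's 2.58; the decision is the same.

```python
import math
from statistics import NormalDist

successes, n = 86, 350   # days with flares, days sampled
p_null = 0.20            # proportion from 40 years ago
alpha = 0.01

p_hat = successes / n                              # observed proportion
se = math.sqrt(p_null * (1 - p_null) / n)          # SE uses the null p, not p-hat
z = (p_hat - p_null) / se                          # test value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)       # two-sided critical value
reject = abs(z) >= z_crit

print(f"p-hat = {p_hat:.4f}, SE = {se:.4f}, Z = {z:.4f}")
print(f"critical value = {z_crit:.4f}, reject H0: {reject}")
```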
Conditions Still Matter
 Conducting hypothesis test for population mean
• A) Independent or other good random sample from large pop.
• Either B1) normally distributed population
or B2) sample size large enough, n ≥ 30
• If σ known, use Z test
• If σ unknown, substitute s and use t test with (n – 1) DF
 Conducting hypothesis test for population proportion
• Binomial conditions: fixed number of trials, only two
outcomes, independent sample, (assumed) same probability
• At least 5 successes and 5 failures
• Use proportion Z test, substitute √(p(1 – p)) for σ
Iraq Identification
 In 2006, the population of 18-24 year olds had a 37%
identification rate of Iraq when looking at the Middle
East map on the next slide.
 Say my fall 2015 class was a “good” representative
sample of OSU students.
 The null hypothesis is that OSU students are no
different from all 18-24 year olds, p = 0.37.
 What should be the alternative hypothesis?
• The class was Better (one-sided)?
• Worse (one-sided)?
• Different (two-sided)?
Graph from National Geographic
The Optimistic Alternative
 I chose the greater than alternative.
① H0: p = 0.37
H1: p > 0.37
 Let α = 0.01.
 This is a proportion Z test.
② We reject when Z ≥ Zα.
We look up Z0.01 = 2.33,
so we reject the null
when Z ≥ +2.33.
 Out of 67 students, 40 identified Iraq (#10!)
 Binomial assumptions all hold, and there are
at least 5 successes and 5 failures.
Conducting the Test
 From the sample, observed p̂ = 40/67 = 0.5970
 Null value is p = 0.37
 Standard error is √(p(1 – p) / n) = √(0.37(1 – 0.37) / 67) = 0.0590
③ Test value is Z* = (0.5970 – 0.37) / 0.0590 = 3.8475
④ Since 3.8475 > 2.33, we reject the null hypothesis.
⑤ Based on the sample, we conclude that the
population proportion of Oklahoma State students
who can identify Iraq is statistically significantly
more than 37%.
Another example: MCAT Scores
 Scores on the new Medical College Admissions Test
(MCAT) are designed to have an approximately normal
distribution, with mean μ = 500 and SD σ = 10.
 A study group of 6 students took the test, with an
average score of 506. They want to know if they did
“significantly better than average at the 95% level.”
This isn’t good wording, but let’s try a test.
① H0: group mean = 500 H1: group mean > 500
② With SD σ known, this will be a Z test.
α = 0.05 since they asked for the 95% level.
Since Z0.05 = 1.65, we reject null when Z ≥ +1.65.
Conditions and Completing Test
 Conditions: Although the population is normally
distributed, so we can use B1), the study group really
isn’t A) a good sample from a large population.
 But … this is a popular type of test anyway. Oh well.
③ Test value Z = (506 – 500) / (10 / √6) = 1.47
④ Since 1.47 < 1.65, we do not reject the null
hypothesis.
⑤ Based on the data, there is not enough evidence to
claim that the group is significantly better than the
MCAT average.
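The MCAT test can be checked the same way. This is a sketch with my own variable names, using the standard library's normal distribution for the one-sided critical value (about 1.645, which the table rounds to 1.65).

```python
import math
from statistics import NormalDist

mu_null = 500    # MCAT mean under the null hypothesis
sigma = 10       # known population SD
n = 6            # students in the study group
x_bar = 506      # group's average score
alpha = 0.05

se = sigma / math.sqrt(n)                   # standard error of the mean
z = (x_bar - mu_null) / se                  # test value
z_crit = NormalDist().inv_cdf(1 - alpha)    # one-sided (greater than) critical value
reject = z >= z_crit

print(f"SE = {se:.4f}, Z = {z:.4f}, critical value = {z_crit:.4f}")
print(f"reject H0: {reject}")
```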
Whale Weight
 For a certain small species of whale,
average male adult weight was
measured 10 years ago as 1230 lb.
 A new survey is about to be
conducted. Potential weight changes
in either direction are unknown.
① H0: μ = 1230 lb.
H1: μ ≠ 1230 lb.
 As an exploratory study, let’s set a
low barrier with α = 0.10.
 Nothing is known about the
standard deviation of whale weight.
Whale Computational Steps
 With SD unknown, we will use a t test with sample SD.
 We collect a sample of 35 adult whales, which had
x̅ = 1200 lb. and s = 100 lb.
 Choose t-test with (35 – 1) = 34 DF.
② As a two-sided test, reject if
t ≤ –t0.10/2, 34 DF
or t ≥ +t0.10/2, 34 DF
 From t table, critical value t0.10/2 = 1.691
③ Compute test value:
t = (1200 – 1230) / (100 / √35) = –1.7748
④ Because t = –1.7748 < –t0.10/2 = –1.691, we reject the null H0.
⑤ Based on our sample, we conclude that average adult
male whale weight in the population is not 1230 lbs.
Changing Significance Level
 What would have happened if α = 0.02 in the whale
problem, still with 35 – 1 = 34 DF?
 We change the critical value to t0.02/2 = 2.441.
Now t = –1.7748 > –2.441 and
we would not reject the null at the α = 0.02 level.
 Perhaps we need a way to report a level of evidence?
 That way, called a P-value, can be very useful,
but also is VERY frequently misinterpreted.
It’s next.
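To see the decision flip with the significance level, here is a small sketch. The critical values 1.691 and 2.441 are the table values quoted in these notes for 34 DF; the variable names are my own.

```python
import math

x_bar, mu_null = 1200, 1230   # sample mean and null mean (lb.)
s, n = 100, 35                # sample SD and sample size

t = (x_bar - mu_null) / (s / math.sqrt(n))   # test value, 34 DF

# Two-sided critical values for 34 DF, copied from the t table in the notes.
critical_values = {0.10: 1.691, 0.02: 2.441}

decisions = {}
for alpha, t_crit in critical_values.items():
    decisions[alpha] = abs(t) >= t_crit
    print(f"alpha = {alpha:.2f}: |t| = {abs(t):.4f} vs {t_crit} -> reject H0: {decisions[alpha]}")
```

Same data, same test value, different conclusion: that is the motivation for reporting a level of evidence instead of only "reject" or "do not reject".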
The P-value
 The P-value is the probability of getting a
sample statistic or a more extreme sample
statistic in the direction of the alternative
hypothesis when the null hypothesis is true.
 “Direction” can be one-tailed or two-tailed.
 Two-tailed P-value for whales = 0.0849 (in red).
Computing P-values
 We could use calculus to find P-values,
but we almost always use tables or the computer
(like http://www.imathas.com/stattools/norm.html).
 For a Z score, we use a normal table to find the
appropriate probability area, like we did earlier.
 For the MCAT data, P(Z > 1.47) = 1 – 0.9292 = 0.0708.
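Instead of a table, the normal area can be computed with Python's standard library. The helper `z_p_value` below and its "greater" / "less" / "two-sided" labels are my own naming for this sketch; the example uses the MCAT test value Z = 1.47 from the earlier slide.

```python
from statistics import NormalDist


def z_p_value(z: float, alternative: str) -> float:
    """P-value for a Z test value, in the direction of the alternative."""
    std_normal = NormalDist()              # mean 0, SD 1
    if alternative == "greater":
        return 1 - std_normal.cdf(z)       # area to the right of z
    if alternative == "less":
        return std_normal.cdf(z)           # area to the left of z
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))   # both tails
    raise ValueError(f"unknown alternative: {alternative!r}")


p_mcat = z_p_value(1.47, "greater")
print(f"MCAT one-tailed P-value = {p_mcat:.4f}")
```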
Approximate P-values from Table F
 For t tests, we approximate P-values from Table F.
 Whale sample had 34 DF, test value = –1.7748
 For two-tailed test, |t| = 1.7748 is between the 34 DF table values
for 0.10 (t = 1.691) and 0.05 (t = 2.032),
so approximate P-value is 0.05 < P-value < 0.10.
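Python's standard library has no t distribution, so the sketch below integrates the t density numerically with Simpson's rule to get a two-tailed P-value. This swaps in numerical integration for the table lookup; the function names and the step count are my own choices, and a statistics package would normally do this for you.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed P-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is found with Simpson's rule (steps must be even)
    and doubled, because the t density is symmetric about 0.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_center_area = total * h / 3
    return 1 - 2 * half_center_area


p_whale = t_two_tailed_p(-1.7748, df=34)
print(f"whale two-tailed P-value = {p_whale:.4f}")
```

The result should land inside the 0.05 to 0.10 range found from the table.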
Comparing p and α
 If P-value ≤ α, we reject the null hypothesis and conclude the
test is statistically significant.
 If P-value > α, we do not reject the null hypothesis.
 If p = 0.0849 and α = 0.02, α < p and the test is not significant;
if p = 0.0849 and α = 0.10, α ≥ p and the test is significant.
 “Small P-value rejects; large P-value does not reject.”
Warnings about p-value
 The P-value is NOT the probability that
the null hypothesis is true!
 The P-value is NOT the probability of
falsely rejecting H0 given H0 is true.
 The P-value is NOT the probability of
unsuccessful replication.

 The P-value IS a tool to help in decision
making. Given H0 is true, the p-value is
the probability of generating a result as
or more extreme due to random chance.
Low Birth Weight
 In the United States, average birth weight μ = 3300
grams with population σ = 490 grams.
 Pregnant women in poverty may have stress factors
that would reduce their average birth weight.
 For a random sample of 32 women in poverty, the
sample mean birth weight was 3080 grams.
 At α = 0.05, did the babies of the women in poverty
have lower birth weight?
 We’ll use the p-value approach: Find a p-value then
compare α against p.
Pregnant Women in Poverty Z test
① H0: μ = 3300 grams H1: μ < 3300 grams
② Decision rule: Reject if P-value ≤ α = 0.05
Conditions are okay: A) random sample, B2) n = 32 ≥ 30
③ Test value Z = (3080 – 3300) / (490 / √32) = –2.5398
Using a normal table, we can find exact P-values.
With Z = –2.54 rounded, and a one-tailed test,
P(Z ≤ –2.54) = 0.0055.
④ Since the P-value of 0.0055 < 0.0500, we reject the null.
⑤ Based on the data, women in poverty appear to have
significantly lower average baby birth weights than the
population.
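Here is the same P-value test as a short sketch, standard library only, with my own variable names. It uses 3080 grams as the sample mean, which is the value in the Z computation above.

```python
import math
from statistics import NormalDist

mu_null = 3300   # U.S. average birth weight (grams)
sigma = 490      # population SD (grams)
n = 32           # women sampled
x_bar = 3080     # sample mean used in the Z computation above
alpha = 0.05

z = (x_bar - mu_null) / (sigma / math.sqrt(n))   # test value
p_value = NormalDist().cdf(z)                    # one-tailed, "less than" alternative
reject = p_value <= alpha

print(f"Z = {z:.4f}, P-value = {p_value:.4f}, reject H0: {reject}")
```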
Moving Vans
(Using P-values)

 At a moving truck company, their 16-foot moving trucks
currently get an average of 9.5 mpg. The company believes truck
mpg is roughly normally distributed.
 Perhaps installing devices similar to Trailer Tails (shown above)
might improve mileage. On the other hand, moving trucks do
not have the same shape as big trucks. MPG might get worse.
 The moving truck company installs devices on a random sample
of 24 trucks. On these trucks, sample mean is 9.925 mpg,
sample SD = 1.189.
 At the 5% level of significance, was there a statistically
significant change in mpg?
① H0: μ = 9.5 mpg H1: μ ≠ 9.5 mpg
② Reject H0 if P-value ≤ α = 0.05. Conditions are satisfied:
A) random sample and B1) population normally distributed.
③ t = (9.925 – 9.5) / (1.189 / √24) = 1.751. With 23 DF, two-tailed p-value is somewhere
between 0.10 (t = 1.714) and 0.05 (t = 2.069).
I used the web applet to find the exact p-value = 0.0933.
④ Since P-value > 0.05, we do not reject the null.
⑤ Based on the data, there is no statistically significant change in
moving truck mpg when Trailer Tails are installed.
Summary of Ways to Conduct Tests
1. Confidence Interval: Find (1 – α) confidence interval,
reject null hypothesis if null value is outside interval
2. Critical Value: Compute the test value,
reject null hypothesis if test value is in rejection
region, more extreme than Z or t critical value
3. P-Value: Compute the test value and find P-value,
reject null hypothesis if P-value ≤ α level

 Unfortunately, no consensus exists on the best
method, so we have to understand all three ways.
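As a check that the three ways agree, here is a sketch that runs all of them on the solar flare data (86 of 350 days, null p = 0.20, α = 0.01). Variable names are my own. One assumption to flag: the confidence interval uses p̂ in its standard error while the test uses the null p, so for proportions the interval method can occasionally disagree with the other two in borderline cases.

```python
import math
from statistics import NormalDist

successes, n = 86, 350
p_null, alpha = 0.20, 0.01

p_hat = successes / n
std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value

# 1. Confidence interval: SE built from p-hat (assumed usual CI convention).
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
ci_low = p_hat - z_crit * se_ci
ci_high = p_hat + z_crit * se_ci
reject_by_ci = not (ci_low <= p_null <= ci_high)

# 2. Critical value: SE built from the null p.
se_null = math.sqrt(p_null * (1 - p_null) / n)
z = (p_hat - p_null) / se_null
reject_by_critical = abs(z) >= z_crit

# 3. P-value: two-tailed area beyond |Z|.
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_by_p = p_value <= alpha

print(f"99% CI = ({ci_low:.4f}, {ci_high:.4f}), reject: {reject_by_ci}")
print(f"Z = {z:.4f} vs {z_crit:.4f}, reject: {reject_by_critical}")
print(f"P-value = {p_value:.4f} vs alpha = {alpha}, reject: {reject_by_p}")
```

All three ways reach the same decision here: do not reject the null hypothesis.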
