Hypothesis Testing
• Goal: Make statement(s) regarding unknown population
parameter values based on sample data
• Elements of a hypothesis test:
– Null hypothesis: Statement regarding the value(s) of unknown
parameter(s). Typically will imply no association between
explanatory and response variables in our applications (will
always contain an equality)
– Alternative hypothesis: Statement contradictory to the null
hypothesis (will always contain an inequality)
– Test statistic: Quantity based on sample data and null
hypothesis used to test between null and alternative hypotheses
– Rejection region: Values of the test statistic for which we
reject the null in favor of the alternative hypothesis
Hypothesis Testing
True State      Test Result: $H_0$ True     Test Result: $H_0$ False
$H_0$ True      Correct Decision            Type I Error
$H_0$ False     Type II Error               Correct Decision

$\alpha = P(\text{Type I Error}) \qquad \beta = P(\text{Type II Error})$
• Goal: Keep $\alpha$, $\beta$ reasonably small
Example: Efficacy Test for New Drug
• Drug company has new drug, wishes to compare it
with current standard treatment
• Federal regulators tell company that they must
demonstrate that new drug is better than current
treatment to receive approval
• Firm runs clinical trial where some patients
receive new drug, and others receive standard
treatment
• Numeric response of therapeutic effect is obtained
(higher scores are better).
• Parameter of interest: $\mu_{New} - \mu_{Std}$
Example: Efficacy Test for New Drug
• Null hypothesis: New drug is no better than standard treatment
$H_0: \mu_{New} - \mu_{Std} \le 0 \quad (\mu_{New} - \mu_{Std} = 0)$
• Alternative hypothesis: New drug is better than standard treatment
$H_A: \mu_{New} - \mu_{Std} > 0$
• Experimental (Sample) data:
$\bar{y}_{New}, \ \bar{y}_{Std} \qquad s_{New}, \ s_{Std} \qquad n_{New}, \ n_{Std}$
Sampling Distribution of Difference in Means
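• In large samples, the difference in two sample means is
approximately normally distributed (the second argument is the standard error):
$\bar{Y}_1 - \bar{Y}_2 \sim N\left(\mu_1 - \mu_2, \ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$
• Under the null hypothesis, $\mu_1 - \mu_2 = 0$ and:
$Z = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0, 1)$
• $\sigma_1^2$ and $\sigma_2^2$ are unknown and estimated by $s_1^2$ and $s_2^2$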
Example: Efficacy Test for New Drug
• Type I error: Concluding that the new drug is better than the
standard ($H_A$) when in fact it is no better ($H_0$). Ineffective drug is
deemed better.
– Traditionally $\alpha = P(\text{Type I error}) = 0.05$
• Type II error: Failing to conclude that the new drug is better
($H_A$) when in fact it is. Effective drug is deemed to be no better.
– Traditionally a clinically important difference ($\Delta$) is assigned
and sample sizes chosen so that:
$\beta = P(\text{Type II error} \mid \mu_1 - \mu_2 = \Delta) \le 0.20$
Elements of a Hypothesis Test
• Test Statistic: Difference between the sample means,
scaled to number of standard deviations (standard errors)
from the null difference of 0 for the population means:
$\text{T.S.}: \ z_{obs} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
• Rejection Region: Set of values of the test statistic that are
consistent with $H_A$, such that the probability it falls in this
region when $H_0$ is true is $\alpha$ (we will always set $\alpha = 0.05$)
$\text{R.R.}: \ z_{obs} \ge z_\alpha \qquad \alpha = 0.05 \Rightarrow z_\alpha = 1.645$
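The claim that the test statistic lands in the rejection region with probability $\alpha$ when $H_0$ is true can be checked by simulation. The sketch below is illustrative only: the function names, the group mean of 10, the standard deviation of 3, the group size of 50, and the seed are all assumptions chosen for the demonstration, not values from these slides.

```python
import math
import random


def z_statistic(sample1, sample2):
    """Large-sample z statistic for the difference in two sample means."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    # Sample variances with the usual n - 1 divisor.
    var1 = sum((y - mean1) ** 2 for y in sample1) / (n1 - 1)
    var2 = sum((y - mean2) ** 2 for y in sample2) / (n2 - 1)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


def simulated_rejection_rate(n_sims=2000, n_per_group=50, z_alpha=1.645, seed=1):
    """Fraction of simulated trials with z_obs >= z_alpha when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups share the same mean, so mu1 - mu2 = 0 (H0 is true).
        group1 = [rng.gauss(10.0, 3.0) for _ in range(n_per_group)]
        group2 = [rng.gauss(10.0, 3.0) for _ in range(n_per_group)]
        if z_statistic(group1, group2) >= z_alpha:
            rejections += 1
    return rejections / n_sims


rate = simulated_rejection_rate()
print(f"Simulated P(reject H0 | H0 true) = {rate:.3f}")
```

The printed rate should sit close to 0.05, with some simulation noise and a slight upward drift because the variances are estimated from the samples.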
P-value (aka Observed Significance Level)
• P-value: Measure of the strength of evidence the sample
data provides against the null hypothesis:
P(Evidence this strong or stronger against $H_0$ | $H_0$ is true)
$\text{P-val}: \ p = P(Z \ge z_{obs})$
Large-Sample Test: $H_0: \mu_1 - \mu_2 = 0$ vs $H_A: \mu_1 - \mu_2 > 0$
• $H_0: \mu_1 - \mu_2 = 0$ (No difference in population means)
• $H_A: \mu_1 - \mu_2 > 0$ (Population Mean 1 > Pop Mean 2)
• $\text{T.S.}: \ z_{obs} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
• $\text{R.R.}: \ z_{obs} \ge z_\alpha$
• $\text{P-value}: \ P(Z \ge z_{obs})$
• Conclusion: Reject $H_0$ if test statistic falls in rejection region,
or equivalently the P-value is $\le \alpha$
Example: Botox for Cervical Dystonia
• Patients: Individuals suffering from cervical dystonia
• Response: Tsui score of severity of cervical dystonia
(higher scores are more severe) at week 8 of treatment
• Research (alternative) hypothesis: Botox A
decreases mean Tsui score more than placebo
• Groups: Placebo (Group 1) and Botox A (Group 2)
• Experimental (Sample) Results:
$\bar{y}_1 = 10.1 \qquad s_1 = 3.6 \qquad n_1 = 33$
$\bar{y}_2 = 7.7 \qquad s_2 = 3.4 \qquad n_2 = 35$
Source: Wissel, et al (2001)
Example: Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui
scores than placebo ($\alpha = 0.05$)
• $H_0: \mu_1 - \mu_2 = 0$
• $H_A: \mu_1 - \mu_2 > 0$
• $\text{T.S.}: \ z_{obs} = \frac{10.1 - 7.7}{\sqrt{\frac{(3.6)^2}{33} + \frac{(3.4)^2}{35}}} = \frac{2.4}{0.85} = 2.82$
• $\text{R.R.}: \ z_{obs} \ge z_\alpha = z_{.05} = 1.645$
• $\text{P-val}: \ P(Z \ge 2.82) = .0024$
Conclusion: Botox A produces lower mean Tsui scores than
placebo (since 2.82 > 1.645 and P-value < 0.05)
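The Botox calculation can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sided_z_test` and its argument order are assumptions made for illustration, while the summary statistics are the ones reported above from Wissel, et al (2001).

```python
import math
from statistics import NormalDist


def one_sided_z_test(mean1, sd1, n1, mean2, sd2, n2, alpha=0.05):
    """Large-sample test of H0: mu1 - mu2 = 0 vs HA: mu1 - mu2 > 0."""
    std_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z_obs = (mean1 - mean2) / std_error
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
    p_value = 1 - NormalDist().cdf(z_obs)       # P(Z >= z_obs)
    return z_obs, z_alpha, p_value, z_obs >= z_alpha


# Placebo is Group 1, Botox A is Group 2.
z_obs, z_alpha, p_value, reject = one_sided_z_test(10.1, 3.6, 33, 7.7, 3.4, 35)
print(f"z_obs = {z_obs:.2f}, z_alpha = {z_alpha:.3f}, "
      f"p = {p_value:.4f}, reject H0: {reject}")
```

The values agree with the hand calculation to the rounding shown on the slide.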
2-Sided Tests
• Many studies don't assume a direction with respect to the
difference $\mu_1 - \mu_2$
• $H_0: \mu_1 - \mu_2 = 0 \qquad H_A: \mu_1 - \mu_2 \ne 0$
• Test statistic is the same as before
• Decision Rule:
– Conclude $\mu_1 - \mu_2 > 0$ if $z_{obs} \ge z_{\alpha/2}$ ($\alpha = 0.05 \Rightarrow z_{\alpha/2} = 1.96$)
– Conclude $\mu_1 - \mu_2 < 0$ if $z_{obs} \le -z_{\alpha/2}$ ($\alpha = 0.05 \Rightarrow -z_{\alpha/2} = -1.96$)
– Do not reject $\mu_1 - \mu_2 = 0$ if $-z_{\alpha/2} \le z_{obs} \le z_{\alpha/2}$
• P-value: $2P(Z \ge |z_{obs}|)$
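The three-way decision rule for a 2-sided test translates directly into code. The sketch below assumes a hypothetical helper named `two_sided_decision`; the z values passed to it are illustrative inputs, not results from any study.

```python
from statistics import NormalDist


def two_sided_decision(z_obs, alpha=0.05):
    """Apply the 2-sided decision rule to an observed z statistic."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))    # 2 P(Z >= |z_obs|)
    if z_obs >= z_crit:
        conclusion = "mu1 - mu2 > 0"
    elif z_obs <= -z_crit:
        conclusion = "mu1 - mu2 < 0"
    else:
        conclusion = "do not reject mu1 - mu2 = 0"
    return conclusion, p_value


for z in (2.5, -2.5, 1.0):   # illustrative values only
    conclusion, p_value = two_sided_decision(z)
    print(f"z_obs = {z:+.1f}: {conclusion} (p = {p_value:.4f})")
```

Taking the absolute value of the statistic in the P-value makes the calculation symmetric, so evidence in either direction counts against $H_0$.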
Power of a Test
• Power: Probability a test rejects $H_0$ (depends on $\mu_1 - \mu_2$)
– $H_0$ True: Power = P(Type I error) = $\alpha$
– $H_0$ False: Power = 1 - P(Type II error) = $1 - \beta$
• Example:
– $H_0: \mu_1 - \mu_2 = 0 \qquad H_A: \mu_1 - \mu_2 > 0$
– $\sigma_1^2 = \sigma_2^2 = 25 \qquad n_1 = n_2 = 25$
• Decision Rule: Reject $H_0$ (at $\alpha = 0.05$ significance level) if:
$z_{obs} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{2}} \ge 1.645 \ \Rightarrow \ \bar{y}_1 - \bar{y}_2 \ge 2.326$
Power of a Test
• Now suppose in reality that $\mu_1 - \mu_2 = 3.0$ ($H_A$ is true)
• Power now refers to the probability we (correctly)
reject the null hypothesis. Note that the sampling
distribution of the difference in sample means is
approximately normal, with mean 3.0 and standard
deviation (standard error) 1.414.
• Decision Rule (from last slide): Conclude population
means differ if the sample mean for group 1 is at least
2.326 higher than the sample mean for group 2
• Power for this case can be computed as:
$P(\bar{Y}_1 - \bar{Y}_2 \ge 2.326) \qquad \bar{Y}_1 - \bar{Y}_2 \sim N(3, \ \sqrt{2.0} = 1.414)$
Power of a Test
$\text{Power} = P(\bar{Y}_1 - \bar{Y}_2 \ge 2.326) = P\left(Z \ge \frac{2.326 - 3}{1.41} = -0.48\right) = .6844$
• All else being equal:
• As sample sizes increase, power increases
• As population variances decrease, power increases
• As the true mean difference increases, power increases
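The power calculation and the three "all else being equal" statements can be explored numerically. This is a sketch under the slide's setup (known population standard deviations, one-sided test); the function name `one_sided_power` is an assumption. Because the code does not round the standardized cutoff to -0.48, it gives roughly 0.683 rather than the slide's .6844.

```python
import math
from statistics import NormalDist


def one_sided_power(true_diff, sigma1, sigma2, n1, n2, alpha=0.05):
    """Power of the one-sided z test when mu1 - mu2 equals true_diff."""
    std_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # Reject H0 when ybar1 - ybar2 is at least this cutoff.
    cutoff = NormalDist().inv_cdf(1 - alpha) * std_error
    # Under the true difference, ybar1 - ybar2 ~ N(true_diff, std_error).
    return 1 - NormalDist(mu=true_diff, sigma=std_error).cdf(cutoff)


power = one_sided_power(true_diff=3.0, sigma1=5.0, sigma2=5.0, n1=25, n2=25)
print(f"Power at mu1 - mu2 = 3.0: {power:.3f}")

# Each change below should raise the power relative to the baseline.
print(one_sided_power(3.0, 5.0, 5.0, 100, 100))  # larger samples
print(one_sided_power(3.0, 3.0, 3.0, 25, 25))    # smaller variances
print(one_sided_power(5.0, 5.0, 5.0, 25, 25))    # larger true difference
```

Setting `true_diff=0` returns $\alpha$, matching the statement that power equals P(Type I error) when $H_0$ is true.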
Power of a Test
[Figure: Distribution ($H_0$) and Distribution ($H_A$)]
Power of a Test
Power Curves for group sample sizes of 25, 50, 75, 100 and
varying true values $\mu_1 - \mu_2$ with $\sigma_1 = \sigma_2 = 5$.
• For given $\mu_1 - \mu_2$, power increases with sample size
• For given sample size, power increases with $\mu_1 - \mu_2$
Sample Size Calculations for Fixed Power
• Goal: Choose sample sizes to have a favorable chance of
detecting a clinically meaningful difference
• Step 1: Define an important difference in means:
– Case 1: $\sigma$ approximated from prior experience or pilot study; difference
can be stated in units of the data
– Case 2: $\sigma$ unknown; difference must be stated in units of standard
deviations of the data
$\delta = \frac{\mu_1 - \mu_2}{\sigma}$
• Step 2: Choose the desired power to detect the clinically
meaningful difference ($1 - \beta$, typically at least .80). For 2-sided test:
$n_1 = n_2 = \frac{2\left(z_{\alpha/2} + z_\beta\right)^2}{\delta^2}$
Example: Rosiglitazone for HIV-1
Lipoatrophy
• Trts: Rosiglitazone vs Placebo
• Response: Change in limb fat mass
• Clinically Meaningful Difference: 0.5 (std dev's)
• Desired Power: $1 - \beta = 0.80$
• Significance Level: $\alpha = 0.05$
$z_{\alpha/2} = 1.96 \qquad z_\beta = z_{.20} = 0.84$
$n_1 = n_2 = \frac{2(1.96 + 0.84)^2}{(0.5)^2} = 63$ (rounded up from 62.72)
Source: Carr, et al (2004)
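The sample size formula for a 2-sided test is easy to wrap in a small function. The sketch below assumes a hypothetical helper `per_group_sample_size` that takes the difference in standard deviation units ($\delta$) and rounds up to the next whole subject; it uses unrounded normal quantiles rather than 1.96 and 0.84.

```python
import math
from statistics import NormalDist


def per_group_sample_size(delta, power=0.80, alpha=0.05):
    """Per-group n for a 2-sided z test to detect a difference of delta SDs."""
    z_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)                 # z_beta, beta = 1 - power
    n = 2 * (z_half_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)                                  # round up to whole subjects


n = per_group_sample_size(delta=0.5)
print(f"Subjects needed per group: {n}")
```

With $\delta = 0.5$, power 0.80, and $\alpha = 0.05$, this reproduces the 63 per group from the Rosiglitazone example.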
Confidence Intervals
• Normally distributed data: approximately 95% of
individual measurements lie within 2 standard
deviations of the mean
• Difference between 2 sample means is
approximately normally distributed in large
samples (regardless of shape of distribution of
individual measurements):
$\bar{Y}_1 - \bar{Y}_2 \sim N\left(\mu_1 - \mu_2, \ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)$
• Thus, we can expect (with 95% confidence) that our sample
mean difference lies within 2 standard errors of the true difference
$(1-\alpha)100\%$ Confidence Interval for $\mu_1 - \mu_2$
• Large sample Confidence Interval for $\mu_1 - \mu_2$:
$(\bar{y}_1 - \bar{y}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
• Standard level of confidence is 95% ($z_{.025} = 1.96 \approx 2$)
• $(1-\alpha)100\%$ CI's and 2-sided tests reach the same
conclusions regarding whether $\mu_1 - \mu_2 = 0$
Example: Viagra for ED
• Comparison of Viagra (Group 1) and Placebo (Group 2)
for ED
• Data pooled from 6 double-blind trials
• Subjects: White males
• Response: Percent of successful intercourse attempts in
past 4 weeks (Each subject reports his own percentage)
$\bar{y}_1 = 63.2 \qquad s_1 = 41.3 \qquad n_1 = 264$
$\bar{y}_2 = 23.5 \qquad s_2 = 42.3 \qquad n_2 = 240$
95% CI for $\mu_1 - \mu_2$:
$(63.2 - 23.5) \pm 1.96\sqrt{\frac{(41.3)^2}{264} + \frac{(42.3)^2}{240}} \equiv 39.7 \pm 7.3 \equiv (32.4, 47.0)$
Source: Carson, et al (2002)
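The interval above can be recomputed from the summary statistics. This is a minimal sketch; the function name `large_sample_ci` is an assumption, and the inputs are the Viagra and placebo values reported from Carson, et al (2002).

```python
import math
from statistics import NormalDist


def large_sample_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Large-sample confidence interval for mu1 - mu2."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)              # z_{alpha/2}
    margin = z * math.sqrt(sd1**2 / n1 + sd2**2 / n2)    # z times standard error
    diff = mean1 - mean2
    return diff - margin, diff + margin


# Viagra is Group 1, placebo is Group 2.
lower, upper = large_sample_ci(63.2, 41.3, 264, 23.5, 42.3, 240)
print(f"95% CI for mu1 - mu2: ({lower:.1f}, {upper:.1f})")
```

The interval lies entirely above 0, so a 2-sided test at $\alpha = 0.05$ would also conclude $\mu_1 - \mu_2 > 0$.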