Inference
Statistical Inference
#4
• Is about using information in a sample to make
estimates of the characteristics (parameters)
of the source population.
Example
• A sample survey revealed:
– Proportion of smokers among a certain
group of population aged 15 to 24.
– Mean of SBP among sampled population
– Prevalence of HIV-positive among people
involved in the study
• When the estimate is of the form of a "range
of plausible values", it is called an interval
estimator.
1. Point Estimate
• A single numerical value used to estimate
the corresponding population parameter.
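Sample Statistics are Estimators of Population Parameters
Sample mean, $\bar{x}$, estimates $\mu$
Sample variance, $S^2$, estimates $\sigma^2$
Sample proportion, $\hat{p}$, estimates $P$ (or $\pi$)
Sample Odds Ratio, $\widehat{OR}$, estimates $OR$
Sample Relative Risk, $\widehat{RR}$, estimates $RR$
Sample correlation coefficient, $r$, estimates $\rho$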
2. Interval Estimation
• Interval estimation specifies a range of
reasonable values for the population
parameter based on a point estimate.
From a Z table or a t table, depending
on the sampling distribution of the
statistic.
Lower limit = Point Estimate - (Critical Value) x (Standard Error)
Upper limit = Point Estimate + (Critical Value) x (Standard Error)
α is to be chosen by the researcher; the most common values of α are 0.05,
0.01 and 0.1.
3. Commonly used CLs are 90%, 95%, and
99%
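The critical values that go with these confidence levels can be looked up in a Z table or computed. A minimal sketch, assuming the sampling distribution is the standard normal; the function name `z_critical` is an illustrative choice, not something from the slides:

```python
from statistics import NormalDist

def z_critical(confidence_level: float) -> float:
    """Two-sided critical value z_(1 - alpha/2) of the standard normal."""
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Critical values for the commonly used confidence levels.
for cl in (0.90, 0.95, 0.99):
    print(f"CL = {cl:.0%}: z = {z_critical(cl):.3f}")
```

For CL = 95% this gives the familiar 1.96 used in the examples that follow.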
Finding the Critical Value
Margin of Error
(Precision of the estimate)
Example:
1. Waiting times (in hours) at a particular hospital are
believed to be approximately normally distributed
with a variance of 2.25 hr².
a. A sample of 20 outpatients revealed a mean
waiting time of 1.52 hours. Construct the 95% CI
for the estimate of the population mean.
b. Suppose that the mean of 1.52 hours had resulted
from a sample of 32 patients. Find the 95% CI.
c. What effect does larger sample size have on the
CI?
a. $1.52 \pm 1.96\sqrt{\dfrac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 \;\Rightarrow\; (0.87,\ 2.17)$
• We are 95% confident that the true mean waiting time is
between 0.87 and 2.17 hrs.
• Although the true mean may or may not be in this interval, 95%
of the intervals formed in this manner will contain the true mean.
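Parts a to c can be checked numerically. The sketch below assumes the known-variance Z interval used in the solution to part a; the helper name `z_interval_mean` is hypothetical. Because it does not round the standard error to 0.33, its limits for n = 20 can differ from (0.87, 2.17) in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_mean(xbar, variance, n, confidence_level=0.95):
    """CI for a population mean when the population variance is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sqrt(variance / n)  # critical value x standard error
    return xbar - margin, xbar + margin

# Same sample mean and variance, two sample sizes (parts a and b).
results = {n: z_interval_mean(1.52, 2.25, n) for n in (20, 32)}
for n, (lo, hi) in results.items():
    print(f"n = {n}: ({lo:.2f}, {hi:.2f}), width = {hi - lo:.2f}")
```

The interval for n = 32 is narrower than the one for n = 20, which answers part c: a larger sample gives a more precise estimate.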
[Figure: density curve of the Z distribution; x-axis: Value (-5 to 5); y-axis: density (0.0 to 0.4)]
As the df gets larger, the student’s t-distribution looks more and more
like the SND with mean=0 and variance=1.
What happens to CI as sample gets larger?
$\bar{x} \pm Z \dfrac{s}{\sqrt{n}}$
For large samples: Z and t values become almost
identical, so CIs are almost identical.
$\bar{x} \pm t \dfrac{s}{\sqrt{n}}$
Degrees of Freedom (df)
df = Number of observations that are free to vary after
sample mean has been calculated
df = n-1
Student's t Table
t distribution values
• With comparison to the Z value
Example
• Standard error =
• t-value at 90% CL at 19 df =1.729
Exercise
• Compute a 95% CI for the mean birth
weight based on n = 10, sample mean =
116.9 oz and s = 21.70.
• From the t Table, $t_{9,\,0.975} = 2.262$
• Ans: (101.4, 132.4)
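The answer can be reproduced with a few lines. This is a minimal sketch: the critical value 2.262 is taken from the t Table as quoted above rather than computed, and the helper name `t_interval_mean` is hypothetical.

```python
from math import sqrt

def t_interval_mean(xbar, s, n, t_crit):
    """CI for a mean with unknown population variance (t_crit from a t table, df = n - 1)."""
    se = s / sqrt(n)          # estimated standard error of the mean
    margin = t_crit * se
    return xbar - margin, xbar + margin

lo, hi = t_interval_mean(xbar=116.9, s=21.70, n=10, t_crit=2.262)
print(f"95% CI: ({lo:.1f}, {hi:.1f})")
```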
2. CIs for single population proportion, p
Hence,
•
Example 3
• Suppose that among 10,000 female operating-room
nurses, 60 women have developed breast cancer over
five years. Find the 95% CI for p based on the point estimate.
• Point estimate = 60/10,000 = 0.006
• The 95% CI for p is given by the interval:
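The interval itself is not shown in these notes. A sketch of one way to compute it, assuming the usual normal-approximation (Wald) interval $\hat{p} \pm Z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$; both this choice of formula and the helper name `proportion_ci` are assumptions:

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, confidence_level=0.95):
    """Normal-approximation (Wald) CI for a single proportion (assumed formula)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

lo, hi = proportion_ci(x=60, n=10_000)
print(f"95% CI for p: ({lo:.4f}, {hi:.4f})")
```

Under this assumption the interval is roughly 0.0045 to 0.0075, that is, about 4.5 to 7.5 cases per 1,000 nurses.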
• Measurement might be
"pre/post", "before/after", "right/left", "parent/child", etc.
Examples of paired data
1) Blood pressure prior to and following treatment,
2) Number of cigarettes smoked per week measured
prior to and following participation in a smoking
cessation program,
3) Number of sex partners in the month prior to and in
the month following an HIV education campaign.
• Notice in each of these examples that the two
occasions of measurement are linked by virtue of the
two measurements being made on the same
individual.
• Longitudinal or follow‐up study
Paired differences
• If two measurements of the same
phenomenon (e.g. blood pressure, #
cigarettes/week, etc.) X and Y are measured
on an individual and if each is normally
distributed, then their difference is also
normally distributed.
• SE of the difference =
• 95% CI
– Lower = (point estimate) - (Zα/2)(SE)
= 0.38 - (1.96)(0.0925) = 0.20
– Upper = (point estimate) + (Zα/2)(SE)
= 0.38 + (1.96)(0.0925) = 0.56
• 95% CI = (0.20, 0.56)
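The same lower and upper limits follow from a generic "point estimate ± critical value × SE" helper. A minimal sketch, assuming a normal sampling distribution; `normal_ci` is an illustrative name:

```python
from statistics import NormalDist

def normal_ci(point_estimate, standard_error, confidence_level=0.95):
    """CI of the form: point estimate +/- Z_(alpha/2) x SE."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin

lower, upper = normal_ci(0.38, 0.0925)
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```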
Hypothesis Testing
One type of statistical inference
• The majority of statistical analyses involve
comparison, most obviously between
treatments or procedures or between groups
of subjects.
• Hypotheses are formulated, experiments are
performed, and results are evaluated for their
consistency (non-consistency) with a
hypothesis.
• Hypothesis Testing (HT) provides an
objective framework for making decisions
using probabilistic methods
Hypothesis
• Is a statement about one or more populations
• Is a claim (assumption) about a population
parameter
• Is frequently concerned with the parameters
of the population about which the statement
is made.
Examples of Research Hypotheses
Population Mean
• The average length of stay of patients
admitted to the hospital is five days
• The mean birthweight of babies
delivered by mothers with low SES is
lower than those from higher SES.
• Etc
Population Proportion
• The proportion of adult smokers in Addis
Ababa is p = 0.40
• The prevalence of HIV among non-married
adults is higher than that in married adults
• Etc
Types of Hypothesis
1. The Null Hypothesis, H0
Is a statement claiming that there is no
difference between the hypothesized value
and the population value.
(The effect of interest is zero = no difference)
States the assumption (hypothesis) to be
tested
H0 is a statement of agreement (or no
difference)
H0 is always about a population parameter,
not about a sample statistic
• Begin with the assumption that the Ho is true
– Similar to the notion of innocent until proven
guilty
• Always contains “=”, “≤” or “≥” sign
• May or may not be rejected
2. The Alternative Hypothesis, HA
• Is a statement of what we will believe is true
if our sample data causes us to reject Ho.
• Is generally the hypothesis that is believed
(or needs to be supported) by the researcher.
• Is a statement that disagrees (opposes)
with Ho
(The effect of interest is not zero)
• Never contains “=”, “≤” or “≥” sign
• May or may not be accepted
Steps in Hypothesis Testing
1. Formulate the appropriate statistical
hypotheses clearly
l l
• Specify HO and HA
H0: μ = μ0      H0: μ ≤ μ0      H0: μ ≥ μ0
H1: μ ≠ μ0      H1: μ > μ0      H1: μ < μ0
two-tailed      one-tailed      one-tailed
2. State the assumptions necessary for
computing probabilities
• A distribution is approximately normal (Gaussian)
• Variance is known or unknown
3. Select a sample and collect data
• Categorical, continuous
4. Decide on the appropriate test statistic
for the hypothesis. E.g., One population
OR
5. Specify the desired level of significance
for the statistical test (α = 0.05, 0.01, etc.)
6. Determine the critical value.
– A value the test statistic must attain to be
declared significant.
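[Figure: standard normal curve for a two-tailed test at α = 0.05. Rejection regions of area α/2 = 0.025 lie below -1.96 and above 1.96; the non-rejection region of area 0.95 lies between them.]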
Statistical Decision
• Reject Ho if the value of the test
statistic that we compute from our
sample is one of the values in the
rejection region
• Don't reject Ho if the computed value of
the test statistic is one of the values in
the non-rejection region.
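The decision rule can be written as a small function. This is a sketch that assumes a Z test statistic; the name `in_rejection_region` and the `tail` labels are illustrative choices, and the z values passed in the usage lines are only examples.

```python
from statistics import NormalDist

def in_rejection_region(z, alpha=0.05, tail="two"):
    """Return True if the Z statistic falls in the rejection region."""
    snd = NormalDist()  # standard normal distribution
    if tail == "two":
        return abs(z) > snd.inv_cdf(1 - alpha / 2)   # beyond +/- 1.96 at alpha = 0.05
    if tail == "left":
        return z < snd.inv_cdf(alpha)                # below -1.645 at alpha = 0.05
    if tail == "right":
        return z > snd.inv_cdf(1 - alpha)            # above 1.645 at alpha = 0.05
    raise ValueError("tail must be 'two', 'left' or 'right'")

print(in_rejection_region(-2.12))               # two-tailed: reject Ho
print(in_rejection_region(1.14))                # two-tailed: do not reject Ho
print(in_rejection_region(-2.12, tail="left"))  # left-tailed: reject Ho
```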
Level of Significance, α
• Is the probability of rejecting a true Ho
• Defines unlikely values of sample statistic if Ho is
true
– Defines the rejection region of the sampling distribution
• The decision is made on the basis of the level of
significance, designated by α.
• More frequently used values of α are 0.01, 0.05 and
0.10.
• α is selected by the researcher at the beginning
One tail and two tail tests
• In a one tail test, the rejection region is
at one end of the distribution or the
other.
• In a two tail test, the rejection region is
split between the two tails.
• Which one is used depends on the way
the Ho is written.
Level of Significance
and the Rejection Region
Example:
• The average survival year after cancer
diagnosis is less than 3 years.
Another way to state conclusion
• Reject Ho if P-value < α
• Do not reject Ho if P-value ≥ α
G. Statistical decision
We reject the Ho because Z = -2.12 is in the rejection
region. The value is significant at 5% α.
H. Conclusion
We conclude that μ is not 30. P-value = 0.0340
• Confidence interval
Example: One-Tailed Test
• A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that
the mean age of the population is less than 30? The
variance is known to be 20. Let α = 0.05.
• Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
• Hypotheses
Ho: μ ≥ 30, HA: μ < 30
• Test statistic
$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20/10}} = -2.12$
• Rejection Region
• With α = 0.05 and the inequality, we have the entire rejection region
at the left. The critical value will be Z = -1.645. Reject Ho if Z < -1.645.
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that μ < 30.
– p = 0.0170 this time because this is a one-tailed test, not a
two-tailed test.
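The one-tailed example can be reproduced as follows. A minimal sketch that uses only the data given above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Data from the example: Ho: mu >= 30, HA: mu < 30
n, xbar, sigma2, mu0, alpha = 10, 27, 20, 30, 0.05

snd = NormalDist()                       # standard normal distribution
z = (xbar - mu0) / sqrt(sigma2 / n)      # test statistic
z_crit = snd.inv_cdf(alpha)              # left-tail critical value
p_value = snd.cdf(z)                     # area to the left of z

print(f"Z = {z:.2f}, critical value = {z_crit:.3f}, p = {p_value:.3f}")
print("Reject Ho" if z < z_crit else "Do not reject Ho")
```

Doubling this lower-tail area gives the two-tailed P-value of about 0.034 for the same Z.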
• Suppose that the Ho and HA take the form
Ho: μ = μo, HA: μ > μo
• In this case, Ho would be rejected for large
values of the test statistic (critical values > 0)
• The P-value would correspond to the area
in the upper tail of the SND, to the right of
the value of the test statistic.
• Calculation of Pooled Variance
• Hypotheses:
Ho: μ1 ≤ μ2 (i.e., μ1 - μ2 ≤ 0), HA: μ1 > μ2
• With α = 0.05 and df = 23, the critical value of t is 1.7139. We
reject Ho if t > 1.7139.
• Test statistic
H0 : p = 0.014
HA: p ≠ 0.014
• The test statistic is given by:
• The critical value of Zα/2 at α = 5% is ±1.96.
• Don't reject Ho since Z (= 1.14) is in the non-rejection
region between ±1.96.
• P-value = 0.2548
• We do not have sufficient evidence to
conclude that the probability of developing
asthma for children whose mothers smoke
in the home is different from the probability
in the general population
4. Hypothesis Tests about the Difference Between
Two Population Proportions
Where X1 = the observed number of events in the first sample
and X2 = the observed number of events in the second sample
Example
• A study was conducted to investigate the
possible cause of a gastroenteritis outbreak
following a lunch served in a high school
cafeteria. Among the 225 students who ate the
sandwiches, 109 became ill, while among the
38 students who did not eat the sandwiches, 4
became ill. Is there a significant difference
between the two groups at α = 5%?
• We wish to test
Ho: p1 = p2 against the alternative
HA: p1 ≠ p2
• Assume that the sample sizes are large
enough, and the normal approximation to
the binomial distribution is valid.
• If the Ho is true, then p1 = p2 = p
The area under the standard normal curve to the
right of 4.36 is less than 0.0001. Therefore, p <
0.0002. We reject H0 at the 0.05 level.
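The test statistic for this example can be recomputed from the counts. The sketch below assumes the usual pooled estimate $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$ under Ho: p1 = p2 and the normal approximation; the formula is not shown in these notes, so it is stated here as an assumption, and `two_proportion_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided Z test for Ho: p1 = p2 using a pooled proportion (assumed formula)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)                 # common p if Ho is true
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed P-value
    return z, p_value

# 109 of 225 sandwich eaters ill vs 4 of 38 non-eaters ill
z, p_value = two_proportion_z(109, 225, 4, 38)
print(f"Z = {z:.2f}, reject Ho at 0.05: {p_value < 0.05}")
```

Without intermediate rounding the statistic comes out near 4.37, in line with the 4.36 quoted above, and the P-value is far below 0.05.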