
Measures of Association

Why do we use statistics in epidemiology?

Generalization of conclusions: sample → population
Assess strength of evidence
Make comparisons
Make predictions
Role of Statistics in Public Health and Medicine

Science                                    Statistics
1. Idea or question                        1. Statistical model / hypothesis
2. Collect data and make observations      2. Study design
3. Describe data                           3. Descriptive statistics
4. Assess the strength of evidence for     4. Inferential statistics
   or against the hypothesis
There are basically two main fields in statistics:

Point Estimation
For example, estimating Relative Risks

Hypothesis Testing
For example, testing if Relative Risk = 1
Types of Data

Categorical (qualitative)
Nominal scale - no natural order
gender, marital status, race
Ordinal scale - ordered categories
severity scale, good/better/best
Types of Data

Numerical (quantitative)
Discrete - (few) integer values
number of children in a family
Continuous - measure to arbitrary
precision
blood pressure, weight
Dependent versus Independent Variables

These terms developed out of an experimental research paradigm.

Dependent variables are the traits that we are trying to explain or predict.

Independent variables are the traits that we are using to try to explain or predict the dependent variable.
Overview of Association Methods

Dependent Variable       Independent Variable     Method
Categorical (Discrete)   Categorical (Discrete)   Relative Risk (C.I.)
                                                  Odds Ratio (C.I.)
                                                  Chi-square test
                                                  Test of proportions
Categorical (Discrete)   Continuous               Logistic regression
                                                  Discriminant analysis
Overview of Association Methods

Dependent Variable       Independent Variable     Method
Continuous               Continuous               Linear regression
                                                  Correlations
Continuous               Categorical (Discrete)   t-test
                                                  Analysis of Variance
WHAT is statistical significance and WHAT does it tell us?

We use statistics to tell us whether apparent differences between samples are real or due to chance.

A p-value is the probability that a TEST STATISTIC would be as extreme or more extreme than the one observed if the null hypothesis were true.

For example, a p-value of 0.05 means that 5% of the time we would observe a test statistic this extreme by chance alone (i.e., when there is no real difference).
What is a confidence interval?

A 95% confidence interval is constructed so that, over repeated sampling, 95% of such intervals contain the true value of the parameter.

      (----- 95% confidence -----)
   (-------- 99% confidence --------)
                  |
            Point estimate
Measures of association for
categorical dependent
variables
and
categorical independent
variables
Test of Proportions and Chi-Square test are used in related but different situations

1) We sample members of 2 groups and classify each member according to some qualitative characteristic (e.g. cigarette smoking). The hypothesis is
H0: groups are homogeneous (p1j = p2j for all j)
HA: groups are not homogeneous

2) We sample members of a population and cross-classify each member according to two qualitative characteristics. The hypothesis is
H0: factors are independent (pij = pi. p.j)
HA: factors are not independent
Test of Proportions

The hypothesis that two groups are the same is addressed by the hypotheses:

H0: p1 = p2
HA: p1 ≠ p2

A statistic useful for this comparison is the difference in the observed, or sample, proportions:

p̂1 − p̂2

Q: What is the distribution of this statistic?
A: Approximately normal.

p̂1 − p̂2 ~ N( p1 − p2, p1(1−p1)/n1 + p2(1−p2)/n2 )
Estimator for (p1 − p2):

p̂1 − p̂2 = X1/n1 − X2/n2

Standard deviation: sqrt( p̂1q̂1/n1 + p̂2q̂2/n2 )

A (1−α)100% large-sample confidence interval for (p1 − p2):

(p̂1 − p̂2) ± z(α/2) sqrt( p̂1q̂1/n1 + p̂2q̂2/n2 )
Standard Normal Distribution, Z
Confidence Interval for a difference in proportions

(p̂1 − p̂2) ± 1.96 sqrt( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )

Note: A common (pooled) estimate isn't used when confidence intervals are computed for the difference in the population proportions, p1 − p2. In this case we don't have any assumption regarding the relationship between p1 and p2, so we use the separate estimates p̂1 and p̂2, as above.
Simple Test of Proportions

The test statistic used for testing H0: p1 = p2 is:

Z = (p̂1 − p̂2 − 0) / sqrt( p̂0(1−p̂0)/n1 + p̂0(1−p̂0)/n2 )

where p̂0 is the common (pooled) estimate of the proportion under H0.

Note: The test is still valid if we had simply used the separate estimates p̂1 and p̂2 instead of the common estimate based on H0.
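The pooled test statistic above can be sketched in Python; the function name is illustrative and only the standard library is used.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2, using the common (pooled) estimate p0."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)          # common estimate under H0
    se = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

# Blood-pressure data used later in these slides: 38/50 women vs 65/100 men
z = two_proportion_z(38, 50, 65, 100)
print(round(z, 2))   # → 1.37
```

With the pooled estimate, Z ≈ 1.37 for these data; the worked example later in the slides uses the separate (unpooled) estimates instead, which the note above says is also valid.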
Example:

Test for the difference between

a) the proportion of women who had their blood pressure reduced by a certain drug

versus

b) the proportion of men who had their blood pressure reduced.

In a study to investigate drug Y's potential in lowering blood pressure in hypertensive men and women, 50 women and 100 men were given the drug. At the end of the study, the results below were reported:

                    Men    Women
Sample size         100    50
# with reduced BP   65     38

a. Estimate the difference in the true proportions who had their blood pressure reduced with a 99% confidence interval.
HO: The proportion of men who had their blood pressure reduced is the same as that of the women who had their blood pressure reduced.

HA: The proportion of men who had their blood pressure reduced is not the same as that of the women who had their blood pressure reduced.
Proportion of men with reduced blood pressure:
65/100 = .65 = p̂1

Proportion of women with reduced blood pressure:
38/50 = .76 = p̂2

Point estimate for p2 − p1 = .76 − .65 = .11

Standard deviation = sqrt( p̂1q̂1/n1 + p̂2q̂2/n2 )
                   = sqrt( .65×.35/100 + .76×.24/50 ) = .0770

Z = (p̂2 − p̂1) / standard deviation = .11 / .0770 = 1.43

Is this significant at a 0.05 level of significance?
Standard Normal Distribution, Z
What is the 99% confidence interval?

(p̂1 − p̂2) ± 2.58 sqrt( p̂1q̂1/n1 + p̂2q̂2/n2 )

= .11 ± (2.58)(.0770)

= .11 ± .199

= (−0.089, 0.309)
Conclusion:

Since the confidence interval contains 0, there is no significant difference between p1 and p2 at the 0.01 level.
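Recomputing this 99% interval in Python (illustrative helper, standard library only):

```python
from math import sqrt

def diff_proportion_ci(x1, n1, x2, n2, z=2.58):
    """CI for p1 - p2 using separate (unpooled) variance estimates."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# Women (38/50) vs men (65/100), 99% confidence (z = 2.58)
lo, hi = diff_proportion_ci(38, 50, 65, 100)
print(round(lo, 3), round(hi, 3))   # → -0.089 0.309
```

The interval .11 ± 2.58 × .0770 works out to (−0.089, 0.309), which straddles 0.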
Chi-square test

Is for the case where we sample members of a population and cross-classify each member according to two qualitative characteristics.

The hypothesis is
H0: factors are independent (pij = pi. p.j)
HA: factors are not independent
Chi-Square Test of Independence

Null Hypothesis: Variable 1 is independent of variable 2.

Alternative Hypothesis: Variable 1 is associated with variable 2.

χ²(df) = Σ (O − E)² / E

or

χ² = Σ over i,j of [nij − E(nij)]² / E(nij),  where E(nij) = ni. × n.j / n..

Degrees of freedom: (r − 1)(c − 1)


Chi-square Test Example
              Diabetic    Not Diabetic    Total
Smoking       50 (n11)    25 (n12)        75 (n1.)
Not smoking   20 (n21)     5 (n22)        25 (n2.)
Total         70 (n.1)    30 (n.2)       100 (n..)

HO: No association between smoking and diabetes.

HA: An association between smoking and diabetes.

Using: χ² = Σ [nij − E(nij)]² / E(nij)
1. Calculate the expected values:

              Diabetic           Not Diabetic       Total
Smoking       75×70/100 = 52.5   75×30/100 = 22.5   75
Not smoking   25×70/100 = 17.5   25×30/100 = 7.5    25
Total         70                 30                 100

2. Add up the squared differences Obs − Exp, divided by the expected values:

χ² = (50 − 52.5)²/52.5 + (25 − 22.5)²/22.5 + (20 − 17.5)²/17.5 + (5 − 7.5)²/7.5 = 1.59

3. Figure out the degrees of freedom:

df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
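The three steps above can be sketched in Python (illustrative function, standard library only):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    rows = [sum(r) for r in table]           # row totals n_i.
    cols = [sum(c) for c in zip(*table)]     # column totals n_.j
    n = sum(rows)                            # grand total n_..
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n      # E(n_ij) = n_i. * n_.j / n_..
            stat += (obs - exp) ** 2 / exp
    return stat

# Smoking x diabetes table from the slides
print(round(chi_square_2x2([[50, 25], [20, 5]]), 2))   # → 1.59
```

The degrees of freedom, (r − 1)(c − 1) = 1 here, are then compared against the χ² table on the next slide.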
Upper percentiles of χ² distributions

[Figure: χ² density with k degrees of freedom; shaded area = 1 − p to the right of χ²(p, k)]

df    0.90     0.95     0.99
 1    2.706    3.841    6.635
 5    9.236   11.070   15.086
10   15.987   18.307   23.209
15   22.307   24.996   30.578

Since 1.59 < 3.841 (the 0.95 percentile with 1 df), we fail to reject H0 at the 0.05 level.
Back to RRs and ORs:

For assessing whether the main epidemiological measures of association (relative risk and odds ratio) are different from the null value, we focus on confidence interval estimation.
Confidence Interval of the RR

              Diseased    Not Diseased    Total
Exposure      a           b               a+b
No exposure   c           d               c+d
Total         a+c         b+d             a+b+c+d

RR = [a/(a+b)] / [c/(c+d)]

It's easiest to determine the confidence interval by taking the natural logarithm (ln) of the RR, because the only way to get a reasonable formula for the variance of RR is to work in the world of natural logs.

Variance for ln RR = b/[a(a+b)] + d/[c(c+d)]
Confidence Interval

CI = exp[ ln RR ± z(α/2) sqrt( variance(ln RR) ) ]
   = (RR) × exp[ ± z(α/2) sqrt( variance(ln RR) ) ]

Illustration:
Results of a cohort study that followed 100 non-diabetic nurses for 15 years. At the end of the 15 years their smoking behavior was related to their diabetic status.

              Diabetic    Not Diabetic    Total
Smoking       50          25              75
Not smoking   20           5              25
Total         70          30             100

RR = (50/75) / (20/25) = 0.6667 / 0.8 = 0.8333

ln RR = −0.1823

SD(ln RR) = sqrt( 25/(50×75) + 5/(20×25) ) = 0.1291

ln RR = −0.1823

95% CI for the ln RR = −0.1823 ± 1.96 × 0.1291 = (−0.4354, 0.0707)

95% CI for the RR = (e^−0.4354, e^0.0707) = (0.647, 1.073)

Conclusion: since the CI contains 1, there is not a significant relationship between smoking and diabetes.
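The log-transform interval for the relative risk can be sketched in Python (illustrative function, standard library only):

```python
from math import sqrt, log, exp

def rr_ci(a, b, c, d, z=1.96):
    """Relative risk and CI via the natural-log transform of RR."""
    rr = (a / (a + b)) / (c / (c + d))
    se = sqrt(b / (a * (a + b)) + d / (c * (c + d)))  # SD of ln RR
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Smoking/diabetes cohort table: a=50, b=25, c=20, d=5
rr, lo, hi = rr_ci(50, 25, 20, 5)
print(round(rr, 3), round(lo, 3), round(hi, 3))   # → 0.833 0.647 1.073
```

Because the interval (0.647, 1.073) contains 1, the association is not significant at the 0.05 level.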
Confidence Interval of the Odds Ratio

Odds for disease = p / (1 − p)

OR = [p1/(1−p1)] / [p2/(1−p2)]

              Diseased    Not Diseased    Total
Exposure      a           b               a+b
No exposure   c           d               c+d
Total         a+c         b+d             a+b+c+d

Using the Taylor expansion method, an approximate (1−α) CI for the OR is:

CI = (ad/bc) exp[ ± z(α/2) sqrt( 1/a + 1/b + 1/c + 1/d ) ]
Results of a case-control study that investigated the relation between diabetes and smoking.

              Diabetic    Not Diabetic    Total
Smoking       50          25              75
Not smoking   20           5              25
Total         70          30             100

HO: No association between smoking and diabetes.
HA: An association between smoking and diabetes.

OR = (50 × 5) / (20 × 25) = 250/500 = 0.5
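The Taylor-expansion interval for the odds ratio can be sketched the same way (illustrative function, standard library only):

```python
from math import sqrt, exp

def or_ci(a, b, c, d, z=1.96):
    """Odds ratio with Taylor-expansion confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # SD of ln OR
    return odds_ratio, odds_ratio * exp(-z * se), odds_ratio * exp(z * se)

# Case-control smoking/diabetes table: a=50, b=25, c=20, d=5
o, lo, hi = or_ci(50, 25, 20, 5)
print(round(o, 2), round(lo, 3), round(hi, 3))   # → 0.5 0.168 1.489
```

With z = 1.96 this gives a 95% CI of roughly (0.168, 1.489); since the interval contains 1, the association is not significant at the 0.05 level.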
Measures of association for
continuous dependent
variables
and
categorical independent
variables
Normal Distribution

A common probability model for continuous data
Can be used to characterize the Binomial or Poisson under certain circumstances
Bell-shaped curve
takes values between −∞ and +∞
symmetric about the mean
mean = median = mode

Examples: birthweights, height, weight
The arithmetic mean is the most common measure of the central location of a sample:

X̄ = (1/n) Σ(j=1..n) Xj

The standard deviation tells us how widely dispersed the values are around the mean. It is a measure of variation:

s = sqrt( (1/(n−1)) Σ(j=1..n) (Xj − X̄)² )
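These two formulas translate directly into Python (illustrative function, standard library only):

```python
from math import sqrt

def mean_sd(xs):
    """Sample mean and standard deviation (n - 1 denominator)."""
    n = len(xs)
    xbar = sum(xs) / n
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return xbar, s

xbar, s = mean_sd([1.0, 2.0, 3.0, 4.0])
print(xbar, round(s, 3))   # → 2.5 1.291
```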
The T-Test

Tests for the equality of means in 2 groups

Null Hypothesis: the two population means are equal
H0: μ1 − μ2 = 0, or μ1 = μ2

Alternative Hypothesis: the two population means are different
HA: μ1 − μ2 ≠ 0, or μ1 ≠ μ2
Test Statistic:

t(df) = (X̄1 − X̄2) / sqrt( s1²/n1 + s2²/n2 )

where s1² = Σ(i=1..n1) (X1i − X̄1)² / (n1 − 1)
and   s2² = Σ(i=1..n2) (X2i − X̄2)² / (n2 − 1)

Degrees of freedom (df): n1 + n2 − 2


Example: Does drug X influence bilirubin levels?

In a study to determine the effectiveness of a drug in lowering the plasma bilirubin level, 14 subjects were randomly divided into two groups (drug X vs. placebo). After 14 days, the change in bilirubin (day 1 − day 14) was estimated in the two groups:

                     Drug X       Placebo
Sample size          7            7
Mean change          1.26 units   0.78 units
Standard deviation   0.32 units   0.32 units
Example: Does drug X influence bilirubin levels?

HO: The reduction in plasma bilirubin levels in the treatment and placebo groups was the same after 14 days.

HA: The reduction in plasma bilirubin levels in the treatment and placebo groups was different after 14 days.
Calculating the t-statistic:

t = (1.26 − 0.78) / sqrt( (.32)²/7 + (.32)²/7 ) = 2.806

Calculating the degrees of freedom:

n1 + n2 − 2 = 7 + 7 − 2 = 12
What is the p-value?
Upper percentile of t distribution

[Figure: t density; shaded area = α to the right of t(α, n)]

df    0.10     0.05     0.01
 1    3.078    6.3138   31.821
 5    1.476    2.0150    3.365
10    1.372    1.8125    2.764
15    1.341    1.7530    2.602
Conclusion:

The mean drop in bilirubin levels of individuals on drug X was significantly greater than for individuals taking the placebo.
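The t-statistic above can be computed from the summary statistics alone (illustrative function, standard library only):

```python
from math import sqrt

def two_sample_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic from summary stats, df = n1 + n2 - 2."""
    t = (m1 - m2) / sqrt(s1**2 / n1 + s2**2 / n2)
    return t, n1 + n2 - 2

# Drug X (mean 1.26, sd 0.32, n=7) vs placebo (mean 0.78, sd 0.32, n=7)
t, df = two_sample_t(1.26, 0.32, 7, 0.78, 0.32, 7)
print(round(t, 3), df)   # → 2.806 12
```

Since 2.806 with 12 df exceeds the tabulated 0.05 critical values, the difference is significant, matching the conclusion above.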
CI around a mean of a normally distributed variable

Sample mean = X̄ = Σ(i=1..n) Xi / n

Small sample (1−α)100% confidence interval:

X̄ ± t(α/2) × s/√n

where s/√n is the estimated standard error of the mean.


Illustration:
Given a mean of 0.53 and a standard deviation of 0.0559 where n = 6, the 95% confidence interval would be

0.53 ± 2.571 × 0.0559/√6, or 0.53 ± 0.059

When two confidence intervals do not overlap for two subgroups, it indicates that the means are significantly different.
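The small-sample interval from the illustration can be checked in Python (illustrative helper; the t critical value is passed in by hand, as in the slides):

```python
from math import sqrt

def mean_ci(xbar, s, n, t_crit):
    """Small-sample CI for a mean: xbar +/- t * s / sqrt(n)."""
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half

# Mean 0.53, sd 0.0559, n = 6; t = 2.571 (5 df, 95% two-sided)
lo, hi = mean_ci(0.53, 0.0559, 6, 2.571)
print(round(lo, 3), round(hi, 3))   # → 0.471 0.589
```

This reproduces the slide's 0.53 ± 0.059.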
Summary: Steps in Statistical Analysis

1. Identify H0 and HA
2. Identify a test statistic
3. Determine a significance level, e.g. α = 0.05 or α = 0.01
4. Critical value determines rejection / acceptance region
5. Compute the p-value
6. Interpret the result