Association 1

epidemiology?

Generalization of conclusions:

sample → population

Assess strength of evidence

Make comparisons

Make predictions

Role of Statistics in Public Health and Medicine

Science                                  Statistics

1. Idea or Question                      1. Stats model / hypothesis

2. Collect data and make observations    2. Study design

3. Describe data                         3. Descriptive statistics

strength of evidence for or against the hypothesis

There are basically two main fields in statistics:

Point Estimation: for example, estimating Relative Risks

Hypothesis Testing: for example, testing if Relative Risk = 1

Types of Data

Categorical (qualitative)

Nominal scale - no natural order

gender, marital status, race

Ordinal scale

severity scale, good/better/best

Types of Data

Numerical (quantitative)

Discrete - (few) integer values

number of children in a family

Continuous - measure to arbitrary precision

blood pressure, weight

Dependent versus Independent Variables

These terms developed out of an experimental research paradigm.

Dependent variable: the variable we are trying to explain or predict

Independent variable: the variable we are using to try to explain or predict the dependent variable

Overview of Association Methods

Dependent Variable        Independent Variable      Method

Categorical (Discrete)    Categorical (Discrete)    Relative Risk (C.I.)

                                                    Odds Ratio (C.I.)

                                                    Chi-square test

                                                    Test of proportions

(Discrete) Discriminate

Overview of Association Methods

Dependent Variable    Independent Variable      Method

Continuous            Continuous                Linear regression

                                                Correlations

Continuous            Categorical (Discrete)    Analysis of Variance

What is statistical significance and what does it tell us?

We use statistics to tell us whether apparent differences between samples are real or due to chance.

p-value: the probability that the test statistic would be as extreme or more extreme than observed if the null hypothesis were true.

5% of the time we would observe this test

statistic by chance alone (i.e., there is no

What is a confidence interval?

A 95% confidence interval is a range computed from the sample so that, over repeated sampling, 95% of the intervals built this way contain the true value of the parameter.

[Figure: 95% and 99% confidence intervals drawn around the same point estimate; the 99% interval is the wider of the two.]
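The "95%" in a confidence interval can be read as a long-run property of the procedure. The following is a minimal simulation sketch of that idea, not part of the original slides. The population mean of 120, standard deviation of 15, sample size of 25 and the fixed seed are all hypothetical choices, and the standard deviation is assumed known so that the multiplier 1.96 applies.

```python
import math
import random


def coverage(true_mean=120.0, sigma=15.0, n=25, reps=2000, z=1.96, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # known-sigma interval half-width
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / reps


estimated = coverage()
print(f"share of intervals containing the true mean: {estimated:.3f}")
```

The printed share should land near 0.95, with some simulation noise.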

Measures of association for categorical dependent variables and categorical independent variables

Test of Proportions and Chi-Square test are used in related but different situations

1) We sample members of two (or more) groups and classify each member according to some qualitative characteristic (e.g. cigarette smoking). The hypothesis is

H0: groups are homogeneous (p1j = p2j for every category j)

HA: groups are not homogeneous

2) We sample members of a population and cross-classify each member according to two qualitative characteristics. The hypothesis is

H0: factors are independent (pij = pi. p.j)

HA: factors are not independent

Test of Proportions

The hypothesis that two groups are the same is addressed by the hypotheses:

$H_0: p_1 = p_2$

$H_A: p_1 \neq p_2$

A statistic useful for this comparison is the difference in the observed, or sample, proportions

$$\hat p_1 - \hat p_2$$

Q: What is the distribution of this statistic?

A: Approximately normal.

$$\hat p_1 - \hat p_2 \sim N\left(p_1 - p_2,\ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right)$$

Estimator for $(p_1 - p_2)$: $\hat p_1 - \hat p_2$

Standard deviation of the estimator, with $q = 1 - p$:

$$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

Standard Normal Distribution, Z

Confidence Interval for a difference in proportions

Confidence intervals are computed for the difference in the population proportions, $p_1 - p_2$. In this case, we don't have any assumption regarding the relationship between $p_1$ and $p_2$, so use the following 95% interval, built from the separate sample estimates:

$$\hat p_1 - \hat p_2 \pm 1.96 \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$$

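Simple Test of proportions

$$Z = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\frac{\hat p_0(1-\hat p_0)}{n_1} + \frac{\hat p_0(1-\hat p_0)}{n_2}}}$$

Here $\hat p_0$ is the common estimate of the proportion based on $H_0$. A confidence interval uses the separate estimates, $\hat p_1$ and $\hat p_2$, instead of the common estimate based on $H_0$.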

Example:

The proportion of men who had their blood pressure reduced by a certain drug

versus

the proportion of women who had their blood pressure reduced.

In a study to investigate drug Y's potential in lowering blood pressure between hypertensive men and women, 50 women and 100 men were given the drug. At the end of the study, the results below were reported. Compare the proportions of men and women who had their blood pressure reduced with a 99% confidence interval.

                     Men    Women

Sample size          100    50

# with reduced BP    65     38

H0: The proportion of men who had their blood pressure reduced is the same as that of the women who had their blood pressure reduced.

HA: The proportion of men who had their blood pressure reduced is not the same as that of the women who had their blood pressure reduced.

Proportion of men with reduced blood pressure:

65/100 = .65 = p1

Proportion of women with reduced blood pressure:

38/50 = .76 = p2

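Point estimate for $p_2 - p_1$ = .76 - .65 = .11

Standard deviation of the difference, with $q = 1 - p$:

$$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{.65 \times .35}{100} + \frac{.76 \times .24}{50}} = .0770$$

$$Z = \frac{p_2 - p_1}{\text{standard deviation}} = \frac{.76 - .65}{.0770} = \frac{.11}{.0770} = 1.43$$

Is this value of Z significant at the chosen level of significance?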

Standard Normal Distribution, Z

What is the 99% confidence interval?

$$(p_2 - p_1) \pm 2.58 \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

= .11 +/- .199

= (-0.089, 0.309)

Conclusion:

The 99% confidence interval includes 0, so there is no significant difference between p1 and p2.
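The blood pressure example can be checked numerically. The sketch below is an illustration rather than the lecturer's own code: it uses the unpooled standard deviation from the worked example, takes the difference as women minus men, and obtains the two-sided p-value from the standard normal distribution through `math.erfc`. The function name `two_proportion_summary` is hypothetical.

```python
import math


def two_proportion_summary(x1, n1, x2, n2, z_crit):
    """Difference in sample proportions (group 2 minus group 1),
    its unpooled standard deviation, Z, two-sided p-value, and CI."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, se, z, p_value, ci


# Group 1 = men (65 of 100), group 2 = women (38 of 50); 2.58 gives a 99% CI
diff, se, z, p_value, ci = two_proportion_summary(65, 100, 38, 50, z_crit=2.58)
print(f"difference = {diff:.2f}, standard deviation = {se:.4f}, Z = {z:.2f}")
print(f"two-sided p-value = {p_value:.3f}")
print(f"99% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval it prints spans zero, which matches the conclusion of no significant difference.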

Chi-square test

We sample members of a population and cross-classify each member according to two qualitative characteristics. The hypothesis is

H0: factors are independent ($p_{ij} = p_{i.}\,p_{.j}$)

HA: factors are not independent

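Chi - Square Test of Independence

H0: variable 1 is independent of variable 2.

HA: variable 1 is associated with variable 2.

$$\chi^2_{(df)} = \sum \frac{(O - E)^2}{E}$$

or

$$\chi^2 = \sum_{i,j=1}^{k} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \qquad E(n_{ij}) = \frac{n_{i.}\, n_{.j}}{n_{..}}$$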

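Chi-square Test Example

               Diabetic    Not Diabetic    Total

Smoking        50 (n11)    25 (n12)        75 (n1.)

Not smoking    20 (n21)    5 (n22)         25 (n2.)

Total          70 (n.1)    30 (n.2)        100 (n..)

H0: No association between smoking and diabetes.

HA: An association between smoking and diabetes.

Using:

$$\chi^2 = \sum_{i,j=1}^{k} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}$$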

1. Calculate the expected values:

Diabetic Not Diabetic Total

Smoking 75*70/100 75*30/100 75

Not smoking 25*70/100 25*30/100 25

Total 70 30 100

2. Add up the squared differences in Obs - Exp and divide by the expected values

$\chi^2$ = ((50 - (75*70/100))^2 / (75*70/100)) +

((25 - (75*30/100))^2 / (75*30/100)) +

((20 - (25*70/100))^2 / (25*70/100)) +

((5 - (25*30/100))^2 / (25*30/100)) = 1.59

Df = (r-1)(c-1) = (2-1)(2-1) = 1
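The two calculation steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the smoking and diabetes table. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it is valid for df = 1 only.

```python
import math

# Observed counts: rows = smoking / not smoking, columns = diabetic / not diabetic
observed = [[50, 25],
            [20, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # E(n_ij) = n_i. * n_.j / n_..
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: P(chi2_1 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.3f}")
```

The statistic falls below 3.841, the 5% critical value for 1 degree of freedom, so the hypothesis of no association is not rejected at that level.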

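Upper percentiles of $\chi^2$ distributions

[Figure: density of the $\chi^2$ distribution with k degrees of freedom, starting at 0, with the critical value $\chi^2_{p,k}$ marked; Area = 1 - p]

Critical values by upper-tail area:

k (df)    0.10      0.05      0.01

1         2.706     3.841     6.635

5         9.236     11.070    15.086

10        15.987    18.307    23.209

15        22.307    24.996    30.578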

Back to RRs and ORs:

To assess whether epidemiological measures of association (relative risk and odds ratio) are different from the null value we focus on confidence interval estimation.

Confidence Interval of the RR

Diseased Not Diseased Total

Exposure a b a+b

No exposure c d c+d

Total a+c b+d a+b+c+d

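$$RR = \frac{a/(a+b)}{c/(c+d)}$$

The confidence interval is obtained by taking the natural logarithm (ln) of the RR because the only way to get a reasonable formula for the variance of RR is to work in the world of natural logs.

$$\text{Variance for } \ln RR = \frac{b}{a(a+b)} + \frac{d}{c(c+d)}$$

$$\text{95\% Confidence Interval} = \exp\left(\ln RR \pm 1.96 \sqrt{\text{Variance for } \ln RR}\right)$$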

Illustration:

Results of a cohort study that followed 100

non-diabetic nurses for 15 years. At the end

of the 15 years their smoking behavior was

related to their diabetic status.

Diabetic Not Diabetic Total

Smoking 50 25 75

Not smoking 20 5 25

Total 70 30 100

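$$RR = \frac{50/75}{20/25} = 0.666667/0.8 = 0.833333$$

ln RR = -0.18232

Variance for ln RR = 25/(50*75) + 5/(20*25) = 0.016667, so the standard error is 0.1291

95% CI for ln RR = -0.18232 +/- 1.96*0.1291

= (-0.43536, 0.07071)

95% CI for RR = (exp(-0.43536), exp(0.07071))

= (0.64703, 1.0733)

The interval includes 1, so there is not a significant relationship between smoking and diabetes.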
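As a check on the cohort illustration, the log-scale interval for the relative risk can be sketched as below. The function name `relative_risk_ci` is hypothetical, and the multiplier 1.96 (a 95% interval) is taken from the illustration.

```python
import math


def relative_risk_ci(a, b, c, d, z=1.96):
    """Relative risk and its CI from a 2x2 table.

    a, b = exposed with / without disease
    c, d = unexposed with / without disease
    """
    rr = (a / (a + b)) / (c / (c + d))
    var_ln_rr = b / (a * (a + b)) + d / (c * (c + d))
    half_width = z * math.sqrt(var_ln_rr)
    ln_rr = math.log(rr)
    # Build the interval on the log scale, then transform back
    return rr, (math.exp(ln_rr - half_width), math.exp(ln_rr + half_width))


rr, (lower, upper) = relative_risk_ci(50, 25, 20, 5)
print(f"RR = {rr:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("interval includes 1:", lower < 1 < upper)
```

Working on the log scale keeps the lower limit positive and makes the interval asymmetric around the RR, as expected for a ratio.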

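Confidence Interval of the Odds Ratio

Odds for disease = p/(1-p)

$$OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$

               Diseased    Not Diseased    Total

Exposure       a           b               a+b

No exposure    c           d               c+d

Total          a+c         b+d             a+b+c+d

Using the Taylor expansion method, an approximate $(1-\alpha)$ CI for the OR is:

$$CI = \frac{ad}{bc} \exp\left[\pm z_{\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right]$$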

Results of a case-control study that investigated the relation between diabetes and smoking.

               Diabetic    Not Diabetic    Total

Smoking        50          25              75

Not smoking    20          5               25

Total          70          30              100

H0: No association between smoking and diabetes.

HA: An association between smoking and diabetes.

$$OR = \frac{50 \times 5}{20 \times 25} = 0.5$$
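The slides stop at the odds ratio itself. As a hedged sketch, the Taylor-expansion interval can be applied to the same case-control table as follows. The choice of a 95% level (z = 1.96) is an assumption made here for illustration, and `odds_ratio_ci` is a hypothetical helper name.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with the approximate CI
    (ad/bc) * exp(+/- z * sqrt(1/a + 1/b + 1/c + 1/d))."""
    odds_ratio = (a * d) / (b * c)
    se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = odds_ratio * math.exp(-z * se_ln_or)
    upper = odds_ratio * math.exp(z * se_ln_or)
    return odds_ratio, (lower, upper)


# Smoking: 50 diabetic, 25 not; not smoking: 20 diabetic, 5 not
odds_ratio, (lower, upper) = odds_ratio_ci(50, 25, 20, 5)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("interval includes 1:", lower < 1 < upper)
```

With these counts the interval contains the null value of 1, so the data do not show a significant association at the assumed 5% level.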

Measures of association for continuous dependent variables and categorical independent variables

Normal Distribution

A common probability model for continuous

data

Can be used to characterize the Binomial or

Poisson under certain circumstances

Bell-shaped curve

takes values between -∞ and +∞

symmetric about mean

mean=median=mode

Examples

birthweights, height, weight

The arithmetic mean is the most common measure of the central location of a sample.

$$\bar X = \frac{1}{n} \sum_{j=1}^{n} X_j$$

The sample variance describes how dispersed the values are around the mean. It is a measure of variation.

$$s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - \bar X)^2$$
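The mean and the sample variance (with its n - 1 divisor) can be computed directly from their definitions. The data values below are hypothetical and serve only to illustrate the arithmetic; the result is compared with Python's `statistics` module.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

n = len(values)
mean = sum(values) / n
# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

print(f"mean = {mean}, sample variance = {variance:.4f}")
print("matches statistics.variance:",
      abs(variance - statistics.variance(values)) < 1e-12)
```

Dividing by n - 1 rather than n is what `statistics.variance` does as well; `statistics.pvariance` divides by n.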

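The T-Test

Tests for the equality of means in 2 groups

Null Hypothesis:

The two population means are equal

$H_0: \mu_1 - \mu_2 = 0$ or $\mu_1 = \mu_2$

Alternate Hypothesis:

The two population means are different

$H_A: \mu_1 - \mu_2 \neq 0$ or $\mu_1 \neq \mu_2$

$$\text{Test Statistic: } t_{(df)} = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where

$$s_1^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar X_1)^2}{n_1 - 1} \quad \text{and} \quad s_2^2 = \frac{\sum_{i=1}^{n_2} (X_{2i} - \bar X_2)^2}{n_2 - 1}$$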

Example: Does drug X influence bilirubin levels?

In a study of the effectiveness of a drug in lowering the plasma bilirubin level, 14 subjects were randomly divided into two groups (drug X vs. placebo). After 14 days, the change in bilirubin was estimated in the two groups (day 1 - day 14).

                      Drug X        Placebo

Sample size           7             7

Mean change           1.26 units    0.78 units

Standard deviation    0.32 units    0.32 units

H0: The mean changes in bilirubin in the treatment and placebo groups were the same after 14 days.

HA: The mean changes in bilirubin in the treatment and placebo groups were different after 14 days.

Calculating the t-statistic:

$$t = \frac{1.26 - 0.78}{\sqrt{\frac{(.32)^2}{7} + \frac{(.32)^2}{7}}} = 2.806$$

$$df = n_1 + n_2 - 2 = 7 + 7 - 2 = 12$$

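What is the p-value?

Upper percentile of t distribution

[Figure: density of the t distribution with the critical value $t_n$ marked to the right of 0 and the upper-tail area shaded]

Critical values by upper-tail area:

n (df)    0.10     0.05      0.01

1         3.078    6.3138    31.821

5         1.476    2.0150    3.365

10        1.372    1.8125    2.764

15        1.341    1.7530    2.602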

Conclusion:

The mean reduction in bilirubin for individuals on drug X was significantly greater than for individuals taking the placebo.
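The bilirubin example can be reproduced from the summary statistics alone. This is a minimal sketch, not the lecturer's code. The critical value 2.179 (two-sided 5% level, 12 degrees of freedom) does not appear in the table above; it is taken from a standard t table and is an assumption of this example, as is the helper name `two_sample_t`.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for a difference in means, with df = n1 + n2 - 2."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / standard_error
    df = n1 + n2 - 2
    return t, df


# Drug X: mean change 1.26, SD 0.32, n = 7; placebo: 0.78, 0.32, n = 7
t, df = two_sample_t(1.26, 0.32, 7, 0.78, 0.32, 7)

T_CRIT_DF12 = 2.179  # assumption: two-sided 5% critical value for df = 12

print(f"t = {t:.3f}, df = {df}")
print("significant at the two-sided 5% level:", abs(t) > T_CRIT_DF12)
```

Because the statistic exceeds the assumed critical value, the sketch agrees with the conclusion that the two groups differ.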

CI around a mean of a normally distributed variable

$$\text{Sample mean} = \bar X = \frac{\sum_{i=1}^{n} X_i}{n}$$

$$\bar X \pm t_{\alpha/2} \frac{S}{\sqrt{n}}$$

Illustration:

Given a mean of 0.53 and a standard deviation of .0559 where n=6, the 95% confidence interval would be

$$.53 \pm 2.571 \times \frac{.0559}{\sqrt{6}}$$

or .53 ± .059
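The interval in this illustration is simple to verify. The sketch below plugs in the values from the slide, including the multiplier 2.571, which is the t value for 5 degrees of freedom at the two-sided 5% level. The function name `mean_ci` is hypothetical.

```python
import math


def mean_ci(mean, sd, n, t_crit):
    """Confidence interval for a mean: mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width, half_width


low, high, half_width = mean_ci(0.53, 0.0559, 6, t_crit=2.571)
print(f"half-width = {half_width:.3f}")
print(f"95% CI = ({low:.3f}, {high:.3f})")
```

For a different sample size the multiplier has to be looked up again, because it depends on n - 1 degrees of freedom.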

If the confidence intervals of two subgroups do not overlap, it indicates that the means are significantly different.

Summary: Steps in Statistical Analysis

1. Identify H0 and HA

2. Identify a test statistic

3. Determine a significance level, α = 0.05 or α = 0.01

4. Critical value determines rejection / acceptance region

5. p-value

6. Interpret the result

