
CPH Exam Review

Biostatistics

Lisa Sullivan, PhD

Associate Dean for Education
Professor and Chair, Department of Biostatistics
Boston University School of Public Health

Outline and Goals

Overview of Biostatistics (Core Area)
Terminology and Definitions
Practice Questions

An archived version of this review, along with the PPT file, will
be available on the NBPHE website (www.nbphe.org) under
Study Resources

Biostatistics
Two Areas of Applied Biostatistics:
Descriptive Statistics
Summarize a sample selected from a
population
Inferential Statistics
Make inferences about population
parameters based on sample statistics.

Variable Types
Dichotomous variables have 2 possible
responses (e.g., Yes/No)
Ordinal and categorical variables have
more than two responses and responses
are ordered and unordered, respectively
Continuous (or measurement) variables
assume in theory any values between a
theoretical minimum and maximum

We want to study whether individuals over 45

years are at greater risk of diabetes than those
younger than 45. What kind of variable is age?

1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous

We are interested in assessing disparities in

infant morbidity by race/ethnicity. What
kind of variable is race/ethnicity?

1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous

Numerical Summaries of Dichotomous,
Categorical and Ordinal Variables
Frequency Distribution Table

Health Status   Freq.   Rel. Freq.   Cumulative Freq.*   Cumulative Rel. Freq.*
Excellent       19      38%          19                  38%
Very Good       12      24%          31                  62%
Good             9      18%          40                  80%
Fair             6      12%          46                  92%
Poor             4       8%          50                  100%
Total           n=50    100%

* Ordinal variables only
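
The columns of the frequency distribution table can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the counts for Good, Fair and Poor (9, 6, 4) are inferred from the stated relative frequencies (18%, 12%, 8% of n=50) and the cumulative frequencies.

```python
from itertools import accumulate

# Ordered health-status categories and their counts.
# 9, 6 and 4 are inferred from 18%, 12% and 8% of n = 50.
categories = ["Excellent", "Very Good", "Good", "Fair", "Poor"]
freq = [19, 12, 9, 6, 4]

n = sum(freq)
rel_freq = [f / n for f in freq]            # relative frequencies
cum_freq = list(accumulate(freq))           # cumulative frequencies
cum_rel_freq = [c / n for c in cum_freq]    # cumulative relative frequencies

for row in zip(categories, freq, rel_freq, cum_freq, cum_rel_freq):
    print("{:<10} {:>3} {:>5.0%} {:>3} {:>5.0%}".format(*row))
```

Cumulative columns are only meaningful here because the categories are ordered.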

Relative Frequency Histogram

Continuous Variables
Assume, in theory, any value between
a theoretical minimum and maximum
Quantitative, measurement variables
Example: systolic blood pressure

Second sample

Summarizing Location and

Variability
When there are no outliers, the sample
mean and standard deviation
summarize location and variability
When there are outliers, the median
and interquartile range (IQR)
summarize location and variability,
where IQR = Q3-Q1
Outliers: values < Q1 - 1.5 IQR or > Q3 + 1.5 IQR

[Box-and-whisker plot marking Min, Q1, Median, Q3 and Max]
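
The outlier rule (values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR) can be sketched as follows. The data are hypothetical SBP readings invented for illustration, and `iqr_outliers` is a helper name chosen here. Quartile conventions differ between textbooks and software, so the exact fences may vary slightly with the method used.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical systolic blood pressures; 190 is deliberately extreme.
sbp = [100, 105, 110, 112, 115, 118, 120, 122, 125, 190]
outliers = iqr_outliers(sbp)
print(outliers)
```

With an outlier present, the median and IQR would be the preferred summaries for this sample.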

Comparing Samples with Box and Whisker Plots

[Side-by-side box-and-whisker plots; axis marked from 100 to 160]

What type of display is shown below?

[Figure: Percent Patients by Disease Stage; vertical axis % from 0 to 35, horizontal axis stages I, II, III, IV]

1. Frequency bar chart
2. Relative frequency bar chart
3. Frequency histogram
4. Relative frequency histogram

The distribution of SBP in men, 20-29 years
is shown below. What is the best summary
of a typical value?

1. Mean
2. Median
3. Interquartile range
4. Standard Deviation

When data are skewed, the mean

is higher than the median.
1. True
2. False

The best summary of variability for the

following continuous variable is

1. Mean
2. Median
3. Interquartile range
4. Standard Deviation

Numerical and Graphical Summaries
Dichotomous and categorical
Frequencies and relative frequencies
Bar charts (freq. or relative freq.)

Ordinal
Frequencies, relative frequencies,
cumulative frequencies and cumulative
relative frequencies
Histograms (freq. or relative freq.)

Continuous
n, $\bar{X}$ and s, or median and IQR (if outliers)
Box-and-whisker plot

What is the probability of selecting a
male with optimal blood pressure?

          Blood Pressure Category
          Optimal   Normal   Pre-Htn   Htn   Total
Male      20        15       15        30    80
Female    5         15       25        25    70
Total     25        30       40        55    150

1. 20/25
2. 20/80
3. 20/150

What is the probability of selecting a
patient with Pre-Htn or Htn?

          Blood Pressure Category
          Optimal   Normal   Pre-Htn   Htn   Total
Male      20        15       15        30    80
Female    5         15       25        25    70
Total     25        30       40        55    150

1. 95/150
2. 45/80
3. 55/150

What proportion of men have prevalent CVD?

         CVD   Free of CVD
Men      35    265
Women    45    355

1. 35/80
2. 35/265
3. 35/300

What proportion of patients with CVD are men?

         CVD   Free of CVD
Men      35    265
Women    45    355

1. 35/700
2. 35/80
3. 80/300

Are Family History and Current Status Independent?
Example. Consider the following table, which cross-classifies
subjects by their family history of CVD and
current (prevalent) CVD status.

                  Current CVD
Family History    No     Yes
No                215    25
Yes               90     15

P(Current CVD | Family Hx) = 15/105 = 0.143
P(Current CVD | No Family Hx) = 25/240 = 0.104
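
The two conditional probabilities can be checked directly from the cell counts. The sketch below is an illustration rather than part of the original slides; `p_cvd_given` is a helper name chosen here. Under independence, both conditional probabilities would equal the overall P(Current CVD).

```python
# Cell counts keyed by (family history, current CVD).
counts = {
    ("no", "no"): 215, ("no", "yes"): 25,
    ("yes", "no"): 90, ("yes", "yes"): 15,
}

def p_cvd_given(family_hx):
    """P(Current CVD | family history status), from the row of the table."""
    row_total = counts[(family_hx, "no")] + counts[(family_hx, "yes")]
    return counts[(family_hx, "yes")] / row_total

total = sum(counts.values())
p_cvd = (counts[("no", "yes")] + counts[("yes", "yes")]) / total  # marginal

p_fh = p_cvd_given("yes")      # 15/105
p_no_fh = p_cvd_given("no")    # 25/240
print(round(p_fh, 3), round(p_no_fh, 3), round(p_cvd, 3))
```

Comparing the two conditional probabilities is a descriptive check only; whether the difference is statistically significant is a separate question.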

Are symptoms independent of disease?

               Disease   No Disease   Total
Symptoms       25        225          250
No Symptoms    50        450          500

1. No
2. Yes

Probability Models
Binomial Distribution
Two possible outcomes: success and
failure
Replications of process are independent
P(success) is constant for each
replication

$P(x) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}$

Mean = np, variance = np(1-p)

Probability Models
Poisson Distribution
Two possible outcomes: success and
failure
Replications of process are independent
Often used to model counts (often used
to model rare events)

$P(x) = \frac{\mu^x e^{-\mu}}{x!}$

Mean = $\mu$, variance = $\mu$
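
Both probability models are easy to evaluate numerically. The following is a minimal sketch using only the Python standard library; the parameter values (n=10, p=0.2, mean 2) are hypothetical and chosen only to illustrate the formulas.

```python
import math

def binomial_pmf(x, n, p):
    """P(x successes in n independent trials with success probability p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """P(x events) for a Poisson distribution with mean mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

# Hypothetical binomial example: n = 10 trials, p = 0.2.
n, p = 10, 0.2
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))

print(round(mean, 4), round(variance, 4))   # compare with np and np(1-p)
print(poisson_pmf(0, 2.0))                  # P(no events) when the mean is 2
```

Summing over all outcomes recovers the stated mean np and variance np(1-p).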

Probability Models
Normal Distribution
Model for continuous outcome
Mean=median=mode

Normal Distribution
Properties of Normal Distribution
i) The normal distribution is symmetric about the
mean (i.e., P(X > $\mu$) = P(X < $\mu$) = 0.5).
ii) The mean and variance ($\mu$ and $\sigma^2$) completely
characterize the normal distribution.
iii) The mean = the median = the mode
iv) Approximately 68% of obs between mean ± 1 sd,
95% between mean ± 2 sd, and >99% between
mean ± 3 sd

Normal Distribution
Body mass index (BMI) for men age 60 is
normally distributed with a mean of 29 and
standard deviation of 6.
What is the probability that
a male has BMI < 29?
P(X<29)= 0.5
[Normal curve; BMI axis marked at 11, 17, 23, 29, 35, 41, 47]

Normal Distribution
What is the probability that a male has BMI less than
30?

P(X<30)=?
[Normal curve; BMI axis marked at 11, 17, 23, 29, 35, 41, 47]

Standard Normal Distribution Z

Normal distribution with $\mu = 0$ and $\sigma = 1$

[Standard normal curve; Z axis]

Normal Distribution

$Z = \frac{x - \mu}{\sigma} = \frac{30 - 29}{6} = 0.17$

P(X<30) = P(Z<0.17) = 0.5675
From a table of standard normal
probabilities or statistical
computing package.
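
As an example of the "statistical computing package" route, the standard library class `statistics.NormalDist` (Python 3.8+) gives the same probability. This sketch is an illustration, not part of the original slides. The table lookup rounds Z to 0.17 first, so it differs slightly from the unrounded answer.

```python
from statistics import NormalDist

# BMI for men age 60: mean 29, standard deviation 6 (from the slide).
bmi = NormalDist(mu=29, sigma=6)

z = (30 - 29) / 6                              # standardize X = 30
p_unrounded = bmi.cdf(30)                      # P(X < 30) without rounding Z
p_table = NormalDist().cdf(round(z, 2))        # P(Z < 0.17), as in a Z table

print(round(z, 2), round(p_table, 4), round(p_unrounded, 4))
```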

Comparing Systolic Blood Pressure (SBP)
Suppose for Males Age 50, SBP is
approximately normally distributed with a
mean of 108 and a standard deviation of 14.
Suppose for Females Age 50, SBP is
approximately normally distributed with a
mean of 100 and a standard deviation of 8.

If a Male Age 50 has a SBP = 140 and a
Female Age 50 has a SBP = 120, who has the
relatively higher SBP?

Normal Distribution
ZM = (140 - 108) / 14 = 2.29
ZF = (120 - 100) / 8 = 2.50
Which is more extreme?

Percentiles of the Normal

Distribution
The kth percentile is defined as the score that
holds k percent of the scores below it.
E.g., the 90th percentile is the score that holds
90% of the scores below it.
Q1 = 25th percentile, median = 50th percentile,
Q3 = 75th percentile

Percentiles
For the normal distribution, the following is used
to compute percentiles:
$X = \mu + Z\sigma$
where
$\mu$ = mean of the random variable X,
$\sigma$ = standard deviation, and
Z = value from the standard normal distribution
for the desired percentile (e.g., 95th, Z=1.645).
95th percentile of BMI for Men: 29+1.645(6) = 38.9
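
The percentile formula $X = \mu + Z\sigma$ can be checked with `statistics.NormalDist.inv_cdf`, which returns the Z value for a given percentile. A minimal sketch, assuming the BMI distribution above (mean 29, standard deviation 6):

```python
from statistics import NormalDist

mu, sigma = 29, 6                      # BMI for men age 60 (from the slides)

z95 = NormalDist().inv_cdf(0.95)       # Z for the 95th percentile
bmi_95 = mu + z95 * sigma              # X = mu + Z * sigma

# The same formula gives the quartiles of the BMI distribution.
q1 = mu + NormalDist().inv_cdf(0.25) * sigma
q3 = mu + NormalDist().inv_cdf(0.75) * sigma

print(round(z95, 3), round(bmi_95, 1), round(q1, 1), round(q3, 1))
```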

Central Limit Theorem

(Non-normal) population with mean $\mu$ and standard deviation $\sigma$
Take samples of size n; as long as n is
sufficiently large (usually n > 30 suffices),
the distribution of the sample mean is
approximately normal, therefore can use
Z to compute probabilities

$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Standard error: $\sigma / \sqrt{n}$
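
A small simulation illustrates the Central Limit Theorem. This is a sketch under assumed settings that do not come from the slides: the population is Exponential(1) (right-skewed, mean 1, standard deviation 1), each sample has n=40 observations, and 2000 samples are drawn. The standard deviation of the sample means should be close to the standard error $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 40          # size of each sample (assumed)
reps = 2000     # number of repeated samples (assumed)

# Draw repeated samples from a skewed population and keep each sample mean.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = 1.0 / math.sqrt(n)   # sigma / sqrt(n) with sigma = 1

print(mean_of_means, sd_of_means, standard_error)
```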

Statistical Inference
There are two broad areas of statistical
inference, estimation and hypothesis testing.
Estimation. Population parameter is unknown,
sample statistics are used to generate estimates.
Hypothesis Testing. A statement is made about the
parameter, sample statistics support or refute
statement.

What Analysis To Do When

Nature of primary outcome variable
Continuous, dichotomous, categorical,
time to event

Number of comparison groups

One, 2 independent, 2 matched or
paired, > 2

Associations between variables

Regression analysis

Estimation
Process of determining likely values for
unknown population parameter
Point estimate is best single-valued
estimate for parameter
Confidence interval is range of values for
parameter:
point estimate ± margin of error
point estimate ± t × SE(point estimate)

Hypothesis Testing Procedures

1. Set up null and research
hypotheses, select $\alpha$
2. Select test statistic
3. Set up decision rule
4. Compute test statistic
5. Draw conclusion & summarize
significance (p-value)

P-values
P-values represent the exact
significance of the data
Estimate p-values when rejecting H0
to summarize significance of the data
(approximate with statistical tables,
exact value with computing package)
If p < $\alpha$ then reject H0

Errors in Hypothesis Tests

             Conclusion of Statistical Test
             Do Not Reject H0    Reject H0
H0 true      Correct             Type I error
H0 false     Type II error       Correct

Continuous Outcome
Confidence Interval for $\mu$
Continuous outcome - 1 Sample

n > 30: $\bar{X} \pm Z \frac{s}{\sqrt{n}}$
n < 30: $\bar{X} \pm t \frac{s}{\sqrt{n}}$

Example.
95% CI for mean waiting time at ED
Data: n=100, $\bar{X}$=37.85 and s=9.5 mins

$37.85 \pm 1.96 \frac{9.5}{\sqrt{100}}$
$37.85 \pm 1.86$
(35.99 to 39.71)

Statistical computing packages use t throughout.
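
The waiting-time interval can be reproduced as follows. This is a minimal sketch of the large-sample (Z) version only; `mean_ci` is a helper name chosen here. A package that uses t throughout would give a marginally wider interval.

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """Large-sample confidence interval for a mean: xbar +/- z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# ED waiting-time example from the slide: n = 100, mean 37.85, s = 9.5 minutes.
lower, upper = mean_ci(37.85, 9.5, 100)
print(round(lower, 2), round(upper, 2))
```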

New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
One study sample
Data
On each participant, measure outcome
(yes/no)
n, x = # positive responses, $\hat{p} = x/n$

Dichotomous Outcome
Confidence Interval for p
Dichotomous outcome - 1 Sample

$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
when $\min[n\hat{p}, n(1-\hat{p})] \geq 5$;
otherwise, exact procedures

Example.
In the Framingham Offspring
Study (n=3532), 1219 patients
were on antihypertensive
medications. Generate 95% CI.

$0.345 \pm 1.96 \sqrt{\frac{0.345(1-0.345)}{3532}}$
$0.345 \pm 0.016$
(0.329, 0.361)
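
The Framingham example can be verified in a few lines. The sketch below is illustrative; `proportion_ci` is a helper name chosen here, and it applies the sample-size condition before using the normal approximation.

```python
import math

def proportion_ci(x, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = x / n
    if min(n * p_hat, n * (1 - p_hat)) < 5:
        raise ValueError("sample too small for the normal approximation")
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

# 1219 of 3532 participants on antihypertensive medication (from the slide).
p_hat, lower, upper = proportion_ci(1219, 3532)
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```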

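One Sample Procedures: Comparisons
with Historical/External Control

Continuous
H0: $\mu = \mu_0$
H1: $\mu > \mu_0$, $\mu < \mu_0$, $\mu \neq \mu_0$
n > 30: $Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
n < 30: $t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$

Dichotomous
H0: $p = p_0$
H1: $p > p_0$, $p < p_0$, $p \neq p_0$
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$, when $\min[np_0, n(1-p_0)] \geq 5$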

One Sample Procedures: Comparisons
with Historical/External Control
Categorical or Ordinal outcome
$\chi^2$ Goodness of fit test
H0: $p_1 = p_{10}, p_2 = p_{20}, \ldots, p_k = p_{k0}$
H1: H0 is false

$\chi^2 = \sum \frac{(O - E)^2}{E}$

New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two independent study samples
Data
On each participant, identify group and
measure outcome
$n_1, \bar{X}_1, s_1^2$ (or $s_1$), $n_2, \bar{X}_2, s_2^2$ (or $s_2$)

Two Independent Samples

Cohort Study - Set of Subjects Who
Meet Study Inclusion Criteria

Group 1
Group 2
Mean Group 1 Mean Group 2

Two Independent Samples

RCT: Set of Subjects Who Meet
Study Eligibility Criteria
Randomize

Treatment 1
Mean Trt 1

Treatment 2
Mean Trt 2

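Continuous Outcome
Confidence Interval for ($\mu_1 - \mu_2$)
Continuous outcome - 2 Independent Samples

n1 > 30 and n2 > 30: $(\bar{X}_1 - \bar{X}_2) \pm Z S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
n1 < 30 or n2 < 30: $(\bar{X}_1 - \bar{X}_2) \pm t S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$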

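Hypothesis Testing for ($\mu_1 - \mu_2$)

Continuous outcome
2 Independent Samples
H0: $\mu_1 = \mu_2$ ($\mu_1 - \mu_2 = 0$)
H1: $\mu_1 > \mu_2$, $\mu_1 < \mu_2$, $\mu_1 \neq \mu_2$

Test Statistic
n1 > 30 and n2 > 30: $Z = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
n1 < 30 or n2 < 30: $t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$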
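
The pooled standard deviation and the two-sample test statistic can be sketched as below. For illustration, the inputs are the summary statistics for women from the HDL trial table that appears later in this review (new drug: n=40, mean 38.88, sd 3.97; placebo: n=41, mean 39.24, sd 4.21); `pooled_sd` and `two_sample_stat` are helper names chosen here.

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation Sp for two independent samples."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def two_sample_stat(n1, xbar1, s1, n2, xbar2, s2):
    """Test statistic (Z or t) for H0: mu1 = mu2 using the pooled SD."""
    sp = pooled_sd(n1, s1, n2, s2)
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    return (xbar1 - xbar2) / se

# Women in the HDL trial: new drug vs. placebo. Both n > 30, so Z applies.
sp = pooled_sd(40, 3.97, 41, 4.21)
z = two_sample_stat(40, 38.88, 3.97, 41, 39.24, 4.21)
print(round(sp, 3), round(z, 3))
```

The statistic is compared with the critical value from the Z (or t) distribution for the chosen level of significance.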

An RCT is planned to show the efficacy of

a new drug vs. placebo to lower total
cholesterol.
What are the hypotheses?

1. H0: $\mu_P = \mu_N$  H1: $\mu_P > \mu_N$
2. H0: $\mu_P = \mu_N$  H1: $\mu_P < \mu_N$
3. H0: $\mu_P = \mu_N$  H1: $\mu_P \neq \mu_N$

New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
Two independent study samples
Data
On each participant, identify group and
measure outcome (yes/no)
$n_1, \hat{p}_1, n_2, \hat{p}_2$

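Dichotomous Outcome
Confidence Interval for ($p_1 - p_2$)
Dichotomous outcome - 2 Independent Samples

$\min[n_1\hat{p}_1, n_1(1-\hat{p}_1), n_2\hat{p}_2, n_2(1-\hat{p}_2)] \geq 5$

$(\hat{p}_1 - \hat{p}_2) \pm Z \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$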

Measures of Effect for

Dichotomous Outcomes
Outcome = dichotomous (Y/N or 0/1)
Risk=proportion of successes = x/n
Odds=ratio of successes to failures=x/(n-x)

Measures of Effect for
Dichotomous Outcomes
Risk Difference = $\hat{p}_1 - \hat{p}_2$
Relative Risk = $\hat{p}_1 / \hat{p}_2$
Odds Ratio = $\frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)}$
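
The three measures of effect can be computed from two counts per group. As an illustration, the sketch below reuses the CVD-by-sex counts from the earlier practice questions (35 of 300 men and 45 of 400 women with prevalent CVD); `effect_measures` is a helper name chosen here.

```python
def effect_measures(x1, n1, x2, n2):
    """Risk difference, relative risk and odds ratio for two groups."""
    p1, p2 = x1 / n1, x2 / n2
    risk_difference = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_difference, relative_risk, odds_ratio

# Group 1 = men (35 of 300 with CVD), group 2 = women (45 of 400 with CVD).
rd, rr, odds_ratio = effect_measures(35, 300, 45, 400)
print(round(rd, 4), round(rr, 3), round(odds_ratio, 3))
```

When the outcome is uncommon, as here, the odds ratio and relative risk are close to each other.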

Confidence Intervals for Relative Risk (RR)
Dichotomous outcome
2 Independent Samples

$\ln(RR) \pm Z \sqrt{\frac{(n_1 - x_1)/x_1}{n_1} + \frac{(n_2 - x_2)/x_2}{n_2}}$

exp(lower limit), exp(upper limit)

Confidence Intervals for Odds Ratio (OR)
Dichotomous outcome
2 Independent Samples

$\ln(OR) \pm Z \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$

exp(lower limit), exp(upper limit)
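
Both intervals are built on the log scale and then exponentiated. The sketch below follows the RR and OR formulas above and, for illustration, reuses the CVD-by-sex counts from the earlier practice questions (35 of 300 men, 45 of 400 women); `rr_ci` and `or_ci` are helper names chosen here.

```python
import math

def rr_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for the relative risk, computed on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    se = math.sqrt(((n1 - x1) / x1) / n1 + ((n2 - x2) / x2) / n2)
    return math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

def or_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for the odds ratio, computed on the log scale."""
    odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
    se = math.sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    return (math.exp(math.log(odds_ratio) - z * se),
            math.exp(math.log(odds_ratio) + z * se))

rr_lower, rr_upper = rr_ci(35, 300, 45, 400)
or_lower, or_upper = or_ci(35, 300, 45, 400)
print(round(rr_lower, 3), round(rr_upper, 3))
print(round(or_lower, 3), round(or_upper, 3))
```

The null value for both ratios is 1, so an interval that includes 1 indicates no statistically significant difference.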

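Hypothesis Testing for ($p_1 - p_2$)

Dichotomous outcome
2 Independent Samples
H0: $p_1 = p_2$
H1: $p_1 > p_2$, $p_1 < p_2$, $p_1 \neq p_2$
Test Statistic
$\min[n_1\hat{p}_1, n_1(1-\hat{p}_1), n_2\hat{p}_2, n_2(1-\hat{p}_2)] \geq 5$

$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

where $\hat{p} = (x_1 + x_2)/(n_1 + n_2)$ is the overall proportion of successes.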

Two (Independent) Group

Comparisons
Difference in birth
weight is -106 g,
95% CI for difference
in mean Birth weight:
(-175.3 to -36.7)

New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two matched study samples
Data
On each participant, measure outcome
under each experimental condition
Compute differences (D=X1-X2)
n, $\bar{X}_d$, $s_d$

Subject ID
1
2
.
.

Measure 1
55
70
42
60

Measure 2

Measures taken serially in time or under

different experimental conditions

Crossover Trial
Treatment

Treatment

Eligible
R
Participants
Placebo

Placebo

Confidence Intervals for $\mu_d$

Continuous outcome
2 Matched/Paired Samples
n > 30: $\bar{X}_d \pm Z \frac{s_d}{\sqrt{n}}$
n < 30: $\bar{X}_d \pm t \frac{s_d}{\sqrt{n}}$

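Hypothesis Testing for $\mu_d$

Continuous outcome
2 Matched/Paired Samples
H0: $\mu_d = 0$
H1: $\mu_d > 0$, $\mu_d < 0$, $\mu_d \neq 0$
Test Statistic
n > 30: $Z = \frac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$
n < 30: $t = \frac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$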

Statistical Significance versus

Effect Size
P-value summarizes significance
Confidence intervals give magnitude
of effect
(If null value is included in CI, then
no statistical significance)

means is
1. 0
2. 0.5
3. 1
4. 2

is
1. 0
2. 0.5
3. 1
4. 2

1. 0
2. 0.5
3. 1
4. 2

proportions is
1. 0
2. 0.5
3. 1
4. 2

1. 0
2. 0.5
3. 1
4. 2

A two sided test for the equality of

means produces p=0.20. Reject H0?
1. Yes
2. No
3. Maybe

Hypothesis Testing for More than 2

Means - Analysis of Variance
Continuous outcome
k Independent Samples, k > 2
H0: $\mu_1 = \mu_2 = \cdots = \mu_k$
H1: Means are not all equal
Test Statistic

$F = \frac{\sum n_j(\bar{X}_j - \bar{X})^2/(k-1)}{\sum\sum (X - \bar{X}_j)^2/(N-k)}$

F is ratio of between group variation to within group variation (error)

ANOVA Table

Source of Variation   Sums of Squares                             df    Mean Squares       F
Between Treatments    $SSB = \sum n_j(\bar{X}_j - \bar{X})^2$     k-1   MSB = SSB/(k-1)    F = MSB/MSE
Error                 $SSE = \sum\sum (X - \bar{X}_j)^2$          N-k   MSE = SSE/(N-k)
Total                 $SST = \sum\sum (X - \bar{X})^2$            N-1
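
The entries of the ANOVA table can be computed directly from raw data. The sketch below uses three small hypothetical groups invented for illustration; `one_way_anova` is a helper name chosen here.

```python
def one_way_anova(groups):
    """Return (SSB, SSE, F) for a one-way analysis of variance."""
    k = len(groups)
    total_n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / total_n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-treatment and error (within-group) sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)           # mean square between
    mse = sse / (total_n - k)     # mean square error
    return ssb, sse, msb / mse

# Hypothetical outcome values for k = 3 independent groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ssb, sse, f_stat = one_way_anova(groups)
print(ssb, sse, f_stat)
```

The F statistic is then compared with the critical value of the F distribution with k-1 and N-k degrees of freedom.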

ANOVA
When the sample sizes are equal, the
design is said to be balanced
Balanced designs give greatest power
and are more robust to violations of
the normality assumption

Extensions
Multiple Comparison Procedures
Used to test for specific differences in
means after rejecting equality of all
means (e.g., Tukey, Scheffe)
Higher-Order ANOVA - Tests for
differences in means as a function of
several factors

Extensions
Repeated Measures ANOVA - Tests for
differences in means when there are
multiple measurements in the same
participants (e.g., measures taken
serially in time)

$\chi^2$ Test of Independence
Dichotomous, ordinal or categorical outcome
2 or More Samples
H0: The distribution of the outcome is
independent of the groups
H1: H0 is false
Test Statistic

$\chi^2 = \sum \frac{(O - E)^2}{E}$
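
The test statistic sums $(O - E)^2/E$ over all cells, where each expected count is (row total × column total)/N. The sketch below is an illustration that applies it to two tables from earlier in this review: family history by current CVD, and symptoms by disease. `chi_square_independence` is a helper name chosen here.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r-by-c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: no family history / family history; columns: no CVD / CVD.
chi2_fh, df_fh = chi_square_independence([[215, 25], [90, 15]])

# Rows: symptoms / no symptoms; columns: disease / no disease.
chi2_sym, df_sym = chi_square_independence([[25, 225], [50, 450]])

print(round(chi2_fh, 3), df_fh, round(chi2_sym, 3), df_sym)
```

For df=1 the 5% critical value is 3.84, the same cutoff quoted for the log rank test later in this review.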

$\chi^2$ Test of Independence
Data organization (r by c table)

           Outcome
Group      1      2      3
1          20%    40%    40%
2          50%    25%    25%
3          90%    5%     5%

Is the distribution of the outcome different
across (associated with) groups?

In Framingham Heart Study, we want to

assess risk factors for Impaired Glucose
Outcome = Glucose Category
Diabetes (glucose > 126),
Impaired Fasting Glucose (glucose 100-125),
Normal Glucose

Risk Factors

Sex
Age
BMI (normal weight, overweight, obese)
Genetics

What test would be used to assess whether

sex is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess whether

age is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess whether

BMI is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

In Framingham Heart Study, we want to

assess risk factors for Glucose Level
Consider a Secondary Outcome =
Fasting Glucose Level
Risk Factors

Sex
Age
BMI (normal weight, overweight, obese)
Genetics

What test would be used to assess whether

sex is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess whether

BMI is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess whether

age is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

In Framingham Heart Study, we want to

assess risk factors for Diabetes
Consider a Tertiary Outcome =
Diabetes Vs No Diabetes
Risk Factors

Sex
Age
BMI (normal weight, overweight, obese)
Genetics

What test would be used to assess

whether sex is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess

whether BMI is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

What test would be used to assess

whether age is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other

Correlation
Correlation (r) measures the nature
and strength of linear association
between two variables at a time
Regression gives the equation that best
describes the relationship between
variables

What is the most likely value of r
for the data shown below?

[Scatter plot of Y versus X]

1. r=-0.5
2. r=0
3. r=0.5
4. r=1

What is the most likely value of r
for the data shown below?

[Scatter plot of Y versus X]

1. r=-0.5
2. r=0
3. r=0.5
4. r=1

Simple Linear Regression

Y = Dependent, Outcome variable
X = Independent, Predictor variable
$\hat{y} = b_0 + b_1 x$
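
The least squares estimates and the correlation coefficient share the same building blocks. The sketch below uses the standard formulas $b_1 = S_{xy}/S_{xx}$, $b_0 = \bar{y} - b_1\bar{x}$ and $r = S_{xy}/\sqrt{S_{xx}S_{yy}}$, which are not stated on the slides, with a small hypothetical data set; `fit_line` is a helper name chosen here.

```python
import math

def fit_line(x, y):
    """Least squares intercept b0, slope b1 and correlation r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)

    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return b0, b1, r

# Hypothetical predictor (x) and outcome (y) values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r = fit_line(x, y)
print(round(b0, 2), round(b1, 2), round(r, 4))
```

The slope and r always have the same sign, since both take their sign from $S_{xy}$.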

Simple Linear Regression

Assumptions
Linear relationship between X and Y
Independence of errors
Homoscedasticity (constant variance) of
the errors
Normality of errors

Multiple Linear Regression

Useful when we want to jointly
examine the effect of several X
variables on the outcome Y variable.
Y = continuous outcome variable
X1, X2, ..., Xp = set of independent or
predictor variables

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

Multiple Regression Analysis

Model is conditional, parameter
estimates are conditioned on other
variables in model
Perform overall test of regression
If significant, examine individual
predictors
Relative importance of predictors by p-values (or standardized coefficients)

Multiple Regression Analysis

Predictors can be continuous,
indicator variables (0/1) or a set of
dummy variables
Dummy variables (for categorical
predictors)
Race: white, black, Hispanic
Black (1 if black, 0 otherwise)
Hispanic (1 if Hispanic, 0 otherwise)

Definitions
Confounding: the distortion of the
effect of a risk factor on an outcome
Effect Modification: a different
relationship between the risk factor
and an outcome depending on the
level of another variable

Multiple Regression for SBP:
Comparison of Parameter Estimates

            Simple Models        Multiple Regression
            b        p           b        p
Age         1.03     <.0001      0.86     <.0001
Male        -2.26    .0009       -2.22    .0002
BMI         1.80     <.0001      1.48     <.0001
BP Meds     33.38    <.0001      24.12    <.0001

RCT of New Drug to Raise HDL
Example of Effect Modification

Women       N     Mean     Std Dev
New drug    40    38.88    3.97
Placebo     41    39.24    4.21

Men         N     Mean     Std Dev
New drug    10    45.25    1.89
Placebo           39.06    2.22

Simple Logistic Regression

Outcome is dichotomous (binary)
We model the probability p of having
the disease.
$p = \frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$

$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$

Multiple Logistic Regression

Outcome is dichotomous (1=event,
0=non-event) and p=P(event)
Outcome is modeled as log odds

$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$

Multiple Logistic Regression for
Birth Defect (Y/N)

Predictor    b        p        OR (95% CI for OR)
Intercept    -1.099   0.0994
Smoke        1.062    0.2973   2.89 (0.34, 22.51)
Age          0.298    0.0420   1.35 (1.02, 1.78)

Interpretation of OR for age:
The odds of having a birth defect for the older of two
mothers differing in age by one year are estimated to
be 1.35 times higher after adjusting for smoking.
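
Each odds ratio in the table is the exponentiated regression coefficient, OR = exp(b). The sketch below checks this for the two predictors and includes the logit and its inverse for reference; `logit` and `inverse_logit` are helper names chosen here.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def inverse_logit(log_odds):
    """Probability corresponding to a given log odds."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Regression coefficients from the birth defect model above.
coefficients = {"Smoke": 1.062, "Age": 0.298}
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    print(name, round(value, 2))
```

The confidence limits for each OR are obtained the same way, by exponentiating the limits of the interval for b.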

Survival Analysis
Outcome is the time to an event.
An event could be heart attack,
cancer remission or death.
Measure whether person has event or not
(Yes/No) and if so, their time to event.
Determine factors associated with longer
survival.

Survival Analysis
Incomplete follow-up information
Censoring
Measure follow-up time and not time to
event
We know survival time > follow-up time

Log rank test to compare survival in

two or more independent groups

H0: Two survival curves are equal

$\chi^2$ Test with df=1. Reject H0 if $\chi^2$ > 3.84
$\chi^2$ = 6.151. Reject H0.

Cox Proportional Hazards Model

Model:
ln(h(t)/h0(t)) = b1X1 + b2X2 + ... + bpXp
Exp(bi) = hazard ratio
Model used to jointly assess effects of
independent variables on outcome
(time to an event).

Outcome = all-cause mortality
Age and Sex as predictors

            bi        p        HR
Age         0.11149   0.0001   1.118
Male Sex    0.67958   0.0001   1.973

Sample Size Determination

Need sample to ensure precision in
analysis
Sample size determined based on
type of planned analysis
CI
Test of hypothesis

Determining Sample Size for

Confidence Interval Estimates
Goal is to estimate an unknown
parameter using a confidence interval
estimate
Plan a study to sample individuals,
collect appropriate data and generate
CI estimate
How many individuals should we
sample?

Determining Sample Size for

Confidence Interval Estimates
Confidence intervals:
point estimate ± margin of error
Determine n to ensure small margin
of error (precision) accounting for
attrition!
Must specify desired margin of error,
confidence level and variability of
parameter
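
As a worked illustration, the sample size for estimating a mean follows from solving margin of error = $Z\sigma/\sqrt{n}$ for n, giving $n = (Z\sigma/E)^2$. This standard formula is not stated on the slides, and the inputs below are assumptions: $\sigma$ = 9.5 minutes is borrowed from the ED waiting-time example, the desired margin of error is 1 minute, and attrition is 10%. `n_for_mean_ci` is a helper name chosen here.

```python
import math

def n_for_mean_ci(sigma, margin, z=1.96, attrition=0.0):
    """Participants to enroll so a CI for a mean has the desired margin of error."""
    n_needed = math.ceil((z * sigma / margin) ** 2)   # n = (Z * sigma / E)^2
    # Inflate enrollment so that n_needed remain after expected attrition.
    return math.ceil(n_needed / (1 - attrition))

n_complete = n_for_mean_ci(sigma=9.5, margin=1.0)
n_enroll = n_for_mean_ci(sigma=9.5, margin=1.0, attrition=0.10)
print(n_complete, n_enroll)
```

Halving the margin of error roughly quadruples the required sample size, because n depends on the square of $1/E$.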

Determining Sample Size for

Hypothesis Testing
How many participants are needed to
ensure that there is a high probability of
rejecting H0 when it is really false?
Determine n to ensure high power
(usually 80% or 90%) accounting for
attrition!
Must specify desired power, and effect
size (difference in parameter under H0
versus H1)

Determining Sample Size for

Hypothesis Testing
$\beta$ and Power are related to the sample
size, level of significance ($\alpha$) and the
effect size (difference in parameter of
interest under H0 versus H1)
Power is higher with larger $\alpha$
Power is higher with larger effect size
Power is higher with larger effect size
Power is higher with larger sample size
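
The three relationships can be seen numerically with an approximate power formula for a two-sided, one-sample Z test: power ≈ $\Phi(|\delta|\sqrt{n}/\sigma - z_{1-\alpha/2})$. This approximation is a standard result that is not on the slides, and it ignores the negligible chance of rejecting in the wrong direction. All numeric inputs below are hypothetical, and `approx_power` is a helper name chosen here.

```python
import math
from statistics import NormalDist

def approx_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample Z test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(effect) * math.sqrt(n) / sigma
    return NormalDist().cdf(shift - z_crit)

base = approx_power(effect=5, sigma=10, n=25)
larger_alpha = approx_power(effect=5, sigma=10, n=25, alpha=0.10)
larger_effect = approx_power(effect=7, sigma=10, n=25)
larger_n = approx_power(effect=5, sigma=10, n=50)

print(round(base, 3), round(larger_alpha, 3),
      round(larger_effect, 3), round(larger_n, 3))
```

Each of the last three values exceeds the first, matching the three statements on the slide.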

Sample Size Determination

Critical
Ethical
Sometimes difficult