CPH Exam Review
Biostatistics
Lisa Sullivan, PhD
Associate Dean for Education
Professor and Chair, Department of Biostatistics
Boston University School of Public Health
Outline and Goals
Overview of Biostatistics (Core Area)
Terminology and Definitions
Practice Questions
An archived version of this review, along with the PPT file, will
be available on the NBPHE website (www.nbphe.org) under
Study Resources
Biostatistics
Two Areas of Applied Biostatistics:
Descriptive Statistics
Summarize a sample selected from a
population
Inferential Statistics
Make inferences about population
parameters based on sample statistics.
Variable Types
Dichotomous variables have 2 possible
responses (e.g., Yes/No)
Ordinal and categorical variables have
more than two responses and responses
are ordered and unordered, respectively
Continuous (or measurement) variables
assume in theory any values between a
theoretical minimum and maximum
We want to study whether individuals over 45
years are at greater risk of diabetes than those
younger than 45. What kind of variable is age?
1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous
We are interested in assessing disparities in
infant morbidity by race/ethnicity. What
kind of variable is race/ethnicity?
1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous
Numerical Summaries of Dichotomous,
Categorical and Ordinal Variables
Frequency Distribution Table
Health Status   Freq.   Rel. Freq.   Cumulative Freq.   Cumulative Rel. Freq.
Excellent       19      38%          19                 38%
Very Good       12      24%          31                 62%
Good             9      18%          40                 80%
Fair             6      12%          46                 92%
Poor             4       8%          50                100%
Total           n=50    100%
Ordinal variables only
Frequency Bar Chart
Relative Frequency Histogram
Continuous Variables
Assume, in theory, any value between
a theoretical minimum and maximum
Quantitative, measurement variables
Example: systolic blood pressure
Standard summary: n = 75, $\bar{X}$ = 123.6, s = 19.4
Second sample: n = 75, $\bar{X}$ = 128.1, s = 6.4
Summarizing Location and
Variability
When there are no outliers, the sample
mean and standard deviation
summarize location and variability
When there are outliers, the median
and interquartile range (IQR)
summarize location and variability,
where IQR = Q3 - Q1
Outliers: values < Q1 - 1.5 IQR or > Q3 + 1.5 IQR
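As a minimal sketch of the outlier rule above, the following Python snippet computes Q1, Q3, the IQR and the two fences. The data values are hypothetical, and the quartile convention (the default "exclusive" method of `statistics.quantiles`) is an assumption; other texts and packages compute quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical systolic blood pressures (illustration only, not from the slides)
sbp = [102, 110, 115, 118, 121, 124, 126, 130, 135, 190]

# Quartiles using the library's default "exclusive" method (an assumption)
q1, q2, q3 = quantiles(sbp, n=4)
iqr = q3 - q1

# Fences: values beyond these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sbp if x < lower_fence or x > upper_fence]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("Outliers:", outliers)
```

With an outlier present, the median and IQR would be the preferred summaries of location and variability for this sample.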
Mean Vs. Median
Box and Whisker Plot
Min
Q1
Median
Q3
Max
Comparing Samples with
Box and Whisker Plots
[Figure: side-by-side box-and-whisker plots for two samples; axis: Systolic Blood Pressure, 100 to 160]
What type of display is shown
below?
[Figure: bar chart titled "Percent Patients by Disease Stage"; vertical axis: % (0 to 35); horizontal axis: stages I, II, III, IV]
1. Frequency bar chart
2. Relative frequency bar chart
3. Frequency histogram
4. Relative frequency histogram
The distribution of SBP in men, 20-29 years,
is shown below. What is the best summary
of a typical value?
1. Mean
2. Median
3. Interquartile range
4. Standard Deviation
When data are skewed, the mean
is higher than the median.
1. True
2. False
The best summary of variability for the
following continuous variable is
1. Mean
2. Median
3. Interquartile range
4. Standard Deviation
Numerical and Graphical
Summaries
Dichotomous and categorical
Frequencies and relative frequencies
Bar charts (freq. or relative freq.)
Ordinal
Frequencies, relative frequencies,
cumulative frequencies and cumulative
relative frequencies
Histograms (freq. or relative freq.)
Continuous
n, $\bar{X}$ and s, or median and IQR (if outliers)
Box-and-whisker plot
What is the probability of selecting a
male with optimal blood pressure?
Blood Pressure Category
          Optimal   Normal   PreHtn   Htn   Total
Male      20        15       15       30    80
Female     5        15       25       25    70
Total     25        30       40       55    150
1. 20/25
2. 20/80
3. 20/150
What is the probability of selecting a
patient with PreHtn or Htn?
Blood Pressure Category
          Optimal   Normal   PreHtn   Htn   Total
Male      20        15       15       30    80
Female     5        15       25       25    70
Total     25        30       40       55    150
1. 95/150
2. 45/80
3. 55/150
What proportion of men have
prevalent CVD?
          CVD   Free of CVD
Men       35    265
Women     45    355
1. 35/80
2. 35/265
3. 35/300
What proportion of patients with
CVD are men ?
          CVD   Free of CVD
Men       35    265
Women     45    355
1. 35/700
2. 35/80
3. 80/300
Are Family History and Current
Status Independent?
Example. Consider the following table which cross
classifies subjects by their family history of CVD and
current (prevalent) CVD status.
                    Current CVD
Family History      No      Yes
No                  215     25
Yes                 90      15
P(Current CVD | Family Hx) = 15/105 = 0.143
P(Current CVD | No Family Hx) = 25/240 = 0.104
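The two conditional probabilities above can be reproduced with a short Python sketch. It uses the counts from the family history table; the variable names are illustrative. Independence would require the conditional probabilities to equal the overall (marginal) probability of current CVD.

```python
# Counts from the family history by current CVD table
cvd_fh, no_cvd_fh = 15, 90          # family history = Yes
cvd_no_fh, no_cvd_no_fh = 25, 215   # family history = No

# Conditional probabilities of current CVD within each family history group
p_cvd_given_fh = cvd_fh / (cvd_fh + no_cvd_fh)               # 15/105
p_cvd_given_no_fh = cvd_no_fh / (cvd_no_fh + no_cvd_no_fh)   # 25/240

# Marginal probability of current CVD in the whole sample
total = cvd_fh + no_cvd_fh + cvd_no_fh + no_cvd_no_fh
p_cvd = (cvd_fh + cvd_no_fh) / total                         # 40/345

print(round(p_cvd_given_fh, 3), round(p_cvd_given_no_fh, 3), round(p_cvd, 3))
```

In this sample the three values differ, so the sample proportions do not satisfy the independence condition exactly. Whether the difference is statistically significant is a separate question, addressed by a formal test.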
Are symptoms independent of
disease?
               Disease   No Disease   Total
Symptoms       25        225          250
No Symptoms    50        450          500
1. No
2. Yes
Probability Models
Binomial Distribution
Two possible outcomes: success and
failure
Replications of process are independent
P(success) is constant for each
replication
$P(x) = \dfrac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$
Mean = np, variance = np(1 - p)
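A small Python sketch of the binomial formula, with hypothetical values n = 10 and p = 0.2 chosen only for illustration. It also checks numerically that the mean and variance of the distribution match np and np(1 - p).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x successes in n independent replications, each with P(success) = p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example values (not from the slides)
n, p = 10, 0.2
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution
mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))

print("P(X = 2) =", round(binomial_pmf(2, n, p), 4))
print("mean =", round(mean, 4), "variance =", round(variance, 4))
```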
Probability Models
Poisson Distribution
Two possible outcomes: success and
failure
Replications of process are independent
Often used to model counts (often used
to model rare events)
$P(x) = \dfrac{e^{-\mu}\mu^{x}}{x!}$
Mean = $\mu$, variance = $\mu$
Probability Models
Normal Distribution
Model for continuous outcome
Mean=median=mode
Normal Distribution
Properties of Normal Distribution
i) The normal distribution is symmetric about the
mean (i.e., $P(X > \mu) = P(X < \mu) = 0.5$).
ii) The mean and variance ($\mu$ and $\sigma^2$) completely
characterize the normal distribution.
iii) The mean = the median = the mode
iv) Approximately 68% of observations fall between mean $\pm$ 1 sd,
95% between mean $\pm$ 2 sd, and >99% between
mean $\pm$ 3 sd
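The coverage figures in property iv can be checked with Python's standard library. This is only a numerical illustration using `statistics.NormalDist`; the dictionary name `within` is arbitrary.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sd 1

# Probability of falling within k standard deviations of the mean
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} sd: {prob:.4f}")
```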
Normal Distribution
Body mass index (BMI) for men age 60 is
normally distributed with a mean of 29 and
standard deviation of 6.
What is the probability that
a male has BMI < 29?
P(X<29)= 0.5
[Figure: normal curve for BMI; axis ticks at 11, 17, 23, 29, 35, 41, 47]
Normal Distribution
What is the probability that a male has BMI less than
30?
P(X<30)=?
[Figure: normal curve for BMI; axis ticks at 11, 17, 23, 29, 35, 41, 47]
Standard Normal Distribution Z
Normal distribution with $\mu = 0$ and $\sigma = 1$
[Figure: standard normal curve; axis in standard deviation units from -3 to 3]
Normal Distribution
$Z = \dfrac{x - \mu}{\sigma} = \dfrac{30 - 29}{6} = 0.17$
P(X<30)= P(Z<0.17) = 0.5675
From a table of standard normal
probabilities or statistical
computing package.
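As an illustration of the "statistical computing package" route, the sketch below uses Python's `statistics.NormalDist`. It shows both the table-style calculation (Z rounded to 0.17) and the unrounded calculation, which differ slightly in the third decimal place.

```python
from statistics import NormalDist

# BMI for men age 60: mean 29, standard deviation 6 (from the example)
bmi = NormalDist(mu=29, sigma=6)

z = (30 - 29) / 6                           # standardized value, about 0.17
p_table = NormalDist().cdf(round(z, 2))     # table lookup with Z rounded to 0.17
p_exact = bmi.cdf(30)                       # no rounding of Z

print("Z =", round(z, 2))
print("P(Z < 0.17) =", round(p_table, 4))
print("P(X < 30), unrounded Z =", round(p_exact, 4))
```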
Comparing Systolic Blood
Pressure (SBP)
Suppose for Males Age 50, SBP is
approximately normally distributed with a
mean of 108 and a standard deviation of 14.
Suppose for Females Age 50, SBP is
approximately normally distributed with a
mean of 100 and a standard deviation of 8.
If a Male Age 50 has a SBP = 140 and a
Female Age 50 has a SBP = 120, who has the
relatively higher SBP ?
Normal Distribution
$Z_M = (140 - 108) / 14 = 2.29$
$Z_F = (120 - 100) / 8 = 2.50$
Which is more extreme?
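The comparison can be sketched in Python. The helper `z_score` is a hypothetical name introduced here; the means and standard deviations are those given in the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (or below) its group mean."""
    return (x - mean) / sd

z_male = z_score(140, mean=108, sd=14)
z_female = z_score(120, mean=100, sd=8)

print("Z (male)   =", round(z_male, 2))
print("Z (female) =", round(z_female, 2))

# The larger Z identifies the value that is higher relative to its own group
more_extreme = "female" if z_female > z_male else "male"
print("Relatively higher SBP:", more_extreme)
```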
Percentiles of the Normal
Distribution
The kth percentile is defined as the score that
holds k percent of the scores below it.
E.g., the 90th percentile is the score that holds
90% of the scores below it.
Q1 = 25th percentile, median = 50th percentile,
Q3 = 75th percentile
Percentiles
For the normal distribution, the following is used
to compute percentiles:
$X = \mu + Z\sigma$
where
$\mu$ = mean of the random variable X,
$\sigma$ = standard deviation, and
Z = value from the standard normal distribution
for the desired percentile (e.g., 95th, Z=1.645).
95th percentile of BMI for Men: 29+1.645(6) = 38.9
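A brief Python sketch of the percentile calculation, assuming the BMI distribution from the earlier example (mean 29, standard deviation 6). `NormalDist().inv_cdf` returns the Z value for a given percentile, so the tabled value 1.645 does not need to be looked up.

```python
from statistics import NormalDist

mu, sigma = 29, 6                     # BMI for men age 60 (from the example)

z95 = NormalDist().inv_cdf(0.95)      # Z for the 95th percentile, about 1.645
p95_bmi = mu + z95 * sigma            # X = mu + Z * sigma

print("Z =", round(z95, 3))
print("95th percentile of BMI =", round(p95_bmi, 1))
```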
Central Limit Theorem
(Non-normal) population with mean $\mu$ and standard deviation $\sigma$
Take samples of size n; as long as n is
sufficiently large (usually n > 30 suffices),
the distribution of the sample mean is
approximately normal, so Z can be used
to compute probabilities:
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
where $\sigma/\sqrt{n}$ is the standard error
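The following simulation is a rough sketch of the Central Limit Theorem, not part of the original slides. It assumes a skewed (exponential) population with mean 1 and standard deviation 1, a sample size of 40, and 5000 repeated samples; all of these are arbitrary illustrative choices. The spread of the simulated sample means should be close to the standard error.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is reproducible
n, reps = 40, 5000       # sample size and number of repeated samples (assumed)

# Population: exponential with mean 1 and sd 1 (skewed, clearly non-normal)
sample_means = [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

observed_se = stdev(sample_means)      # spread of the simulated sample means
theoretical_se = 1.0 / n ** 0.5        # sigma / sqrt(n)

print("mean of sample means:", round(mean(sample_means), 3))
print("observed SE:", round(observed_se, 3), "theoretical SE:", round(theoretical_se, 3))
```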
Statistical Inference
There are two broad areas of statistical
inference, estimation and hypothesis testing.
Estimation. Population parameter is unknown,
sample statistics are used to generate estimates.
Hypothesis Testing. A statement is made about
parameter, sample statistics support or refute
statement.
What Analysis To Do When
Nature of primary outcome variable
Continuous, dichotomous, categorical,
time to event
Number of comparison groups
One, 2 independent, 2 matched or
paired, > 2
Associations between variables
Regression analysis
Estimation
Process of determining likely values for
unknown population parameter
Point estimate is best single-valued
estimate for parameter
Confidence interval is range of values for
parameter:
point estimate $\pm$ margin of error
point estimate $\pm$ t $\times$ SE(point estimate)
Hypothesis Testing Procedures
1. Set up null and research
hypotheses, select $\alpha$ (level of significance)
2. Select test statistic
3. Set up decision rule
4. Compute test statistic
5. Draw conclusion & summarize
significance (p-value)
P-values
P-values represent the exact
significance of the data
Estimate p-values when rejecting H0
to summarize significance of the data
(approximate with statistical tables,
exact value with computing package)
If p < $\alpha$ then reject H0
Errors in Hypothesis Tests
                 Conclusion of Statistical Test
                 Do Not Reject H0    Reject H0
H0 true          Correct             Type I error
H0 false         Type II error       Correct
Continuous Outcome
Confidence Interval for $\mu$
Continuous outcome - 1 Sample
n > 30: $\bar{X} \pm Z \dfrac{s}{\sqrt{n}}$
n < 30: $\bar{X} \pm t \dfrac{s}{\sqrt{n}}$
Example.
95% CI for mean waiting time at ED
Data: n = 100, $\bar{X}$ = 37.85 and s = 9.5 mins
$37.85 \pm 1.96 \dfrac{9.5}{\sqrt{100}}$
$37.85 \pm 1.86$
(35.99 to 39.71)
Statistical computing packages use t throughout.
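The waiting-time interval can be reproduced with a few lines of Python. This sketch uses Z = 1.96 as in the example; as noted, a statistical package would typically use the t distribution instead, which gives a slightly wider interval.

```python
from math import sqrt

# Summary statistics from the ED waiting-time example
n, x_bar, s = 100, 37.85, 9.5
z = 1.96                              # Z value for 95% confidence

standard_error = s / sqrt(n)
margin = z * standard_error
ci = (x_bar - margin, x_bar + margin)

print("margin of error =", round(margin, 2))
print("95% CI =", (round(ci[0], 2), round(ci[1], 2)))
```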
New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
One study sample
Data
On each participant, measure outcome
(yes/no)
n, x = # positive responses, $\hat{p} = x/n$
Dichotomous Outcome
Confidence Interval for p
Dichotomous outcome - 1 Sample
$\hat{p} \pm Z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
valid when $\min[n\hat{p},\, n(1-\hat{p})] \ge 5$;
otherwise, exact procedures
Example.
In the Framingham Offspring
Study (n=3532), 1219 patients
were on antihypertensive
medications. Generate 95% CI.
$0.345 \pm 1.96\sqrt{\dfrac{0.345(1 - 0.345)}{3532}}$
$0.345 \pm 0.016$
(0.329, 0.361)
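The Framingham Offspring example can be checked with the short Python sketch below. It also verifies the sample-size condition before using the normal approximation; variable names are illustrative.

```python
from math import sqrt

# Framingham Offspring example: 1219 of 3532 on antihypertensive medication
n, x = 3532, 1219
p_hat = x / n

# Condition for the large-sample (Z) interval
large_sample_ok = min(n * p_hat, n * (1 - p_hat)) >= 5

z = 1.96                              # Z value for 95% confidence
margin = z * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print("p-hat =", round(p_hat, 3), "margin =", round(margin, 3))
print("95% CI =", (round(ci[0], 3), round(ci[1], 3)))
```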
One Sample Procedures Comparisons
with Historical/External Control
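Continuous
H0: $\mu = \mu_0$
H1: $\mu > \mu_0$, $\mu < \mu_0$, $\mu \ne \mu_0$
n > 30: $Z = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$
n < 30: $t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Dichotomous
H0: $p = p_0$
H1: $p > p_0$, $p < p_0$, $p \ne p_0$
$Z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
valid when $\min[np_0,\, n(1 - p_0)] \ge 5$;
otherwise, exact procedures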
One Sample Procedures Comparisons
with Historical/External Control
Categorical or Ordinal outcome
$\chi^2$ Goodness of fit test
H0: $p_1 = p_{10}$, $p_2 = p_{20}$, ..., $p_k = p_{k0}$
H1: H0 is false
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two independent study samples
Data
On each participant, identify group and
measure outcome
$n_1, \bar{X}_1, s_1^2$ (or $s_1$), $n_2, \bar{X}_2, s_2^2$ (or $s_2$)
Two Independent Samples
Cohort Study: Set of Subjects Who
Meet Study Inclusion Criteria
Group 1
Group 2
Mean Group 1 Mean Group 2
Two Independent Samples
RCT: Set of Subjects Who Meet
Study Eligibility Criteria
Randomize
Treatment 1
Mean Trt 1
Treatment 2
Mean Trt 2
Continuous Outcome
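Confidence Interval for ($\mu_1 - \mu_2$)
Continuous outcome - 2 Independent Samples
n1 > 30 and n2 > 30: $(\bar{X}_1 - \bar{X}_2) \pm Z S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
n1 < 30 or n2 < 30: $(\bar{X}_1 - \bar{X}_2) \pm t S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
$S_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$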
Hypothesis Testing for ($\mu_1 - \mu_2$)
Continuous outcome
2 Independent Samples
H0: $\mu_1 = \mu_2$
($\mu_1 - \mu_2 = 0$)
H1: $\mu_1 > \mu_2$, $\mu_1 < \mu_2$, $\mu_1 \ne \mu_2$
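Hypothesis Testing for ($\mu_1 - \mu_2$)
Test Statistic
n1 > 30 and n2 > 30: $Z = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
n1 < 30 or n2 < 30: $t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$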
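A minimal Python sketch of the pooled two-sample test statistic. The summary statistics are hypothetical values made up for illustration, and the function name `pooled_test_statistic` is not from the slides. With both samples under 30, the result would be compared with a t critical value on n1 + n2 - 2 degrees of freedom.

```python
from math import sqrt

def pooled_test_statistic(n1, mean1, s1, n2, mean2, s2):
    """Return (Sp, test statistic) for two independent samples, pooled variance."""
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    statistic = (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))
    return sp, statistic

# Hypothetical summary statistics for two small groups (illustration only)
sp, t_stat = pooled_test_statistic(n1=15, mean1=120, s1=10, n2=15, mean2=112, s2=12)
df = 15 + 15 - 2

print("Sp =", round(sp, 2), "t =", round(t_stat, 2), "df =", df)
```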
An RCT is planned to show the efficacy of
a new drug vs. placebo to lower total
cholesterol.
What are the hypotheses?
1. H0: $\mu_P = \mu_N$  H1: $\mu_P > \mu_N$
2. H0: $\mu_P = \mu_N$  H1: $\mu_P < \mu_N$
3. H0: $\mu_P = \mu_N$  H1: $\mu_P \ne \mu_N$
New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
Two independent study samples
Data
On each participant, identify group and
measure outcome (yes/no)
$n_1, \hat{p}_1, n_2, \hat{p}_2$
Dichotomous Outcome
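Confidence Interval for ($p_1 - p_2$)
Dichotomous outcome - 2 Independent Samples
$(\hat{p}_1 - \hat{p}_2) \pm Z\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
valid when $\min[n_1\hat{p}_1,\, n_1(1 - \hat{p}_1),\, n_2\hat{p}_2,\, n_2(1 - \hat{p}_2)] \ge 5$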
Measures of Effect for
Dichotomous Outcomes
Outcome = dichotomous (Y/N or 0/1)
Risk=proportion of successes = x/n
Odds = ratio of successes to failures = x/(n - x)
Measures of Effect for
Dichotomous Outcomes
Risk Difference = $\hat{p}_1 - \hat{p}_2$
Relative Risk = $\hat{p}_1 / \hat{p}_2$
Odds Ratio = $\dfrac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}$
Confidence Intervals for Relative
Risk (RR)
Dichotomous outcome
2 Independent Samples
$\ln(\widehat{RR}) \pm Z\sqrt{\dfrac{(n_1 - x_1)/x_1}{n_1} + \dfrac{(n_2 - x_2)/x_2}{n_2}}$
exp(lower limit), exp(upper limit)
Confidence Intervals for Odds Ratio
(OR)
Dichotomous outcome
2 Independent Samples
$\ln(\widehat{OR}) \pm Z\sqrt{\dfrac{1}{x_1} + \dfrac{1}{n_1 - x_1} + \dfrac{1}{x_2} + \dfrac{1}{n_2 - x_2}}$
exp(lower limit), exp(upper limit)
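The relative risk and odds ratio intervals can be sketched in Python. As an illustration, this uses the prevalent CVD counts by sex from the earlier practice question (35 of 300 men, 45 of 400 women), treating men as group 1 and women as group 2; that pairing is a choice made here, not one made in the slides.

```python
from math import exp, log, sqrt

# Prevalent CVD by sex: x = number with CVD, n = group size
x1, n1 = 35, 300    # men (group 1)
x2, n2 = 45, 400    # women (group 2)
z = 1.96            # Z value for 95% confidence

# Relative risk and its CI, computed on the log scale and then exponentiated
rr = (x1 / n1) / (x2 / n2)
se_ln_rr = sqrt(((n1 - x1) / x1) / n1 + ((n2 - x2) / x2) / n2)
rr_ci = (exp(log(rr) - z * se_ln_rr), exp(log(rr) + z * se_ln_rr))

# Odds ratio and its CI, same approach
odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
se_ln_or = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
or_ci = (exp(log(odds_ratio) - z * se_ln_or), exp(log(odds_ratio) + z * se_ln_or))

print("RR =", round(rr, 3), "95% CI =", tuple(round(v, 3) for v in rr_ci))
print("OR =", round(odds_ratio, 3), "95% CI =", tuple(round(v, 3) for v in or_ci))
```

Both intervals are asymmetric around the point estimate because they are built on the log scale.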
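Hypothesis Testing for ($p_1 - p_2$)
Dichotomous outcome
2 Independent Samples
H0: $p_1 = p_2$
H1: $p_1 > p_2$, $p_1 < p_2$, $p_1 \ne p_2$
Test Statistic
$Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
where $\hat{p} = (x_1 + x_2)/(n_1 + n_2)$ is the overall proportion of successes;
valid when $\min[n_1\hat{p}_1,\, n_1(1 - \hat{p}_1),\, n_2\hat{p}_2,\, n_2(1 - \hat{p}_2)] \ge 5$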
Two (Independent) Group
Comparisons
Difference in birth
weight is -106 g,
95% CI for difference
in mean birth weight:
(-175.3 to -36.7)
New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two matched study samples
Data
On each participant, measure outcome
under each experimental condition
Compute differences ($D = X_1 - X_2$)
$n, \bar{X}_d, s_d$
Two Dependent/Matched Samples
Subject ID
1
2
.
.
Measure 1
55
70
42
60
Measure 2
Measures taken serially in time or under
different experimental conditions
Crossover Trial
Treatment
Treatment
Eligible
R
Participants
Placebo
Placebo
Each participant measured on Treatment and placebo
Confidence Intervals for $\mu_d$
Continuous outcome
2 Matched/Paired Samples
n > 30: $\bar{X}_d \pm Z \dfrac{s_d}{\sqrt{n}}$
n < 30: $\bar{X}_d \pm t \dfrac{s_d}{\sqrt{n}}$
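Hypothesis Testing for $\mu_d$
Continuous outcome
2 Matched/Paired Samples
H0: $\mu_d = 0$
H1: $\mu_d > 0$, $\mu_d < 0$, $\mu_d \ne 0$
Test Statistic
n > 30: $Z = \dfrac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$
n < 30: $t = \dfrac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$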
Independent Vs Matched Design
Statistical Significance versus
Effect Size
P-value summarizes significance
Confidence intervals give magnitude
of effect
(If null value is included in CI, then
no statistical significance)
The null value of a difference in
means is
1. 0
2. 0.5
3. 1
4. 2
The null value of a mean difference
is
1. 0
2. 0.5
3. 1
4. 2
The null value of a relative risk is
1. 0
2. 0.5
3. 1
4. 2
The null value of a difference in
proportions is
1. 0
2. 0.5
3. 1
4. 2
The null value of an odds ratio is
1. 0
2. 0.5
3. 1
4. 2
A two sided test for the equality of
means produces p=0.20. Reject H0?
1. Yes
2. No
3. Maybe
Hypothesis Testing for More than 2
Means - Analysis of Variance
Continuous outcome
k Independent Samples, k > 2
H0: $\mu_1 = \mu_2 = \dots = \mu_k$
H1: Means are not all equal
Test Statistic
$F = \dfrac{\sum n_j(\bar{X}_j - \bar{X})^2/(k - 1)}{\sum\sum (X - \bar{X}_j)^2/(N - k)}$
F is the ratio of between-group variation to within-group variation (error)
ANOVA Table
Source of Variation    Sums of Squares                            df      Mean Squares         F
Between Treatments     $SSB = \sum n_j(\bar{X}_j - \bar{X})^2$    k - 1   MSB = SSB/(k - 1)    F = MSB/MSE
Error                  $SSE = \sum\sum (X - \bar{X}_j)^2$         N - k   MSE = SSE/(N - k)
Total                  $SST = \sum\sum (X - \bar{X})^2$           N - 1
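The ANOVA table entries can be computed step by step, as in this Python sketch. The three groups of observations are hypothetical and deliberately tiny so the arithmetic is easy to follow by hand.

```python
# Hypothetical data: k = 3 independent groups (illustration only)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N
group_means = [sum(g) / len(g) for g in groups]

# Between-treatments and error sums of squares
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

# Mean squares and F statistic
msb = ssb / (k - 1)
mse = sse / (N - k)
f_stat = msb / mse

print("SSB =", round(ssb, 3), "SSE =", round(sse, 3), "SST =", round(sst, 3))
print("F =", round(f_stat, 3), "with df =", (k - 1, N - k))
```

The total sum of squares equals SSB + SSE, which is a useful check on hand calculations.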
ANOVA
When the sample sizes are equal, the
design is said to be balanced
Balanced designs give greatest power
and are more robust to violations of
the normality assumption
Extensions
Multiple Comparison Procedures
Used to test for specific differences in
means after rejecting equality of all
means (e.g., Tukey, Scheffe)
Higher-Order ANOVA: tests for
differences in means as a function of
several factors
Extensions
Repeated Measures ANOVA: tests for
differences in means when there are
multiple measurements in the same
participants (e.g., measures taken
serially in time)
$\chi^2$ Test of Independence
Dichotomous, ordinal or categorical outcome
2 or More Samples
H0: The distribution of the outcome is
independent of the groups
H1: H0 is false
Test Statistic
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
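As a sketch of how the test statistic is computed, the Python code below applies it to the family history by current CVD counts given earlier in this review (215, 25, 90, 15). Expected counts are row total times column total divided by the overall total. The comparison value 3.84 is the 5% critical value for df = 1.

```python
# Observed counts: rows = family history (No, Yes), columns = current CVD (No, Yes)
observed = [[215, 25],
            [90, 15]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_h0 = chi_square > 3.84   # critical value for df = 1, alpha = 0.05

print("chi-square =", round(chi_square, 3), "df =", df, "reject H0:", reject_h0)
```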
$\chi^2$ Test of Independence
Data organization (r by c table)
              Outcome
Group      20%   40%   40%
           50%   25%   25%
           90%    5%    5%
Is the distribution of the outcome different
across (associated with) groups?
What Tests Were Used?
In Framingham Heart Study, we want to
assess risk factors for Impaired Glucose
Outcome = Glucose Category
Diabetes (glucose > 126),
Impaired Fasting Glucose (glucose 100-125),
Normal Glucose
Risk Factors
Sex
Age
BMI (normal weight, overweight, obese)
Genetics
What test would be used to assess whether
sex is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether
age is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether
BMI is associated with Glucose Category?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
In Framingham Heart Study, we want to
assess risk factors for Glucose Level
Consider a Secondary Outcome =
Fasting Glucose Level
Risk Factors
Sex
Age
BMI (normal weight, overweight, obese)
Genetics
What test would be used to assess whether
sex is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether
BMI is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess whether
age is associated with Glucose Level?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
In Framingham Heart Study, we want to
assess risk factors for Diabetes
Consider a Tertiary Outcome =
Diabetes Vs No Diabetes
Risk Factors
Sex
Age
BMI (normal weight, overweight, obese)
Genetics
What test would be used to assess
whether sex is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess
whether BMI is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
What test would be used to assess
whether age is associated with Diabetes?
1. ANOVA
2. Chi-Square GOF
3. Chi-Square test of independence
4. Test for equality of means
5. Other
Correlation
Correlation (r) measures the nature
and strength of linear association
between two variables at a time
Regression: the equation that best
describes the relationship between
variables
What is the most likely value of r
for the data shown below?
[Figure: scatter plot; vertical axis: Y]
1. r = -0.5
2. r = 0
3. r = 0.5
4. r = 1
What is the most likely value of r
for the data shown below?
[Figure: scatter plot; vertical axis: Y]
1. r = -0.5
2. r = 0
3. r = 0.5
4. r = 1
Simple Linear Regression
Y = Dependent, Outcome variable
X = Independent, Predictor variable
$\hat{y} = b_0 + b_1 x$
b0 is the Y-intercept, b1 is the slope
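A minimal Python sketch of how the intercept, slope and correlation coefficient are estimated by least squares. The five (x, y) pairs are hypothetical, and the helper name `predict` is introduced only for this illustration.

```python
from math import sqrt

# Hypothetical (x, y) pairs, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of squares and cross-products about the means
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b1 = sxy / sxx               # slope
b0 = y_bar - b1 * x_bar      # Y-intercept
r = sxy / sqrt(sxx * syy)    # correlation coefficient

def predict(x_new):
    """Predicted outcome from the fitted line y-hat = b0 + b1 * x."""
    return b0 + b1 * x_new

print("b0 =", round(b0, 2), "b1 =", round(b1, 2), "r =", round(r, 3))
print("predicted y at x = 6:", round(predict(6), 2))
```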
Simple Linear Regression
Assumptions
Linear relationship between X and Y
Independence of errors
Homoscedasticity (constant variance) of
the errors
Normality of errors
Multiple Linear Regression
Useful when we want to jointly
examine the effect of several X
variables on the outcome Y variable.
Y = continuous outcome variable
X1, X2, ..., Xp = set of independent or
predictor variables
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Multiple Regression Analysis
Model is conditional, parameter
estimates are conditioned on other
variables in model
Perform overall test of regression
If significant, examine individual
predictors
Relative importance of predictors by p-values (or standardized coefficients)
Multiple Regression Analysis
Predictors can be continuous,
indicator variables (0/1) or a set of
dummy variables
Dummy variables (for categorical
predictors)
Race: white, black, Hispanic
Black (1 if black, 0 otherwise)
Hispanic (1 if Hispanic, 0 otherwise)
Definitions
Confounding: the distortion of the
effect of a risk factor on an outcome
Effect Modification: a different
relationship between the risk factor
and an outcome depending on the
level of another variable
Multiple Regression for SBP:
Comparison of Parameter Estimates
                Simple Models          Multiple Regression
                b        p             b        p
Age             1.03     <.0001        0.86     <.0001
Male            2.26     .0009         2.22     .0002
BMI             1.80     <.0001        1.48     <.0001
BP Meds         33.38    <.0001        24.12    <.0001
Focus on the association between BP meds and SBP
RCT of New Drug to Raise HDL
Example of Effect Modification
Women       N     Mean    Std Dev
New drug    40    38.88   3.97
Placebo     41    39.24   4.21
Men         N     Mean    Std Dev
New drug    10    45.25   1.89
Placebo           39.06   2.22
Simple Logistic Regression
Outcome is dichotomous (binary)
We model the probability p of having
the disease.
$p = \dfrac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}$
$\text{logit}(p) = \ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 x$
Multiple Logistic Regression
Outcome is dichotomous (1=event,
0=nonevent) and p=P(event)
Outcome is modeled as log odds
$\ln\left(\dfrac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$
Multiple Logistic Regression for
Birth Defect (Y/N)
Predictor   b       p        OR (95% CI for OR)
Intercept   1.099   0.0994
Smoke       1.062   0.2973   2.89 (0.34, 22.51)
Age         0.298   0.0420   1.35 (1.02, 1.78)
Interpretation of OR for age:
The odds of having a birth defect for the older of two
mothers differing in age by one year is estimated to
be 1.35 times higher after adjusting for smoking.
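The odds ratios in the table are the exponentiated regression coefficients, which the following Python sketch verifies using the coefficients shown for Smoke and Age. The dictionary names are illustrative.

```python
from math import exp

# Regression coefficients (b) from the birth defect example
coefficients = {"Smoke": 1.062, "Age": 0.298}

# Odds ratio for a one-unit increase in each predictor: OR = exp(b)
odds_ratios = {name: exp(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    print(name, "OR =", round(value, 2))
```

Because the coefficient is on the log-odds scale, a one-year age difference multiplies the odds by exp(0.298), holding smoking fixed.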
Survival Analysis
Outcome is the time to an event.
An event could be a heart attack,
cancer remission or death.
Measure whether person has event or not
(Yes/No) and if so, their time to event.
Determine factors associated with longer
survival.
Survival Analysis
Incomplete follow-up information
Censoring
Measure follow-up time and not time to
event
We know survival time > follow-up time
Log rank test to compare survival in
two or more independent groups
Survival Curve Survival Function
Comparing Survival Curves
H0: Two survival curves are equal
$\chi^2$ test with df = 1. Reject H0 if $\chi^2$ > 3.84
$\chi^2$ = 6.151. Reject H0.
Cox Proportional Hazards Model
Model:
$\ln\left(h(t)/h_0(t)\right) = b_1X_1 + b_2X_2 + \dots + b_pX_p$
Exp(bi) = hazard ratio
Model used to jointly assess effects of
independent variables on outcome
(time to an event).
Outcome = all-cause mortality
Age and Sex as predictors
            bi        p        HR
Age         0.11149   0.0001   1.118
Male Sex    0.67958   0.0001   1.973
Sample Size Determination
Need sample to ensure precision in
analysis
Sample size determined based on
type of planned analysis
CI
Test of hypothesis
Determining Sample Size for
Confidence Interval Estimates
Goal is to estimate an unknown
parameter using a confidence interval
estimate
Plan a study to sample individuals,
collect appropriate data and generate
CI estimate
How many individuals should we
sample?
Determining Sample Size for
Confidence Interval Estimates
Confidence intervals:
point estimate $\pm$ margin of error
Determine n to ensure small margin
of error (precision) accounting for
attrition!
Must specify desired margin of error,
confidence level and variability of
parameter
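As a rough sketch, the sample size for a confidence interval for a mean can be obtained by solving the margin of error E = Z$\sigma$/$\sqrt{n}$ for n, giving n = (Z$\sigma$/E)^2. This formula is standard but is not stated on the slides. The planning values below are assumptions for illustration: a standard deviation of 9.5 minutes (borrowed from the ED waiting-time example), a desired margin of error of 1 minute, and 10% attrition.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean_ci(sigma, margin, confidence=0.95):
    """Smallest n so that Z * sigma / sqrt(n) is at most the desired margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

# Assumed planning values (illustration only)
sigma, margin, attrition = 9.5, 1.0, 0.10

n_required = n_for_mean_ci(sigma, margin)          # needed with complete data
n_enrolled = ceil(n_required / (1 - attrition))    # inflate to allow for attrition

print("n required:", n_required, "n to enroll:", n_enrolled)
```

A smaller margin of error, a higher confidence level or a more variable outcome all increase the required n.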
Determining Sample Size for
Hypothesis Testing
How many participants are needed to
ensure that there is a high probability of
rejecting H0 when it is really false?
Determine n to ensure high power
(usually 80% or 90%) accounting for
attrition!
Must specify desired power, and effect
size (difference in parameter under H0
versus H1)
Determining Sample Size for
Hypothesis Testing
$\beta$ and Power (= 1 - $\beta$) are related to the sample
size, level of significance ($\alpha$) and the
effect size (difference in parameter of
interest under H0 versus H1)
Power is higher with larger $\alpha$
Power is higher with larger effect size
Power is higher with larger sample size
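The three statements about power can be illustrated numerically. The sketch below assumes a two-sided, one-sample Z test with known standard deviation, with the effect size expressed in standard deviation units; this setting and the function name `power_one_sample_z` are assumptions made here, not taken from the slides.

```python
from statistics import NormalDist

def power_one_sample_z(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample Z test.

    effect_size is (mean under H1 - mean under H0) / sigma, an assumed convention.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # Probability that the test statistic falls in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

baseline = power_one_sample_z(effect_size=0.5, n=32, alpha=0.05)
larger_alpha = power_one_sample_z(effect_size=0.5, n=32, alpha=0.10)
larger_effect = power_one_sample_z(effect_size=0.8, n=32, alpha=0.05)
larger_n = power_one_sample_z(effect_size=0.5, n=64, alpha=0.05)

print("baseline power:", round(baseline, 3))
print("larger alpha:", round(larger_alpha, 3))
print("larger effect size:", round(larger_effect, 3))
print("larger sample size:", round(larger_n, 3))
```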
Sample Size Determination
Critical
Ethical
Sometimes difficult