Biostatistics
An archived version of this review, along with the PPT file, will
be available on the NBPHE website (www.nbphe.org) under
Study Resources
Biostatistics
Two Areas of Applied Biostatistics:
Descriptive Statistics
Summarize a sample selected from a
population
Inferential Statistics
Make inferences about population
parameters based on sample statistics.
Variable Types
Dichotomous variables have 2 possible
responses (e.g., Yes/No)
Ordinal and categorical variables have
more than two responses and responses
are ordered and unordered, respectively
Continuous (or measurement) variables
assume in theory any values between a
theoretical minimum and maximum
1. Dichotomous
2. Ordinal
3. Categorical
4. Continuous
             Freq.   Rel. Freq.   Cumulative Freq.   Cumulative Rel. Freq.
Excellent     19        38%             19                  38%
Very Good     12        24%             31                  62%
Good           9        18%             40                  80%
Fair           6        12%             46                  92%
Poor           4         8%             50                 100%
Total        n=50      100%
Ordinal variables only
Continuous Variables
Assume, in theory, any value between
a theoretical minimum and maximum
Quantitative, measurement variables
Example: systolic blood pressure
[Figure: box-whisker plot marking Q1, median, Q3 and max on an axis from 100 to 160; histogram with a frequency axis from 0 to 20]
1. I
2. II
3. III
4. IV

1. Mean
2. Median
3. Interquartile range
4. Standard Deviation
Ordinal
Frequencies, relative frequencies,
cumulative frequencies and cumulative
relative frequencies
Histograms (freq. or relative freq.)
Continuous
n, $\bar{X}$ and s, or median and IQR (if outliers)
Box-whisker plot
                                  Total
          20    15    15    30      80
           5    15    25    25      70
Total     25    30    40    55     150

1. 20/25
2. 20/80
3. 20/150

1. 95/150
2. 45/80
3. 55/150
          CVD   Free of CVD
Men        35       265
Women      45       355

1. 35/80
2. 35/265
3. 35/300

1. 35/700
2. 35/80
3. 80/300
         No    Yes
No      215     25
Yes      90     15
               Disease   No Disease   Total
Symptoms          25        225        250
No Symptoms       50        450        500

1. No
2. Yes
Probability Models
Binomial Distribution
Two possible outcomes: success and
failure
Replications of process are independent
P(success) is constant for each
replication
$P(x) = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}$
Mean=np, variance=np(1-p)
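The binomial formula and its mean and variance can be checked numerically. The sketch below is illustrative only: the values n = 5 and p = 0.2 are hypothetical, and the function name `binomial_pmf` is not from the slides.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent replications with constant P(success) = p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical illustration: 5 replications, P(success) = 0.2
n, p = 5, 0.2
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the probabilities
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
print(mean, variance)  # should agree with np and np(1-p) up to rounding
```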
Probability Models
Poisson Distribution
Two possible outcomes: success and
failure
Replications of process are independent
Often used to model counts (often used
to model rare events)
$P(x) = \frac{e^{-\mu} \mu^x}{x!}$
Mean = $\mu$, variance = $\mu$
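A short numerical check of the Poisson formula, assuming a hypothetical mean of 2 events. The name `poisson_pmf` and the truncation of the sum at 60 terms are choices made for this sketch only.

```python
from math import exp, factorial

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson count with mean mu."""
    return exp(-mu) * mu**x / factorial(x)

mu = 2.0  # hypothetical mean count
# The tail beyond 60 is negligible for mu = 2, so the sum is truncated there
probs = [poisson_pmf(x, mu) for x in range(60)]
mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))
print(mean, variance)  # both should be very close to mu
```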
Probability Models
Normal Distribution
Model for continuous outcome
Mean=median=mode
Normal Distribution
Properties of Normal Distribution
i) The normal distribution is symmetric about the
mean (i.e., P(X > $\mu$) = P(X < $\mu$) = 0.5).
ii) The mean and variance ($\mu$ and $\sigma^2$) completely
characterize the normal distribution.
iii) The mean = the median = the mode
iv) Approximately 68% of obs between mean ± 1 sd,
95% between mean ± 2 sd, and >99% between
mean ± 3 sd
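Property iv can be verified with the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def within(k: float) -> float:
    """Probability that an observation falls within k sd of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"mean ± {k} sd: {prob:.4f}")
```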
Normal Distribution
Body mass index (BMI) for men age 60 is
normally distributed with a mean of 29 and
standard deviation of 6.
What is the probability that
a male has BMI < 29?
P(X<29)= 0.5
[Figure: normal curve for BMI with the axis marked at 11, 17, 23, 29, 35, 41 and 47]
Normal Distribution
What is the probability that a male has BMI less than
30?
P(X<30)=?
[Figure: normal curve for BMI with the axis marked at 11, 17, 23, 29, 35, 41 and 47, and the corresponding Z scale]
Normal Distribution
$Z = \frac{x - \mu}{\sigma} = \frac{30 - 29}{6} = 0.17$
P(X<30)= P(Z<0.17) = 0.5675
From a table of standard normal
probabilities or statistical
computing package.
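As one example of the computing-package route, the BMI probability can be obtained with Python's standard library. Because the package does not round Z to 0.17 first, its answer (about 0.566) differs slightly from the table value of 0.5675.

```python
from statistics import NormalDist

# BMI for men age 60: mean 29, standard deviation 6 (values from the slide)
bmi = NormalDist(mu=29, sigma=6)

z = (30 - 29) / 6            # unrounded Z score
p_below_30 = bmi.cdf(30)     # P(X < 30)
p_below_29 = bmi.cdf(29)     # P(X < 29), the mean
print(round(z, 2), round(p_below_30, 4), p_below_29)
```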
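For men, SBP is normally distributed with a mean of 108 and a standard deviation of 14
For women, SBP is normally distributed with a mean of 100 and a standard deviation of 8
Compare a man with SBP of 140 and a woman with SBP of 120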
Normal Distribution
ZM = (140 - 108) / 14 = 2.29
ZF = (120 - 100) / 8 = 2.50
Which is more extreme?
Percentiles
For the normal distribution, the following is used
to compute percentiles:
$X = \mu + Z\sigma$
where
$\mu$ = mean of the random variable X,
$\sigma$ = standard deviation, and
Z = value from the standard normal distribution
for the desired percentile (e.g., 95th, Z=1.645).
95th percentile of BMI for Men: 29+1.645(6) = 38.9
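The percentile calculation can be reproduced without a table. In this sketch the Z value is obtained from the inverse of the standard normal distribution rather than looked up; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 29, 6                    # BMI for men, from the slide
z95 = NormalDist().inv_cdf(0.95)     # Z for the 95th percentile
bmi_95th = mu + z95 * sigma          # X = mu + Z * sigma
print(round(z95, 3), round(bmi_95th, 1))
```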
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, where $\sigma/\sqrt{n}$ is the standard error
Statistical Inference
There are two broad areas of statistical
inference, estimation and hypothesis testing.
Estimation. Population parameter is unknown,
sample statistics are used to generate estimates.
Hypothesis Testing. A statement is made about
parameter, sample statistics support or refute
statement.
Estimation
Process of determining likely values for
unknown population parameter
Point estimate is best single-valued
estimate for parameter
Confidence interval is range of values for
parameter:
point estimate ± margin of error
point estimate ± t × SE(point estimate)
P-values
P-values represent the exact
significance of the data
Estimate p-values when rejecting H0
to summarize significance of the data
(approximate with statistical tables,
exact value with computing package)
If p < $\alpha$ then reject H0
             Do not reject H0    Reject H0
H0 true      Correct             Type I error
H0 false     Type II error       Correct
Continuous Outcome
Confidence Interval for $\mu$
Continuous outcome - 1 Sample
n > 30: $\bar{X} \pm Z \frac{s}{\sqrt{n}}$
n < 30: $\bar{X} \pm t \frac{s}{\sqrt{n}}$
Example.
95% CI for mean waiting time at ED
Data: n=100, $\bar{X}$=37.85 and s=9.5 mins
$37.85 \pm 1.96 \frac{9.5}{\sqrt{100}}$
$37.85 \pm 1.86$
(35.99 to 39.71)
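The waiting-time interval can be reproduced in a few lines. This is a sketch for the large-sample (Z) case only; the function name `mean_ci_z` is illustrative and not from the slides.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_z(xbar: float, s: float, n: int, level: float = 0.95):
    """Large-sample confidence interval for a mean: xbar ± Z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin

# ED waiting time example: n=100, mean 37.85, s=9.5 minutes
lower, upper = mean_ci_z(37.85, 9.5, 100)
print(round(lower, 2), round(upper, 2))
```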
New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
One study sample
Data
On each participant, measure outcome
(yes/no)
n, x = # positive responses, $\hat{p} = x/n$
Dichotomous Outcome
Confidence Interval for p
Dichotomous outcome - 1 Sample
$\hat{p} \pm Z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
when $\min[n\hat{p}, n(1 - \hat{p})] \ge 5$;
otherwise, exact procedures
Example.
In the Framingham Offspring
Study (n=3532), 1219 patients
were on antihypertensive
medications. Generate 95% CI.
$\hat{p} = 1219/3532 = 0.345$
$0.345 \pm 1.96 \sqrt{\frac{0.345(1 - 0.345)}{3532}}$
$0.345 \pm 0.016$
(0.329, 0.361)
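The Framingham interval can be checked with a small function. This sketch implements only the large-sample formula and raises an error when the min[np, n(1-p)] condition fails; the name `proportion_ci` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x: int, n: int, level: float = 0.95):
    """Large-sample CI for a proportion; needs min[n*p, n*(1-p)] >= 5."""
    p_hat = x / n
    if min(n * p_hat, n * (1 - p_hat)) < 5:
        raise ValueError("sample too small; use an exact procedure")
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

# Framingham Offspring example: 1219 of 3532 on antihypertensive medication
p_hat, lower, upper = proportion_ci(1219, 3532)
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```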
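Continuous
H0: $\mu = \mu_0$
H1: $\mu > \mu_0$, $\mu < \mu_0$, or $\mu \ne \mu_0$
n>30: $Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
n<30: $t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$

Dichotomous
H0: $p = p_0$
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$, when $\min[np_0, n(1 - p_0)] \ge 5$

$\chi^2 = \sum \frac{(O - E)^2}{E}$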
New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two independent study samples
Data
On each participant, identify group and
measure outcome
$n_1, \bar{X}_1, s_1^2$ (or $s_1$), $n_2, \bar{X}_2, s_2^2$ (or $s_2$)
[Diagram: two independent samples. Either Group 1 versus Group 2 (compare Mean Group 1 with Mean Group 2), or Treatment 1 versus Treatment 2 (compare Mean Trt 1 with Mean Trt 2)]
Continuous Outcome
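Confidence Interval for ($\mu_1 - \mu_2$)
Continuous outcome - 2 Independent Samples
n1>30 and n2>30: $(\bar{X}_1 - \bar{X}_2) \pm Z S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
n1<30 or n2<30: $(\bar{X}_1 - \bar{X}_2) \pm t S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
where $S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$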
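H0: $\mu_1 = \mu_2$ ($\mu_1 - \mu_2 = 0$)
H1: $\mu_1 > \mu_2$, $\mu_1 < \mu_2$, or $\mu_1 \ne \mu_2$
n1>30 and n2>30: $Z = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
n1<30 or n2<30: $t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$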
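The pooled standard deviation and the two-sample statistic can be sketched as follows. For illustration the code borrows the drug-versus-placebo summary statistics that appear later in these slides (n=40, mean 38.88, sd 3.97 versus n=41, mean 39.24, sd 4.21). The slides do not say which outcome those numbers describe, so treat this as an arithmetic example rather than an analysis. The function names are illustrative.

```python
from math import sqrt

def pooled_sd(n1: int, s1: float, n2: int, s2: float) -> float:
    """Sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def two_sample_statistic(n1, xbar1, s1, n2, xbar2, s2) -> float:
    """(X1bar - X2bar) / (Sp * sqrt(1/n1 + 1/n2)); read as Z or t by sample size."""
    sp = pooled_sd(n1, s1, n2, s2)
    return (xbar1 - xbar2) / (sp * sqrt(1 / n1 + 1 / n2))

# Summary statistics borrowed from the new drug vs placebo table in these slides
sp = pooled_sd(40, 3.97, 41, 4.21)
stat = two_sample_statistic(40, 38.88, 3.97, 41, 39.24, 4.21)
print(round(sp, 3), round(stat, 3))
```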
New Scenario
Outcome is dichotomous
Result of surgery (success, failure)
Cancer remission (yes/no)
Two independent study samples
Data
On each participant, identify group and
measure outcome (yes/no)
$n_1, \hat{p}_1, n_2, \hat{p}_2$
Dichotomous Outcome
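Confidence Interval for ($p_1 - p_2$)
Dichotomous outcome - 2 Independent Samples
when $\min[n_1\hat{p}_1, n_1(1 - \hat{p}_1), n_2\hat{p}_2, n_2(1 - \hat{p}_2)] \ge 5$:
$(\hat{p}_1 - \hat{p}_2) \pm Z \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$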
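$\widehat{RR} = \frac{\hat{p}_1}{\hat{p}_2}$, $\widehat{OR} = \frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}$
$\ln(\widehat{RR}) \pm Z \sqrt{\frac{(n_1 - x_1)/x_1}{n_1} + \frac{(n_2 - x_2)/x_2}{n_2}}$
CI for RR: exp(lower limit), exp(upper limit)
$\ln(\widehat{OR}) \pm Z \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}$
CI for OR: exp(lower limit), exp(upper limit)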
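$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p} = (x_1 + x_2)/(n_1 + n_2)$ is the overall proportion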
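The relative risk and odds ratio intervals can be sketched on the log scale as described. The counts below come from the CVD-by-sex table earlier in these slides, read here as 35 of 300 men and 45 of 400 women with CVD; that reading of the table is an assumption. Function names are illustrative.

```python
from math import exp, log, sqrt

Z_95 = 1.96  # standard normal value for 95% confidence

def risk_ratio_ci(x1, n1, x2, n2, z=Z_95):
    """RR = p1/p2 with a CI built on ln(RR), then exponentiated."""
    rr = (x1 / n1) / (x2 / n2)
    se = sqrt(((n1 - x1) / x1) / n1 + ((n2 - x2) / x2) / n2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

def odds_ratio_ci(x1, n1, x2, n2, z=Z_95):
    """OR with a CI built on ln(OR), then exponentiated."""
    odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
    se = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)

# Men: 35 of 300 with CVD; women: 45 of 400 with CVD (assumed reading of the table)
rr, rr_low, rr_high = risk_ratio_ci(35, 300, 45, 400)
odds, or_low, or_high = odds_ratio_ci(35, 300, 45, 400)
print(round(rr, 3), round(rr_low, 3), round(rr_high, 3))
print(round(odds, 3), round(or_low, 3), round(or_high, 3))
```

Both intervals include 1 for these counts, which is what one would expect when the two proportions are nearly equal.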
New Scenario
Outcome is continuous
SBP, Weight, cholesterol
Two matched study samples
Data
On each participant, measure outcome
under each experimental condition
Compute differences (D=X1-X2)
$n, \bar{X}_d, s_d$
Measure 1    55    70    42    60
Measure 2
Crossover Trial
[Diagram: eligible participants are randomized (R) to Treatment followed by Placebo, or to Placebo followed by Treatment]
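n > 30: $\bar{X}_d \pm Z \frac{s_d}{\sqrt{n}}$
n < 30: $\bar{X}_d \pm t \frac{s_d}{\sqrt{n}}$
n > 30: $Z = \frac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$
n < 30: $t = \frac{\bar{X}_d - \mu_d}{s_d/\sqrt{n}}$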
$F = \frac{\sum n_j (\bar{X}_j - \bar{X})^2/(k - 1)}{\sum\sum (X - \bar{X}_j)^2/(N - k)}$
ANOVA Table
Source of Variation    Sums of Squares                               df     Mean Squares       F
Between Treatments     SSB = $\sum n_j (\bar{X}_j - \bar{X})^2$      k-1    MSB = SSB/(k-1)    F = MSB/MSE
Error                  SSE = $\sum\sum (X - \bar{X}_j)^2$            N-k    MSE = SSE/(N-k)
Total                  SST = $\sum\sum (X - \bar{X})^2$              N-1
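The entries of the ANOVA table can be computed directly from the definitions. The three small groups below are hypothetical numbers chosen so the arithmetic is easy to follow; the function name `anova_table` is illustrative.

```python
def anova_table(groups):
    """Return SSB, SSE, SST and F for a one-way ANOVA on a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-treatment and error sums of squares
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

    msb = ssb / (k - 1)
    mse = sse / (n_total - k)
    return ssb, sse, sst, msb / mse

# Hypothetical data: k = 3 treatments, 3 observations each (a balanced design)
ssb, sse, sst, f_stat = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(ssb, sse, sst, f_stat)
```

The total sum of squares equals SSB + SSE, which gives a quick check on the table.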
ANOVA
When the sample sizes are equal, the
design is said to be balanced
Balanced designs give greatest power
and are more robust to violations of
the normality assumption
Extensions
Multiple Comparison Procedures
Used to test for specific differences in
means after rejecting equality of all
means (e.g., Tukey, Scheffe)
Higher-Order ANOVA - Tests for
differences in means as a function of
several factors
Extensions
Repeated Measures ANOVA - Tests for
differences in means when there are
multiple measurements in the same
participants (e.g., measures taken
serially in time)
$\chi^2$ Test of Independence
Dichotomous, ordinal or categorical outcome
2 or More Samples
H0: The distribution of the outcome is
independent of the groups
H1: H0 is false
Test Statistic
$\chi^2 = \sum \frac{(O - E)^2}{E}$
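The test statistic can be computed from an r by c table of observed counts, with each expected count taken as (row total × column total) / N. As an illustration, the sketch uses the symptoms-by-disease counts shown earlier in these slides. The function name is illustrative and no p-value is computed here.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r by c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E under H0
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: Symptoms, No Symptoms; columns: Disease, No Disease (from the earlier slide)
chi2, df = chi_square_independence([[25, 225], [50, 450]])
print(chi2, df)
```

In that table the proportion with disease is the same in both rows, so observed and expected counts coincide and the statistic is zero.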
$\chi^2$ Test of Independence
Data organization (r by c table)

Group    Outcome
         20%    40%    40%
         50%    25%    25%
         90%     5%     5%
Risk Factors
Sex
Age
BMI (normal weight, overweight, obese)
Genetics
ANOVA
Chi-Square GOF
Chi-Square test of independence
Test for equality of means
Other
Correlation
Correlation (r) measures the nature
and strength of linear association
between two variables at a time
Regression: the equation that best
describes the relationship between
variables
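The correlation coefficient can be computed directly from its definition as the covariance divided by the product of the standard deviations. The data points below are hypothetical and the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect positive and a perfect negative linear association
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```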
[Figure: scatter plot]
1. r=-0.5
2. r=0
3. r=0.5
4. r=1
$y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$
Definitions
Confounding: the distortion of the
effect of a risk factor on an outcome
by another variable
Effect Modification: a different
relationship between the risk factor
and an outcome depending on the
level of another variable
                                Multiple Regression
            b        p          b        p
Age         1.03     <.0001     0.86     <.0001
Male       -2.26     .0009     -2.22     .0002
BMI         1.80     <.0001     1.48     <.0001
BP Meds    33.38     <.0001    24.12     <.0001
             n     Mean     Std Dev
New drug     40    38.88    3.97
Placebo      41    39.24    4.21

Men
             n     Mean     Std Dev
New drug     10    45.25    1.89
Placebo            39.06    2.22
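$\hat{p} = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$
$\mathrm{logit}(\hat{p}) = \ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x$
$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$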
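The two forms of the logistic model are inverses of each other, which a short sketch can confirm. The coefficients b0 = -1.0 and b1 = 0.5 are hypothetical, as are the function names.

```python
from math import exp, log

def predicted_probability(b0: float, b1: float, x: float) -> float:
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))

def logit(p: float) -> float:
    """ln(p / (1 - p)), the log odds."""
    return log(p / (1 - p))

# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5
p = predicted_probability(b0, b1, 3.0)
print(round(p, 4), round(logit(p), 4))  # logit(p) recovers b0 + b1*x
```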
Survival Analysis
Outcome is the time to an event.
An event could be heart attack,
cancer remission or death.
Measure whether person has event or not
(Yes/No) and if so, their time to event.
Determine factors associated with longer
survival.
Survival Analysis
Incomplete follow-up information
Censoring
Measure follow-up time and not time to
event
We know survival time > follow-up time