
Review of Sessions 1-7

PUBH 614
Spring 2019
Types of variables

Random Assignment
vs. Random Sampling

Shape of a Distribution: Modality
Does the histogram have a single prominent peak (unimodal), several prominent
peaks (bimodal/multimodal), or no apparent peaks (uniform)?

Note: In order to determine modality, step back and imagine a smooth curve over
the histogram -- imagine that the bars are wooden blocks and you drop a limp
spaghetti noodle over them; the shape the noodle takes can be viewed as a
smooth curve.
Shape of a Distribution: Skewness
Is the histogram right skewed, left skewed, or symmetric?

Note: Histograms are said to be skewed to the side of the long tail.
Shape of a Distribution: Box Plot
Normal Distribution
● Unimodal and symmetric, bell-shaped curve.
● Notation: X ~ N(µ, σ) means the random variable, X, is normally
distributed with mean µ and standard deviation σ

The mean, µ, and standard deviation, σ, are examples of population parameters.

Note that there are infinitely many Normal distributions, depending on the
parameters µ and σ.
Standardizing with Z scores
Any random variable defined as X ~ N(µ, σ) can be standardized so
that only one normal distribution is needed for obtaining
probabilities (percentiles).
1. First, convert to standardized scores, or Z scores, such that:

   Z = (X − µ) / σ

2. Then obtain probabilities from the standard normal distribution,
   defined by Z ~ N(0, 1).

● Note that µ = 0 and σ = 1 for the standard normal distribution
Normal distribution: the same distribution before and after standardizing

Consider adult systolic blood pressure (SBP), known to be normally
distributed with µ = 120 and σ = 20.

[Figure: the original distribution, X ~ N(120, 20), and the same
distribution standardized, Z ~ N(0, 1).]
• 1. What percentage of adults have SBP less than 100?

P(X < 100) = P(Z < (100 − 120)/20) = P(Z < −1) = .1587

In Excel, either standardize first:
  =STANDARDIZE(100, 120, 20) → z = −1
  =NORM.S.DIST(-1, TRUE) → Pr(X < 100) = 0.1587
or work on the original scale directly:
  =NORM.DIST(100, 120, 20, TRUE) → Pr(X < 100) = 0.1587
15.9% of adults have systolic blood pressure less than 100
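The same calculation can be reproduced outside of Excel; as a sketch, Python's standard library provides `statistics.NormalDist`, which plays the role of NORM.DIST here:

```python
from statistics import NormalDist

# SBP ~ N(120, 20), as given on the slide
sbp = NormalDist(mu=120, sigma=20)

z = (100 - 120) / 20   # standardize, like =STANDARDIZE(100,120,20)
p = sbp.cdf(100)       # like =NORM.DIST(100,120,20,TRUE)

print(z)               # -1.0
print(round(p, 4))     # 0.1587
```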

2. What percentage of adults have SBP greater than 100?
• P(X > 100) = 1 − P(X < 100)
• We know that P(X < 100) = .1587 from before
• P(X > 100) = 1 − .1587 = .8413

In Excel:
  =1-NORM.S.DIST(-1, TRUE) → Pr(X > 100) = 0.8413
or
  =1-NORM.DIST(100, 120, 20, TRUE) → Pr(X > 100) = 0.8413

84.1% of adults have systolic blood pressure greater than 100

• 3. What percentage of adults have SBP greater than 133?
P(X > 133) = 1 − P(X < 133)
P(Z < (133 − 120)/20) = P(Z < .65) = .7422
P(X > 133) = 1 − .7422 = .2578

In Excel:
  =STANDARDIZE(133, 120, 20) → z = 0.65
  =1-NORM.S.DIST(0.65, TRUE) → Pr(X > 133) = 0.2578
or
  =1-NORM.DIST(133, 120, 20, TRUE) → Pr(X > 133) = 0.2578

25.8% of adults have systolic blood pressure greater than 133

• 4. What percentage of adults have SBP between 100 and 133?

P(100 < X < 133) = P(X < 133) − P(X < 100)
                 = P(Z < .65) − P(Z < −1) = .7422 − .1587 = .5835

Using STANDARDIZE and NORM.S.DIST:

  X     Z       P(Z < z)
  100   −1      0.1587
  133   0.65    0.7422        Pr(−1 < Z < .65) = 0.5835

or using NORM.DIST:

  X     P(X < x)
  100   0.1587
  133   0.7422                P(100 < X < 133) = 0.5835

58.4% of adults have systolic blood pressure between 100 and 133
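As a sketch, the between-probability can also be checked with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

sbp = NormalDist(mu=120, sigma=20)

# P(100 < X < 133) = P(X < 133) - P(X < 100)
p_between = sbp.cdf(133) - sbp.cdf(100)
print(round(p_between, 4))   # 0.5835
```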

Finding cutoff points

• Since Z = (X − µ)/σ, then X = µ + Zσ.

Therefore, just as we can find P(Z < z) or P(Z > z), we can also find z,
and therefore x, given P(Z < z).
Finding cutoff points
Body temperatures of healthy humans are distributed nearly normally with mean
98.2°F and standard deviation 0.73°F. What is the cutoff for the lowest 3% of human
body temperatures?

Mackowiak, Wasserman, and Levine (1992), A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the
Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich.
Practice
Body temperatures of healthy humans are distributed nearly normally with mean
98.2°F and standard deviation 0.73°F. What is the cutoff for the highest 10% of human
body temperatures?
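Cutoff questions run the earlier probability calculations in reverse: we invert the normal CDF. A sketch using `statistics.NormalDist.inv_cdf` (the numeric answers below are my own computation, not from the slides):

```python
from statistics import NormalDist

temp = NormalDist(mu=98.2, sigma=0.73)

low_cutoff  = temp.inv_cdf(0.03)   # lowest 3% of body temperatures
high_cutoff = temp.inv_cdf(0.90)   # highest 10% starts at the 90th percentile

print(round(low_cutoff, 1))        # 96.8
print(round(high_cutoff, 1))       # 99.1
```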
68-95-99.7 Rule
For nearly normally distributed data,
● about 68% falls within 1 SD of the mean,
● about 95% falls within 2 SD of the mean,
● about 99.7% falls within 3 SD of the mean.
It is possible for observations to fall 4, 5, or more standard deviations away from the mean,
but these occurrences are very rare if the data are nearly normal.
Contingency Tables
A table that summarizes data for two categorical variables is called a contingency
table.

The contingency table below shows the distribution of students' genders and
whether or not they are looking for a spouse while in college.
Binomial distribution
The binomial probability distribution is given by

  P(K = k) = (n choose k) p^k (1 − p)^(n−k),  where (n choose k) = n! / (k!(n − k)!)

and p represents the probability of success, (1 − p) represents the
probability of failure,
n represents the number of independent trials, and
k represents the number of successes.
Practice
A 2012 Gallup survey suggests that 26.2% of Americans are obese.
Among a random sample of 10 Americans, what is the probability
that exactly 8 are obese?
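As a sketch, this practice problem can be computed directly from the binomial formula with Python's `math.comb`:

```python
from math import comb

n, p, k = 10, 0.262, 8   # sample size, P(obese), number of successes

prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(round(prob, 5))    # 0.00054 -- exactly 8 obese out of 10 is very unlikely
```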
Expected value
A 2012 Gallup survey suggests that 26.2% of Americans are obese.
Among a random sample of 100 Americans, how many would you
expect to be obese?
● Easy enough, 100 x 0.262 = 26.2.
● Or more formally, µ = np = 100 x 0.262 = 26.2.
● But this doesn't mean in every random sample of 100 people
exactly 26.2 will be obese. In fact, that's not even possible. In some
samples this value will be less, and in others more. How much
would we expect this value to vary?
Expected value and its variability
Mean and standard deviation of a binomial distribution:

  µ = np        σ = √(np(1 − p))

Going back to the obesity rate:

  µ = 100 × 0.262 = 26.2        σ = √(100 × 0.262 × 0.738) ≈ 4.4

We would expect 26.2 out of 100 randomly sampled Americans to be obese,
with a standard deviation of 4.4.
_________
Note: The mean and standard deviation of a binomial might not always be whole
numbers, and that is alright; these values represent what we would expect to see
on average.
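A minimal check of the two formulas above:

```python
from math import sqrt

n, p = 100, 0.262
mu = n * p                       # expected number of successes
sigma = sqrt(n * p * (1 - p))    # SD of the number of successes

print(round(mu, 1))     # 26.2
print(round(sigma, 1))  # 4.4
```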
Distributions of number
of successes
Hollow histograms of samples from the binomial model where p =
0.10 and n = 10, 30, 100, and 300. What happens as n increases?
The binomial distribution may be approximated by the normal distribution.

This is generally true under the following conditions:

  np ≥ 10 and n(1 − p) ≥ 10
Practice
Below are four pairs of binomial distribution parameters. Which
distribution can be approximated by the normal distribution?
A. n = 100, p = 0.95
B. n = 25, p = 0.45
C. n = 150, p = 0.05
D. n = 500, p = 0.015
Practice
A study found that approximately 25% of Facebook users are considered
power users. The same study found that the average Facebook user has 245
friends. What is the probability that the average Facebook user with 245
friends has 70 or more friends who would be considered power users?

Note:
We are given that n = 245, p = 0.25, and we are asked for the
probability P(K ≥70). Assuming independence, we can proceed
as follows …
P(K ≥ 70)
= P(K = 70 or K = 71 or K = 72 or … or K = 245)
= P(K = 70) + P(K = 71) + P(K = 72) + … + P(K = 245)

In other words, “an awful lot of work” ...


Normal approximation
to the binomial
Practice
What is the probability that the average Facebook user with 245
friends has 70 or more friends who would be considered power
users?
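Rather than summing 176 binomial terms by hand, a short script can do the exact sum and compare it with the normal approximation. This is a sketch; the continuity correction and the resulting numbers are my own computation, not from the slides:

```python
from math import comb
from statistics import NormalDist

n, p = 245, 0.25

# Exact: P(K >= 70) = sum of binomial probabilities for k = 70..245
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(70, n + 1))

# Normal approximation with a continuity correction:
# K is approximately N(np, sqrt(np(1-p))), so P(K >= 70) ~ P(X > 69.5)
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5
approx = 1 - NormalDist(mu, sigma).cdf(69.5)

print(round(exact, 3), round(approx, 3))  # the two agree to roughly two decimals
```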
The sampling distribution of a mean (SDM)
- Illustrated here through a simulation experiment

• Population
  N = 10,000
  positively skewed (see plot)
  μ = 173
  σ = 30
• Take repeated SRSs, each of n = 10
• Calculate the mean from each sample
• Plot these means
• This distribution simulates a sampling distribution of the mean (SDM)
• Finding 1 (central limit theorem)
The sampling distribution of x-bar
tends toward normality even when
the population distribution is not
normal. This effect becomes
stronger as the sample gets larger.
• Finding 2 (unbiasedness)
The expected value of x-bar = the
expected value of x = the
population mean μ.
• Finding 3 (standard error)
The standard deviation of x-bar, called the standard error, is

  σx̄ = σ / √n
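The simulation experiment described above can be sketched in a few lines of Python; the lognormal population is my own stand-in for the skewed population on the slide, with parameters chosen to land near μ = 173 and σ = 30:

```python
import random
from statistics import mean, stdev

random.seed(614)

# A positively skewed population; the lognormal parameters are an
# assumption for illustration, tuned so mean ~ 173 and SD ~ 30
population = [random.lognormvariate(5.14, 0.17) for _ in range(10_000)]
mu, sigma = mean(population), stdev(population)

# Repeated SRSs of n = 10; keep each sample mean
n = 10
sample_means = [mean(random.sample(population, n)) for _ in range(5_000)]

# Finding 2 (unbiasedness): the mean of the sample means is close to mu
# Finding 3 (standard error): their spread is close to sigma / sqrt(n)
print(round(mean(sample_means), 1), round(mu, 1))
print(round(stdev(sample_means), 1), round(sigma / n ** 0.5, 1))
```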
Sampling Behavior of Counts and Proportions
• Recall binomials: the count of successes (X) follows a binomial
  distribution, X ~ b(n, p)

• We can re-express the count as a proportion by dividing it by n: p̂ = X/n

• The sampling distribution of X ~ b(10, 0.2) is shown to the right
Normal Approximation to the Binomial

• When n is large, the binomial distribution takes on properties of the
  normal distribution.

• Recall the "rule-of-thumb" that the normal approximation applies to the
  binomial when np ≥ 10 and nq ≥ 10, where q = 1 − p.
Normal Approximation for a Binomial Sample Count

  µ = np   and   σ = √(npq)

  X ~ N(np, √(npq))

when the normal approximation applies.
Normal Approximation for a Binomial Sample Proportion

  µ = p   and   σ = √(pq/n)

  p̂ ~ N(p, √(pq/n))

when the normal approximation applies.
For our example of X~b(100,0.2)
• The objective of estimation is to predict the value of a
parameter
• There are two forms of estimation:
  • Point estimation: e.g., x̄ is an unbiased estimator of μ
  • Interval estimation: surround the point estimate with a
    margin of error to create a confidence interval:

    point estimate ± (z* × SE)
Reasoning Behind the Confidence Interval
Hypothesis Testing
• Also called significance testing

• Tests claims about parameters

• Four-step testing procedure:


1. State the null and alternative hypotheses.
2. Identify and calculate an appropriate test statistic.
3. Convert the test statistic to a P-value.
4. Compare the p-value to an accepted level of
significance (α) for deciding which hypothesis to reject.
Null and Alternative Hypotheses
• The null hypothesis (H0) is a claim of "no difference"

• The alternative hypothesis (Ha) is a claim of "difference"

• The idea: Seek evidence against the claim of H0 as a way of
  bolstering Ha
Decision errors
There are two competing hypotheses: the null and the
alternative. In a hypothesis test, we make a decision about which
might be true, but our choice might be incorrect.

● A Type 1 Error is rejecting the null hypothesis when H0 is true.


● A Type 2 Error is failing to reject the null hypothesis when HA is
true.
Type 1 error
● The significance level, α, is the probability of making a Type 1 error,
P(reject H0 | H0 true). The p-value is compared to α, and a decision is
made such that:
  reject H0 in favor of Ha if p-value < α

● In scientific research, a general decision rule is:

  P ≥ 0.10: insignificant evidence against H0
  0.05 ≤ P < 0.10: marginally significant evidence against H0
  0.01 ≤ P < 0.05: significant evidence against H0
  P < 0.01: highly significant evidence against H0
Type II Error

• Probability of failing to reject a false H0, designated as β (beta)

• Power = (1- β )
• = Probability that a statistical test will be able to
detect a true difference
• Commonly set to 80% for designing a study
One-Sample t Test
A. Hypotheses:
H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided)
or Ha: µ < µ0 (left-sided) or Ha: µ > µ0 (right-sided)
B. Test statistic: tstat = (x̄ − µ0) / SEx̄, where SEx̄ = s/√n,
   with df = n − 1

C. P-value: Use a table or software to convert tstat to tail areas (P-value)

D. Significance level: Small P → strong evidence against H0.


Confidence Interval for µ

  (1 − α)100% CI for µ = x̄ ± t(n−1, 1−α/2) × s/√n

Typical "point estimate ± margin of error" formula
t(n−1, 1−α/2) is from the t table
Similar to the z procedure except it uses s instead of σ
Similar to the z procedure except it uses t instead of z
Alternative formula:

  x̄ ± t(n−1, 1−α/2) × SEx̄, where SEx̄ = s/√n
Paired Samples
• One set of observational units (i.e. people) with two
sets of measurements, such as “before and after a
treatment”
• Also called "matched-pair" and "dependent" samples

• Contrasted with independent samples, where measurements are obtained
  for two separate sets of observational units (to be covered later…)
Significance test
Inference is directed toward the mean difference, µd (mean DELTA)

  tstat = (x̄d − 0) / (sd/√n) = (0.38083 − 0) / (0.4335/√12) = 3.043
  df = n − 1 = 12 − 1 = 11

H0: µd = 0 vs. Ha: µd ≠ 0

tstat = 3.04 with 11 df
Two-sided P-value between .01 and .02
Evidence against H0 is significant
95% Confidence Interval for µd
A t procedure directed toward the DELTA variable
calculates the confidence interval for the mean difference.

  (1 − α)100% CI for µd = x̄d ± t(n−1, 1−α/2) × sd/√n

"Oat bran" data:

For 95% confidence use t(12−1, 1−.05/2) = t(11, .975) = 2.201 (from Table C)

  95% CI for µd = 0.3808 ± 2.201 × 0.4335/√12
                = 0.3808 ± 0.2754
                = (0.105 to 0.656)
Independent Samples

• Two separate groups are compared to each other

• No matching or pairing
Hypothesis Test for Two Independent Samples
Hypotheses. H0: μ1 = μ2
against Ha: μ1 ≠ μ2 (two-sided)
or Ha: μ1 > μ2 (right-sided) or Ha: μ1 < μ2 (left-sided)

Test statistic.

  tstat = (x̄1 − x̄2) / SE(x̄1−x̄2), where SE(x̄1−x̄2) = √(s1²/n1 + s2²/n2)

Obtain degrees of freedom.

P-value. Convert the tstat to a P-value with a t table or software.

Significance level. Use the P-value in the usual manner as a measure of
evidence against H0.
Comparison of CI formulas

  (point estimate) ± (t*)(SE)

Type of sample   Point estimate   df for t*                     SE
single           x̄                n − 1                         s/√n
paired           x̄d               nd − 1                        sd/√nd
independent      x̄1 − x̄2          smaller of n1 − 1 or n2 − 1   √(s1²/n1 + s2²/n2)
Recap
● Calculate required sample size for a desired level of power
● Calculate power for a range of sample sizes, then choose the sample
size that yields the target power (usually 80% or 90%)
Achieving desired power
There are several ways to increase power (and hence decrease type 2 error rate):
1. Increase the sample size
2. Decrease the standard deviation of the sample, which essentially has the
same effect as increasing the sample size (it will decrease the standard
error). With a smaller s we have a better chance of distinguishing the null
value from the observed point estimate. This is difficult to ensure, but a
careful measurement process and limiting the population so that it is more
homogeneous may help.
3. Increase α, which will make it more likely to reject H0 (but note that this has
the side effect of increasing the Type 1 error rate).
4. Consider a larger effect size. If the true mean of the population is in the
alternative hypothesis but close to the null value, it will be harder to detect
a difference.
2-by-2 Cross-Tabulation

          Successes   Failures   Total
Group 1   a1          b1         n1
Group 2   a2          b2         n2
Total     m1          m2         nT

Absolute risk in each group:  p̂1 = a1/n1  and  p̂2 = a2/n2

Risk difference (absolute risk reduction, group 1 relative to group 2):

  p̂1 − p̂2 = a1/n1 − a2/n2

Relative risk:  RR = p̂1 / p̂2
Odds Ratios (alternative to relative risk)

          Successes   Failures   Total
Group 1   a1          b1         n1
Group 2   a2          b2         n2
Total     m1          m2         N

• The odds for a group is the ratio of successes to failures:

  oi = ai / bi

• The odds ratio is the odds of one group relative to another group.
  For example, considering data from a 2x2 table:

  ÔR = (a1/b1) / (a2/b2) = a1b2 / (a2b1)
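As a sketch, here are the three measures computed for a hypothetical 2x2 table (the counts below are made up for illustration):

```python
# Hypothetical 2x2 table: 15/100 successes in group 1, 5/100 in group 2
a1, b1 = 15, 85
a2, b2 = 5, 95

p1, p2 = a1 / (a1 + b1), a2 / (a2 + b2)

risk_diff  = p1 - p2                  # absolute risk reduction
rel_risk   = p1 / p2                  # relative risk
odds_ratio = (a1 / b1) / (a2 / b2)    # = a1*b2 / (a2*b1)

print(round(risk_diff, 2))    # 0.1
print(round(rel_risk, 1))     # 3.0
print(round(odds_ratio, 2))   # 3.35
```

Note that the odds ratio (3.35) overstates the relative risk (3.0) here, as it always does when the outcome is not rare.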
One-sample test
Recall from session 3: Central limit theorem for proportions
Sample proportions will be nearly normally distributed with mean equal to the
population proportion, p, and standard error equal to

  SE = √(p(1 − p) / n)

● if independent observations, and at least 10 successes and 10 failures

_________
Note: If p is unknown (most cases), we use p̂ for calculating the standard error.
Two-sample test
• Hypotheses. H0: p1 = p2 against Ha: p1 ≠ p2
  [Or one-sided alternatives: Ha: p1 > p2 or Ha: p1 < p2]
• Test statistic.

  zstat = (p̂1 − p̂2) / SE, where SE = √( p̄q̄ (1/n1 + 1/n2) ),

  p̄ = (no. of successes, both groups combined) / (total observations,
  both groups combined), and q̄ = 1 − p̄

• P-value. Convert zstat to a P-value


Chi-Square Test of Association
A. Hypotheses
H0: no association between row and column variables
Ha: the null hypothesis is false

B. Test statistic

  χ²stat = Σ (over all cells) (Oi − Ei)² / Ei, with df = (R − 1)(C − 1)

  where Oi = observed count in cell i,
        Ei = expected count in cell i = (row total × column total) / table total

C. P-value. Use a Chi-Square table or a computer to find the P-value.
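The test statistic can be sketched for a hypothetical 2x2 table (the observed counts are made up for illustration):

```python
# Hypothetical 2x2 table of observed counts (rows = groups, cols = outcomes)
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count in each cell: (row total * column total) / table total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, df)   # 4.0 1
```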


Pearson Correlation Coefficient
Ranges between -1 and 1 (standardized measure = unit free)
the closer to -1, the stronger the negative linear relationship
the closer to 1, the stronger the positive linear relationship
the closer to 0, the weaker the linear relationship

  r = (1 / (n − 1)) Σ ((x − x̄)/sx)((y − ȳ)/sy)

Figures adapted from OpenIntro Statistics, 3rd Edition
Regression analysis
• Regression analysis expresses a functional relationship between an
  outcome variable y and explanatory variables x (continuous and/or
  categorical variables)
• Simple linear regression (one explanatory variable)
• Multiple linear regression (two or more explanatory variables)

Uses the method of least squares to find the line that best describes the
association, assuming a linear relationship between the dependent and
independent variable(s)
Simple linear regression model
• The population model: Y = β0 + β1x + ε
  • Y - dependent variable
  • β0 - population y intercept
  • β1 - population slope coefficient
  • ε - random error, or residual
• The sample regression line provides an estimate of the population
  regression line: ŷ = b0 + b1x
  • ŷ = estimated (or predicted) y value (dependent variable)
  • b0 = estimated intercept = average y when the x value is zero
  • b1 = estimated slope = change in the average y value for a
    one-unit difference in x
Estimating the slope and intercept

• Given n pairs of (x, y):

  b1 = r (sy / sx)   and   b0 = ȳ − b1 x̄

where r = the Pearson correlation coefficient, sy/sx = the ratio of the
standard deviations of y and x, and ȳ and x̄
are the sample means of y and x.
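A from-scratch sketch of both the correlation coefficient and the least-squares slope and intercept formulas above (the toy data are made up for illustration):

```python
from statistics import mean, stdev

# Toy data, made up for illustration: y is roughly 2x + 1 with noise
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 9.0, 10.8]

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# Pearson correlation: r = 1/(n-1) * sum of z-score products
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (n - 1)

# Least-squares estimates: b1 = r * sy/sx, b0 = ybar - b1*xbar
b1 = r * sy / sx
b0 = ybar - b1 * xbar

print(round(r, 3), round(b1, 2), round(b0, 2))   # 0.999 1.95 1.15
```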
Two primary applications of Regression Modelling:

Prediction vs. Explanation

• Predict the value of a dependent variable based on the value


of an independent variable (prediction models)
• Identify groups at risk for a particular outcome (e.g. health
condition)
• The independent contribution of each risk factor is of interest

• Explain the impact of changes in an independent variable on


the dependent variable (explanatory models)
• Assess causality
• Typically based on theory
• Issues of confounding and selection bias
Hypothesis testing about a Population Slope β1
• Hypotheses and significance level:

  H0: β1 = 0    Ha: β1 ≠ 0

• Test Statistic: tstat = b1 / SE(b1), with df = n − 2

• Reject H0 if P-value < α = 0.05
Multivariable Regression models

Readily extend to two or more predictor variables


For example … Y = β0 + β1X1 + β2X2 + e,
where e is the residual difference between what is observed and what is predicted by
the linear model.
Why Multivariable Regression models?

• Can assess the relationship between a single continuous


outcome and several explanatory variables
• Can account for confounding
• Can determine whether there is effect modification
• Notes:
• Easy to implement using statistical software BUT
should reflect plausible relationships
• The sample size n should be at least 10 times the
number of explanatory variables.

Confounding overview

• Confounding variable is:


• Related to the exposure or risk factor
• Related to the outcome
• Is not in the causal pathway we are interested in

• Example: BMI and CVD, confounded by age?


Can any of the observed association between obesity and
CVD be attributable to age?
Logistic Regression
• When outcome variable is binary outcome (success or failure)

• the straight-line model is usually inadequate

• a more realistic model has a curved S-shape

• success is typically coded as 1 and failure is coded as 0

• Understand the interpretation of logistic regression output
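The curved S-shape referred to above is the logistic function; a minimal sketch, with coefficients that are arbitrary and for illustration only:

```python
from math import exp

def logistic(x, b0=0.0, b1=1.0):
    """P(success) under a logistic model; b0, b1 are illustrative, not fitted."""
    return 1 / (1 + exp(-(b0 + b1 * x)))

# The curve stays strictly between 0 and 1, rising through 0.5 at x = -b0/b1
print(logistic(0))                                     # 0.5
print(round(logistic(4), 3), round(logistic(-4), 3))   # 0.982 0.018
```

This keeps predicted probabilities inside (0, 1), which is why the straight-line model is usually inadequate for a binary outcome.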

