
MED-0323: Introduction to Biostatistics

Class 14: Linear Regression II

November 15, 2019


Iván Sisa, MD, MPH, MS
isisa@usfq.edu.ec
Linear regression

Farzad Noubary, PhD


Lecture outline
• Definition of indicator variables
• Interpretation of regression coefficients for indicator variables
• Relationship between linear regression and t-test
• Relationship between linear regression and ANOVA
Goals of lecture
• At the end of this lecture, you will be able to:
– Interpret regression coefficients for indicator
variables
– Compare linear regression to a t-test and ANOVA
Previous classes
• Simple linear regression
– Continuous outcome (required), continuous
covariate
– Relationship to correlation
Dichotomous predictors
Big picture
• One of the biggest advantages of linear regression is
that we can incorporate continuous, dichotomous,
and categorical covariates/predictors/explanatory
variables
Linear regression with dichotomous
predictor
• In the previous example, we demonstrated that
there was a significant association between age and
BPF in our sample of MS patients
• We might also be interested in whether there is an effect of gender on BPF in MS patients
• To do this, we could use an indicator variable, which
equals 1 for male and 0 for female. The resulting
regression equation for BPF is
E(BPF | male) = b0 + b1 * male
BPFi = b0 + b1 * malei + εi
Graph
[Scatter plot: BPF (0.75 to 0.95) on the y-axis by gender (F, M) on the x-axis, with dashes marking the individual observations in each group]
• The regression equation can be rewritten as
BPFfemale = b0 + εi
BPFmale = b0 + b1 + εi
• What is the meaning of the coefficients in this case?
– b0 is the mean BPF when male=0
• The females
– b0 + b1 is the mean BPF when male=1
• The males
• What is the interpretation of b1?
– For a one-unit increase in the male indicator, there is a b1 increase in the mean BPF (this needs to be interpreted in terms of how the variable was coded in the dataset)
– Equivalently, b1 is the difference in mean BPF between the males and the females
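These identities can be checked numerically. A minimal pure-Python sketch with made-up BPF values (not the course dataset): with a single 0/1 indicator, the least squares intercept equals the reference-group mean and the slope equals the difference in group means.

```python
# Simple least squares with a 0/1 indicator: the intercept equals the
# mean of the reference group (female) and the slope equals the
# difference in group means. Illustrative made-up data, not the MS dataset.
bpf  = [0.84, 0.80, 0.83, 0.79, 0.88, 0.86, 0.90]
male = [0,    0,    0,    0,    1,    1,    1]

n = len(bpf)
xbar = sum(male) / n
ybar = sum(bpf) / n

# Closed-form simple regression: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(male, bpf)) / \
     sum((x - xbar) ** 2 for x in male)
b0 = ybar - b1 * xbar

mean_f = sum(y for x, y in zip(male, bpf) if x == 0) / male.count(0)
mean_m = sum(y for x, y in zip(male, bpf) if x == 1) / male.count(1)

print(round(b0, 6) == round(mean_f, 6))           # intercept = female mean
print(round(b1, 6) == round(mean_m - mean_f, 6))  # slope = difference in means
```

The same identity is why the Stata coefficient table reproduces the two sample means exactly.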
. regress bpf male

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0371364   .0203586     1.82   0.079    -.004636     .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865

The male coefficient is the estimated slope b1; _cons is the estimated intercept b0. Substituting the estimates step by step:

BPFˆ = bˆ0 + bˆ1 * male
BPFˆ = 0.823 + bˆ1 * male
BPFˆ = 0.823 + 0.037 * male

Interpretation of results
• The final regression equation is
BPFˆ = 0.823 + 0.037 * male
• The meaning of the coefficients in this case are
– 0.823 is the estimate of the mean BPF in the female
group
– 0.037 is the estimated difference in mean BPF between the males and the females
– What is the estimated mean BPF in the males?
• 0.86
• How could we test if the difference between the
groups is statistically significant?
Hypothesis test
1) H0: There is no difference based on gender (b1 =0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t=1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to
reject the null hypothesis
7) We conclude that there is not enough evidence to
suggest that there is a significant difference in the
mean BPF in males compared to females
. regress bpf male

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0371364   .0203586     1.82   0.079    -.004636     .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865

The male row contains the estimated slope b1, the p-value for H0: b1 = 0 (0.079), and its 95% confidence interval.
T-test
• As you hopefully remember, you could have
tested this same null hypothesis using a two
sample t-test
• Linear regression makes an equal variance
assumption, so let’s use the same assumption
for our t-test
. ttest bpf, by(male)

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
       1 |       7         .86    .0196457    .0519775    .8119288    .9080712
---------+--------------------------------------------------------------------
combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
---------+--------------------------------------------------------------------
    diff |           -.0371364    .0203586               -.0789087     .004636

    diff = mean(0) - mean(1)                                      t =  -1.8241
Ho: diff = 0                                     degrees of freedom =       27

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0396         Pr(|T| > |t|) = 0.0792          Pr(T > t) = 0.9604

Group 0 is the females and group 1 the males; the diff row gives the estimated group difference with its 95% confidence interval, and the middle alternative gives the two-sided p-value.

Note that the sample means and the two-sided p-value for the two-group comparison are the same as we obtained from our regression analysis.

Note that the estimated difference has the opposite sign compared to the linear regression (here diff = mean(0) - mean(1), while the regression slope is mean(1) - mean(0)), but the magnitude is the same. It is critical to remember the meaning of the sign.
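The equivalence between the pooled two-sample t-test and the regression slope test can be sketched in pure Python (illustrative made-up data, not the course dataset): the regression standard error for a 0/1 indicator is exactly the pooled two-sample standard error, so the t-statistics agree up to sign.

```python
import math

# Pooled two-sample t-test (equal variances) compared with the regression
# slope t-statistic for the same 0/1 indicator. Made-up data.
g0 = [0.84, 0.80, 0.83, 0.79]   # females (male = 0)
g1 = [0.88, 0.86, 0.90]         # males   (male = 1)

def mean(v):
    return sum(v) / len(v)

def ss(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v)

n0, n1 = len(g0), len(g1)
sp2 = (ss(g0) + ss(g1)) / (n0 + n1 - 2)       # pooled variance
se = math.sqrt(sp2 * (1 / n0 + 1 / n1))       # standard error of the difference

t_test = (mean(g0) - mean(g1)) / se           # Stata's diff = mean(0) - mean(1)
t_reg = (mean(g1) - mean(g0)) / se            # regression slope / its SE

print(round(t_test, 9) == round(-t_reg, 9))   # same magnitude, opposite sign
```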
Hypothesis test
1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t= -1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to reject the
null hypothesis
7) We conclude that there is no significant difference in the
mean BPF in males compared to females

I prefer: We conclude that there is not enough evidence to


suggest that there is a significant difference in the mean
BPF in males compared to females
Amazing!!!
• We get the same result using both
approaches!!
• Why use linear regression?
– Allows more complex model through multiple
regression

– Extensions allow non-continuous outcomes

– Allows correlation among observations


Categorical predictor
• In our discussion of ANOVA, we investigated the hypointensity of brain structures in healthy controls, benign MS, SPMS, and PPMS patients
• In this case, we had a categorical predictor and a continuous outcome
• The question was whether there were any differences between these groups
• How could we investigate this using linear regression?
Review: indicator variables
• Let’s assume that our reference group is the healthy controls
• If we were just interested in comparing the benign MS patients to the healthy controls, we would use the following regression equation
HIi = b0 + b1 * BMSi + εi
– BMSi = 0 for healthy controls
– BMSi = 1 for benign MS patients
• What is the meaning of b1?
. regress hypo group if group<2

      Source |       SS       df       MS              Number of obs =      55
-------------+------------------------------           F(  1,    53) =    3.98
       Model |  .001717456     1  .001717456           Prob > F      =  0.0511
    Residual |  .022842995    53     .000431           R-squared     =  0.0699
-------------+------------------------------           Adj R-squared =  0.0524
       Total |   .02456045    54  .000454823           Root MSE      =  .02076

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |  -.0112226    .005622    -2.00   0.051    -.0224989    .0000537
       _cons |   .4115976   .0041521    99.13   0.000     .4032695    .4199256

The group row contains the estimate of b1, the p-value for H0: b1 = 0 (0.051), and its 95% confidence interval.
Multiple indicator variables
• We are now interested in the four group comparison. How could we do this?
– More than one indicator variable
• If we assume that the healthy controls are the reference group, the equation is:
HIi = b0 + b1 * BMSi + b2 * SPMSi + b3 * PPMSi + εi
– BMS=1 if benign MS; BMS=0 otherwise
– SPMS=1 if secondary progressive MS; SPMS=0 otherwise
– PPMS=1 if primary progressive MS; PPMS=0 otherwise
Group means
• How does the equation look for each group?
– HIi = b0 + εi for the healthy control group
– HIi = b0 + b1 + εi for the benign MS group
– HIi = b0 + b2 + εi for the SPMS group
– HIi = b0 + b3 + εi for the PPMS group
• b0 is the mean hypointensity for the healthy control
group
• b0+ b1 is the mean hypointensity for the BMS group
• b0 + b2 is the mean hypointensity for the SPMS
group
• b0 + b3 is the mean hypointensity for the PPMS group
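This mapping from coefficients to group means can be sketched numerically. A pure-Python example with made-up hypointensity values (not the course data): with one coefficient per group, least squares reproduces the sample group means, so each indicator coefficient is just a difference of means from the reference group.

```python
# Four groups coded with three indicator variables (reference = healthy
# controls). Illustrative made-up hypointensity values, not the course data.
groups = {
    "HC":   [0.412, 0.410, 0.413],
    "BMS":  [0.401, 0.399, 0.402],
    "SPMS": [0.394, 0.392],
    "PPMS": [0.390, 0.388],
}

def mean(v):
    return sum(v) / len(v)

# With a saturated dummy coding, the least squares solution hits each
# group mean, so the coefficients are differences of sample means.
b0 = mean(groups["HC"])               # intercept: healthy-control mean
b1 = mean(groups["BMS"]) - b0         # BMS coefficient
b2 = mean(groups["SPMS"]) - b0        # SPMS coefficient
b3 = mean(groups["PPMS"]) - b0        # PPMS coefficient

# Fitted mean for each group is b0 plus that group's coefficient
print(round(b0 + b1, 6) == round(mean(groups["BMS"]), 6))   # True
print(round(b0 + b3, 6) == round(mean(groups["PPMS"]), 6))  # True
```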
Interpretation of coefficients
• What does b1 mean?
– Difference between the mean of benign MS and the
mean of healthy controls (𝛽̂1= −0.011)
• What does b2 mean?
– Difference between mean of SPMS and the mean of
the healthy controls (𝛽̂2=−0.018)
• What does b3 mean?
– Difference between mean of PPMS and the mean of
the healthy controls (𝛽̂3=−0.022)
• Under the null hypothesis, what do these equal?
Null hypothesis
• The first step in this analysis is a global test of any group difference among the 4 groups
• The null hypothesis for this comparison using one-way ANOVA was
– H0: meanHC = meanBMS = meanSPMS = meanPPMS
• Can we do a global test based on our linear regression?
– What is the null hypothesis?
– If all the means were equal, what would b1, b2, and b3 equal?
Global test
• For the null hypothesis that all of the groups are the same, the test is based on three parameters
– For the groups to have equal means: b1 = b2 = b3 = 0

• So far in linear regression we have focused on


hypothesis tests based on single parameters
• How can we complete this test using STATA?
STATA output
• There are two alternative approaches to express this hypothesis in STATA
• In each case, we write which coefficients are equal to 0 in a test statement
• STATA provides the resulting p-value

. test (bms=0) (spms=0) (ppms=0)

 ( 1)  bms = 0
 ( 2)  spms = 0
 ( 3)  ppms = 0

       F(  3,   101) =    5.04
            Prob > F =    0.0027

. test bms=spms=ppms=0

 ( 1)  bms - spms = 0
 ( 2)  bms - ppms = 0
 ( 3)  bms = 0

       F(  3,   101) =    5.04
            Prob > F =    0.0027

Both forms give the same F statistic and p-value for the four-group comparison.
Hypothesis test
1) H0: No difference in the mean HI between groups:
b1 = b2 = b3 =0
2) Continuous outcome, categorical predictor
3) Linear regression
4) Test statistic: F=5.04 (3,101 dof)
5) p-value=0.0027
6) Since the p-value is less than 0.05, we reject the
null hypothesis
7) We conclude that there is a significant difference in
the mean HI between the groups
Comparison to ANOVA
• As we know, we could have investigated this
same hypothesis using one-way ANOVA
Hypothesis test
1) H0: No difference in the mean HI between groups:
µHC = µBMS = µSPMS = µPPMS
2) Continuous outcome, categorical predictor
3) ANOVA test
4) Test statistic: F=5.04 (3,101 dof)
5) p-value=0.0027
6) Since the p-value is less than 0.05, we reject the
null hypothesis
7) We conclude that there is a significant difference in
the mean HI between the groups
STATA output
. oneway hypo group, tab

                     Summary of hypo
  group |       Mean    Std. Dev.       Freq.
--------+------------------------------------
      0 |  .41159755    .01696495          25
      1 |  .40037495    .02344147          30
      2 |  .39322455    .02235947          28
      3 |  .38937768    .02218831          22
--------+------------------------------------
  Total |  .39883603    .02271744         105

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      .006992274      3   .002330758      5.04     0.0027
 Within groups      .046680269    101   .000462181
------------------------------------------------------------------------
    Total           .053672543    104   .000516082

Bartlett's test for equal variances:  chi2(3) =   2.8540  Prob>chi2 = 0.415

Note that the estimated group means and the p-value are the same as we obtained from our regression analysis

AWESOME!!
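The agreement is not a coincidence: the global F statistic can be computed either from the ANOVA sum-of-squares decomposition or from the regression R-squared, and the two formulas are algebraically identical. A short sketch using the sums of squares reported in the output above:

```python
# The global test of b1 = b2 = b3 = 0 and the one-way ANOVA F are the same
# statistic computed two ways. Sums of squares from the Stata output above.
ssb, ssw = 0.006992274, 0.046680269   # between / within (model / residual)
k, n = 4, 105                          # number of groups, total observations

# ANOVA form: ratio of mean squares
f_anova = (ssb / (k - 1)) / (ssw / (n - k))

# Regression form: F from R-squared
r2 = ssb / (ssb + ssw)
f_reg = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print(round(f_anova, 2))   # 5.04, matching Stata
print(round(f_reg, 2))     # 5.04, identical by algebra
```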
Pairwise comparisons
• Since we found a significant difference among the groups, we would like to know which groups differ. Therefore, we test whether the differences between each pair of groups are statistically significant
• Using the regression output, we can calculate these p-values, unadjusted for multiple comparisons
. regress hypo bms spms ppms

      Source |       SS       df       MS              Number of obs =     105
-------------+------------------------------           F(  3,   101) =    5.04
       Model |  .006992274     3  .002330758           Prob > F      =  0.0027
    Residual |  .046680269   101  .000462181           R-squared     =  0.1303
-------------+------------------------------           Adj R-squared =  0.1044
       Total |  .053672543   104  .000516082           Root MSE      =   .0215

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bms |  -.0112226   .0058218    -1.93   0.057    -.0227715    .0003263
        spms |   -.018373   .0059155    -3.11   0.002    -.0301079   -.0066381
        ppms |  -.0222199   .0062845    -3.54   0.001    -.0346867    -.009753
       _cons |   .4115976   .0042997    95.73   0.000     .4030681     .420127

The spms row contains the estimate of b2, the p-value for H0: b2 = 0 (0.002), and its 95% confidence interval. This tests Ho: no difference in the mean HI between healthy controls and SPMS, β2=0.
. regress hypo bms spms ppms        (same output as above)

In the same output, the ppms row contains the estimate of b3 (-.0222199), the p-value for H0: b3 = 0 (0.001), and its 95% confidence interval (-.0346867, -.009753). This tests Ho: no difference in the mean HI between healthy controls and PPMS.
. test bms=spms

 ( 1)  bms - spms = 0

       F(  1,   101) =    1.60
            Prob > F =    0.2085

. lincom bms-spms

 ( 1)  bms - spms = 0

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0071504   .0056491     1.27   0.209    -.0040559    .0183567

The test command gives the p-value for H0: b1 - b2 = 0; lincom additionally reports the estimate of b1 - b2 with its 95% confidence interval.

Pairwise t-test
• Here are the pairwise t-test results. The adjusted p-value multiplies the unadjusted p-value by the number of comparisons

  Group 1   Group 2   Mean difference   Unadjusted p-value   Adjusted p-value
  HC        BMS            -0.011            0.057                0.34
  HC        SPMS           -0.018            0.0025               0.015
  HC        PPMS           -0.022            0.00062              0.004
  BMS       SPMS           -0.007            0.21                 1.00
  BMS       PPMS           -0.011            0.071                0.43
  SPMS      PPMS           -0.0038           0.53                 1.00
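The Bonferroni adjustment above is simple enough to reproduce by hand: multiply each unadjusted p-value by the number of comparisons (6 pairs among 4 groups) and cap the result at 1. A short sketch using the unadjusted p-values from the table:

```python
# Bonferroni correction: adjusted p = min(1, unadjusted p * m comparisons).
# Unadjusted p-values taken from the pairwise table above.
unadjusted = {
    ("HC", "BMS"):    0.057,
    ("HC", "SPMS"):   0.0025,
    ("HC", "PPMS"):   0.00062,
    ("BMS", "SPMS"):  0.21,
    ("BMS", "PPMS"):  0.071,
    ("SPMS", "PPMS"): 0.53,
}
m = len(unadjusted)  # 6 pairwise comparisons among 4 groups

adjusted = {pair: min(1.0, p * m) for pair, p in unadjusted.items()}

for pair, p in adjusted.items():
    print(pair, round(p, 3))
```

The rounded results match the adjusted column of the table (e.g. 0.057 * 6 = 0.34 and 0.21 * 6 = 1.26, capped at 1.00).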
STATA output
. oneway hypo group, tabulate bonferroni

                     Summary of hypo
  group |       Mean    Std. Dev.       Freq.
--------+------------------------------------
      0 |  .41159755    .01696495          25
      1 |  .40037495    .02344147          30
      2 |  .39322455    .02235947          28
      3 |  .38937768    .02218831          22
--------+------------------------------------
  Total |  .39883603    .02271744         105

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      .006992274      3   .002330758      5.04     0.0027
 Within groups      .046680269    101   .000462181
------------------------------------------------------------------------
    Total           .053672543    104   .000516082

Bartlett's test for equal variances:  chi2(3) =   2.8540  Prob>chi2 = 0.415

                 Comparison of hypo by group
                        (Bonferroni)
Row Mean- |
Col Mean  |         0          1          2
----------+---------------------------------
        1 |  -.011223
          |     0.340
          |
        2 |  -.018373    -.00715
          |     0.015      1.000
          |
        3 |   -.02222   -.010997    -.003847
          |     0.004      0.428       1.000

Each cell shows the pairwise group difference and its Bonferroni-corrected p-value.
Conclusion
• Indicator variables can be used to represent dichotomous variables in a regression equation
• Interpretation of the coefficient for an indicator variable is the same as for a continuous variable
– Provides a group comparison
• When we have more than two groups, we can do global comparisons and pairwise comparisons using the regression output
• ALWAYS WRITE DOWN THE MODEL


What we learned
• At the end of this lecture, you will be able to:
– Interpret regression coefficients for indicator variables
– Compare linear regression to a t-test and ANOVA
