
MED-0323: Introduction to Biostatistics

Class 14: Linear Regression II

November 15, 2019


Iván Sisa, MD, MPH, MS
isisa@usfq.edu.ec
Linear regression

Farzad Noubary, PhD


Lecture outline
• Definition of indicator variables
• Interpretation of regression coefficients for indicator variables
• Relationship between linear regression and t-test
• Relationship between linear regression and ANOVA
Goals of lecture
• At the end of this lecture, you will be able to:
– Interpret regression coefficients for indicator
variables
– Compare linear regression to a t-test and ANOVA
Previous classes
• Simple linear regression
– Continuous outcome (required), continuous
covariate
– Relationship to correlation
Dichotomous predictors
Big picture
• One of the biggest advantages of linear regression is
that we can incorporate continuous, dichotomous,
and categorical covariates/predictors/explanatory
variables
Linear regression with dichotomous
predictor
• In the previous example, we demonstrated that
there was a significant association between age and
BPF in our sample of MS patients
• We might also be interested in whether there is an effect of gender on BPF in MS patients
• To do this, we could use an indicator variable, which
equals 1 for male and 0 for female. The resulting
regression equation for BPF is
E(BPF | male) = b0 + b1 * male
BPFi = b0 + b1 * malei + εi
Graph
[Scatter plot: BPF (0.75 to 0.95) on the y-axis by gender (F, M) on the x-axis, with dashes marking the individual observations in each group]
• The regression equation can be rewritten as
BPFfemale = b0 + εi
BPFmale = b0 + b1 + εi
• What is the meaning of the coefficients in this case?
– b0 is the mean BPF when male=0
• The females
– b0 + b1 is the mean BPF when male=1
• The males
• What is the interpretation of b1?
– For a one-unit increase in the male indicator, there is a b1 increase in the mean BPF (this needs to be interpreted in terms of how the variable was coded in the dataset)
– Equivalently, b1 is the difference in mean BPF between the males and the females
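These identities can be checked numerically. A minimal pure-Python sketch with made-up BPF values (not the course dataset): with a single 0/1 indicator, the least squares intercept equals the reference-group mean and the slope equals the difference in group means.

```python
# Simple least squares with a 0/1 indicator: the intercept equals the
# mean of the reference group (female) and the slope equals the
# difference in group means. Illustrative made-up data, not the MS dataset.
bpf  = [0.84, 0.80, 0.83, 0.79, 0.88, 0.86, 0.90]
male = [0,    0,    0,    0,    1,    1,    1]

n = len(bpf)
xbar = sum(male) / n
ybar = sum(bpf) / n

# Closed-form simple regression: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(male, bpf)) / \
     sum((x - xbar) ** 2 for x in male)
b0 = ybar - b1 * xbar

mean_f = sum(y for x, y in zip(male, bpf) if x == 0) / male.count(0)
mean_m = sum(y for x, y in zip(male, bpf) if x == 1) / male.count(1)

print(round(b0, 6) == round(mean_f, 6))           # intercept = female mean
print(round(b1, 6) == round(mean_m - mean_f, 6))  # slope = difference in means
```

The same identity is why the Stata coefficient table reproduces the two sample means exactly.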
. regress bpf male

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0371364   .0203586     1.82   0.079    -.004636     .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865

The male coefficient is the estimated slope b1; _cons is the estimated intercept b0. Substituting the estimates step by step:

BPFˆ = bˆ0 + bˆ1 * male
BPFˆ = 0.823 + bˆ1 * male
BPFˆ = 0.823 + 0.037 * male

Interpretation of results
• The final regression equation is
BPFˆ = 0.823 + 0.037 * male
• The meaning of the coefficients in this case are
– 0.823 is the estimate of the mean BPF in the female
group
– 0.037 is the estimated difference in mean BPF between the males and the females
– What is the estimated mean BPF in the males?
• 0.86
• How could we test if the difference between the
groups is statistically significant?
Hypothesis test
1) H0: There is no difference based on gender (b1 =0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t=1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to
reject the null hypothesis
7) We conclude that there is not enough evidence to
suggest that there is a significant difference in the
mean BPF in males compared to females
. regress bpf male

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =    3.33
       Model |  .007323547     1  .007323547           Prob > F      =  0.0792
    Residual |  .059426595    27  .002200985           R-squared     =  0.1097
-------------+------------------------------           Adj R-squared =  0.0767
       Total |  .066750142    28  .002383934           Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0371364   .0203586     1.82   0.079    -.004636     .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865

The male row contains the estimated slope b1, the p-value for H0: b1 = 0 (0.079), and its 95% confidence interval.
T-test
• As you hopefully remember, you could have
tested this same null hypothesis using a two
sample t-test
• Linear regression makes an equal variance
assumption, so let’s use the same assumption
for our t-test
. ttest bpf, by(male)

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
       1 |       7         .86    .0196457    .0519775    .8119288    .9080712
---------+--------------------------------------------------------------------
combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
---------+--------------------------------------------------------------------
    diff |           -.0371364    .0203586               -.0789087     .004636

    diff = mean(0) - mean(1)                                      t =  -1.8241
Ho: diff = 0                                     degrees of freedom =       27

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0396         Pr(|T| > |t|) = 0.0792          Pr(T > t) = 0.9604

Group 0 is the females and group 1 the males; the diff row gives the estimated group difference with its 95% confidence interval, and the middle alternative gives the two-sided p-value.

Note that the sample means and the two-sided p-value for the two-group comparison are the same as we obtained from our regression analysis.

Note that the estimated difference has the opposite sign compared to the linear regression (here diff = mean(0) - mean(1), while the regression slope is mean(1) - mean(0)), but the magnitude is the same. It is critical to remember the meaning of the sign.
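The equivalence between the pooled two-sample t-test and the regression slope test can be sketched in pure Python (illustrative made-up data, not the course dataset): the regression standard error for a 0/1 indicator is exactly the pooled two-sample standard error, so the t-statistics agree up to sign.

```python
import math

# Pooled two-sample t-test (equal variances) compared with the regression
# slope t-statistic for the same 0/1 indicator. Made-up data.
g0 = [0.84, 0.80, 0.83, 0.79]   # females (male = 0)
g1 = [0.88, 0.86, 0.90]         # males   (male = 1)

def mean(v):
    return sum(v) / len(v)

def ss(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v)

n0, n1 = len(g0), len(g1)
sp2 = (ss(g0) + ss(g1)) / (n0 + n1 - 2)       # pooled variance
se = math.sqrt(sp2 * (1 / n0 + 1 / n1))       # standard error of the difference

t_test = (mean(g0) - mean(g1)) / se           # Stata's diff = mean(0) - mean(1)
t_reg = (mean(g1) - mean(g0)) / se            # regression slope / its SE

print(round(t_test, 9) == round(-t_reg, 9))   # same magnitude, opposite sign
```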
Hypothesis test
1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t= -1.82 (27 dof)
5) p-value=0.079
6) Since the p-value is more than 0.05, we fail to reject the
null hypothesis
7) We conclude that there is no significant difference in the
mean BPF in males compared to females

I prefer: We conclude that there is not enough evidence to


suggest that there is a significant difference in the mean
BPF in males compared to females
Amazing!!!
• We get the same result using both
approaches!!
• Why use linear regression?
– Allows more complex model through multiple
regression

– Extensions allow non-continuous outcomes

– Allows correlation among observations


Categorical predictor
• In our discussion of ANOVA, we investigated the hypointensity of brain structures in healthy controls, benign MS, SPMS, and PPMS patients
• In this case, we had a categorical predictor and a continuous outcome
• The question was whether there were any differences between these groups
• How could we investigate this using linear regression?
Review: indicator variables
• Let’s assume that our reference group is the healthy controls
• If we were just interested in comparing the benign MS patients to the healthy controls, we would use the following regression equation
HIi = b0 + b1 * BMSi + εi
– BMSi = 0 for healthy controls
– BMSi = 1 for benign MS patients
• What is the meaning of b1?
. regress hypo group if group<2

      Source |       SS       df       MS              Number of obs =      55
-------------+------------------------------           F(  1,    53) =    3.98
       Model |  .001717456     1  .001717456           Prob > F      =  0.0511
    Residual |  .022842995    53     .000431           R-squared     =  0.0699
-------------+------------------------------           Adj R-squared =  0.0524
       Total |   .02456045    54  .000454823           Root MSE      =  .02076

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |  -.0112226    .005622    -2.00   0.051    -.0224989    .0000537
       _cons |   .4115976   .0041521    99.13   0.000     .4032695    .4199256

The group row contains the estimate of b1, the p-value for H0: b1 = 0 (0.051), and its 95% confidence interval.
Multiple indicator variables
• We are now interested in the four group comparison. How could we do this?
– More than one indicator variable
• If we assume that the healthy controls are the reference group, the equation is:
HIi = b0 + b1 * BMSi + b2 * SPMSi + b3 * PPMSi + εi
– BMS=1 if benign MS; BMS=0 otherwise
– SPMS=1 if secondary progressive MS; SPMS=0 otherwise
– PPMS=1 if primary progressive MS; PPMS=0 otherwise
Group means
• How does the equation look for each group?
– HIi = b0 + εi for the healthy control group
– HIi = b0 + b1 + εi for the benign MS group
– HIi = b0 + b2 + εi for the SPMS group
– HIi = b0 + b3 + εi for the PPMS group
• b0 is the mean hypointensity for the healthy control
group
• b0+ b1 is the mean hypointensity for the BMS group
• b0 + b2 is the mean hypointensity for the SPMS
group
• b0 + b3 is the mean hypointensity for the PPMS group
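This mapping from coefficients to group means can be sketched numerically. A pure-Python example with made-up hypointensity values (not the course data): with one coefficient per group, least squares reproduces the sample group means, so each indicator coefficient is just a difference of means from the reference group.

```python
# Four groups coded with three indicator variables (reference = healthy
# controls). Illustrative made-up hypointensity values, not the course data.
groups = {
    "HC":   [0.412, 0.410, 0.413],
    "BMS":  [0.401, 0.399, 0.402],
    "SPMS": [0.394, 0.392],
    "PPMS": [0.390, 0.388],
}

def mean(v):
    return sum(v) / len(v)

# With a saturated dummy coding, the least squares solution hits each
# group mean, so the coefficients are differences of sample means.
b0 = mean(groups["HC"])               # intercept: healthy-control mean
b1 = mean(groups["BMS"]) - b0         # BMS coefficient
b2 = mean(groups["SPMS"]) - b0        # SPMS coefficient
b3 = mean(groups["PPMS"]) - b0        # PPMS coefficient

# Fitted mean for each group is b0 plus that group's coefficient
print(round(b0 + b1, 6) == round(mean(groups["BMS"]), 6))   # True
print(round(b0 + b3, 6) == round(mean(groups["PPMS"]), 6))  # True
```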
Interpretation of coefficients
• What does b1 mean?
– Difference between the mean of benign MS and the
mean of healthy controls (𝛽̂1= −0.011)
• What does b2 mean?
– Difference between mean of SPMS and the mean of
the healthy controls (𝛽̂2=−0.018)
• What does b3 mean?
– Difference between mean of PPMS and the mean of
the healthy controls (𝛽̂3=−0.022)
• Under the null hypothesis, what do these equal?
Null hypothesis
• The first step in this analysis is a global test of any group difference among the 4 groups
• The null hypothesis for this comparison using one-way ANOVA was
– H0: meanHC = meanBMS = meanSPMS = meanPPMS
• Can we do a global test based on our linear regression?
– What is the null hypothesis?
– If all the means were equal, what would b1, b2, and b3 equal?
Global test
• For the null hypothesis that all of the groups are the same, the test is based on three parameters
– For the groups to have equal means: b1 = b2 = b3 = 0

• So far in linear regression we have focused on


hypothesis tests based on single parameters
• How can we complete this test using STATA?
STATA output
• There are two alternative approaches to express this hypothesis in STATA
• In each case, we write which coefficients are equal to 0 in a test statement
• STATA provides the resulting p-value

. test (bms=0) (spms=0) (ppms=0)

 ( 1)  bms = 0
 ( 2)  spms = 0
 ( 3)  ppms = 0

       F(  3,   101) =    5.04
            Prob > F =    0.0027

. test bms=spms=ppms=0

 ( 1)  bms - spms = 0
 ( 2)  bms - ppms = 0
 ( 3)  bms = 0

       F(  3,   101) =    5.04
            Prob > F =    0.0027

Both forms give the same F statistic and p-value for the four-group comparison.
Hypothesis test
1) H0: No difference in the mean HI between groups:
b1 = b2 = b3 =0
2) Continuous outcome, categorical predictor
3) Linear regression
4) Test statistic: F=5.04 (3,101 dof)
5) p-value=0.0027
6) Since the p-value is less than 0.05, we reject the
null hypothesis
7) We conclude that there is a significant difference in
the mean HI between the groups
Comparison to ANOVA
• As we know, we could have investigated this
same hypothesis using one-way ANOVA
Hypothesis test
1) H0: No difference in the mean HI between groups:
µHC = µBMS = µSPMS = µPPMS
2) Continuous outcome, categorical predictor
3) ANOVA test
4) Test statistic: F=5.04 (3,101 dof)
5) p-value=0.0027
6) Since the p-value is less than 0.05, we reject the
null hypothesis
7) We conclude that there is a significant difference in
the mean HI between the groups
STATA output
. oneway hypo group, tab

                     Summary of hypo
  group |       Mean    Std. Dev.       Freq.
--------+------------------------------------
      0 |  .41159755    .01696495          25
      1 |  .40037495    .02344147          30
      2 |  .39322455    .02235947          28
      3 |  .38937768    .02218831          22
--------+------------------------------------
  Total |  .39883603    .02271744         105

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      .006992274      3   .002330758      5.04     0.0027
 Within groups      .046680269    101   .000462181
------------------------------------------------------------------------
    Total           .053672543    104   .000516082

Bartlett's test for equal variances:  chi2(3) =   2.8540  Prob>chi2 = 0.415

Note that the estimated group means and the p-value are the same as we obtained from our regression analysis

AWESOME!!
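The agreement is not a coincidence: the global F statistic can be computed either from the ANOVA sum-of-squares decomposition or from the regression R-squared, and the two formulas are algebraically identical. A short sketch using the sums of squares reported in the output above:

```python
# The global test of b1 = b2 = b3 = 0 and the one-way ANOVA F are the same
# statistic computed two ways. Sums of squares from the Stata output above.
ssb, ssw = 0.006992274, 0.046680269   # between / within (model / residual)
k, n = 4, 105                          # number of groups, total observations

# ANOVA form: ratio of mean squares
f_anova = (ssb / (k - 1)) / (ssw / (n - k))

# Regression form: F from R-squared
r2 = ssb / (ssb + ssw)
f_reg = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print(round(f_anova, 2))   # 5.04, matching Stata
print(round(f_reg, 2))     # 5.04, identical by algebra
```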
Pairwise comparisons
• Since we found a significant difference among the groups, we would like to know which groups differ. Therefore, we test whether the differences between each pair of groups are statistically significant
• Using the regression output, we can calculate these p-values, unadjusted for multiple comparisons
. regress hypo bms spms ppms

      Source |       SS       df       MS              Number of obs =     105
-------------+------------------------------           F(  3,   101) =    5.04
       Model |  .006992274     3  .002330758           Prob > F      =  0.0027
    Residual |  .046680269   101  .000462181           R-squared     =  0.1303
-------------+------------------------------           Adj R-squared =  0.1044
       Total |  .053672543   104  .000516082           Root MSE      =   .0215

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bms |  -.0112226   .0058218    -1.93   0.057    -.0227715    .0003263
        spms |   -.018373   .0059155    -3.11   0.002    -.0301079   -.0066381
        ppms |  -.0222199   .0062845    -3.54   0.001    -.0346867    -.009753
       _cons |   .4115976   .0042997    95.73   0.000     .4030681     .420127

The spms row contains the estimate of b2, the p-value for H0: b2 = 0 (0.002), and its 95% confidence interval. This tests Ho: no difference in the mean HI between healthy controls and SPMS, β2=0.
. regress hypo bms spms ppms        (same output as above)

In the same output, the ppms row contains the estimate of b3 (-.0222199), the p-value for H0: b3 = 0 (0.001), and its 95% confidence interval (-.0346867, -.009753). This tests Ho: no difference in the mean HI between healthy controls and PPMS.
. test bms=spms

 ( 1)  bms - spms = 0

       F(  1,   101) =    1.60
            Prob > F =    0.2085

. lincom bms-spms

 ( 1)  bms - spms = 0

        hypo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0071504   .0056491     1.27   0.209    -.0040559    .0183567

The test command gives the p-value for H0: b1 - b2 = 0; lincom additionally reports the estimate of b1 - b2 with its 95% confidence interval.

Pairwise t-test
• Here are the pairwise t-test results. The adjusted p-value multiplies the unadjusted p-value by the number of comparisons

  Group 1   Group 2   Mean difference   Unadjusted p-value   Adjusted p-value
  HC        BMS            -0.011            0.057                0.34
  HC        SPMS           -0.018            0.0025               0.015
  HC        PPMS           -0.022            0.00062              0.004
  BMS       SPMS           -0.007            0.21                 1.00
  BMS       PPMS           -0.011            0.071                0.43
  SPMS      PPMS           -0.0038           0.53                 1.00
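The Bonferroni adjustment above is simple enough to reproduce by hand: multiply each unadjusted p-value by the number of comparisons (6 pairs among 4 groups) and cap the result at 1. A short sketch using the unadjusted p-values from the table:

```python
# Bonferroni correction: adjusted p = min(1, unadjusted p * m comparisons).
# Unadjusted p-values taken from the pairwise table above.
unadjusted = {
    ("HC", "BMS"):    0.057,
    ("HC", "SPMS"):   0.0025,
    ("HC", "PPMS"):   0.00062,
    ("BMS", "SPMS"):  0.21,
    ("BMS", "PPMS"):  0.071,
    ("SPMS", "PPMS"): 0.53,
}
m = len(unadjusted)  # 6 pairwise comparisons among 4 groups

adjusted = {pair: min(1.0, p * m) for pair, p in unadjusted.items()}

for pair, p in adjusted.items():
    print(pair, round(p, 3))
```

The rounded results match the adjusted column of the table (e.g. 0.057 * 6 = 0.34 and 0.21 * 6 = 1.26, capped at 1.00).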
STATA output
. oneway hypo group, tabulate bonferroni

                     Summary of hypo
  group |       Mean    Std. Dev.       Freq.
--------+------------------------------------
      0 |  .41159755    .01696495          25
      1 |  .40037495    .02344147          30
      2 |  .39322455    .02235947          28
      3 |  .38937768    .02218831          22
--------+------------------------------------
  Total |  .39883603    .02271744         105

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      .006992274      3   .002330758      5.04     0.0027
 Within groups      .046680269    101   .000462181
------------------------------------------------------------------------
    Total           .053672543    104   .000516082

Bartlett's test for equal variances:  chi2(3) =   2.8540  Prob>chi2 = 0.415

                 Comparison of hypo by group
                        (Bonferroni)
Row Mean- |
Col Mean  |         0          1          2
----------+---------------------------------
        1 |  -.011223
          |     0.340
          |
        2 |  -.018373    -.00715
          |     0.015      1.000
          |
        3 |   -.02222   -.010997    -.003847
          |     0.004      0.428       1.000

Each cell shows the pairwise group difference and its Bonferroni-corrected p-value.
Conclusion
• Indicator variables can be used to represent dichotomous variables in a regression equation
• Interpretation of the coefficient for an indicator variable is the same as for a continuous variable
– Provides a group comparison
• When we have more than two groups, we can do global comparisons and pairwise comparisons using the regression output
• ALWAYS WRITE DOWN THE MODEL


What we learned
• At the end of this lecture, you will be able to:
– Interpret regression coefficients for indicator variables
– Compare linear regression to a t-test and ANOVA
