
1.2 Multiple Linear Regression
Hector Lemus
Spring 2016

Multiple Linear Regression

Examine the relationship between a set of independent variables and a single continuous dependent variable.

Uses of multiple linear regression:


1. To assess the relationship between the dependent and the independent
variables simultaneously taking into account the intercorrelations
among the independent variables.
2. To examine the effect of one or more variables on the dependent
variable after controlling (adjusting) for the effects of the other
variables in the model.
3. To assess the interaction of two or more independent variables with
respect to the dependent variable.
4. To develop a prediction equation.

Example

Dependent variable: Systolic blood pressure

Independent variables:
1. BMI
2. Age
3. Smoking history:
0 = Nonsmoker
1 = Current or Previous Smoker

Hypothetical Example

Carry out a clinical trial to examine the effectiveness of a drug to treat hypertension.
Half of the patients are randomly assigned to the active drug and half assigned to placebo.
The dependent variable is change in diastolic blood pressure (DBP) from the baseline evaluation to the 6-month evaluation.
Suppose we observe the following mean changes in DBP stratified by age:

Age (years)    Active    Placebo
<60             -10        -2
≥60              -1        -2

This is an example of an interaction between drug and age: a differential effect of the active drug, whose effectiveness varies by age.

Hypothetical Example

Multiple linear regression may be used to assess the degree of interaction and to test whether the interaction is statistically significant.

Dependent variable: Change in DBP

Independent variables:
1. Age
2. Drug group (Active/Placebo)
3. Interaction term (to be discussed)

Multiple Linear Regression Model


Notation:
Let Y be the dependent variable.
Let X1, …, Xk be the independent variables.

Model:

Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + E
  = β0 + Σ(i=1 to k) βiXi + E

where β0, β1, …, βk are regression coefficients to be estimated and E is the error term, which is a random variable.

E has a distribution, and for testing hypotheses we need to make an assumption about its distribution.

Assumptions for MLR


1. Y is a random variable with a distribution of values for each specific combination of values of the Xs.

2. The observations of Y are statistically independent of each other.

3. The mean value of Y for each specific combination of the Xs is given by

   μ = β0 + β1X1 + β2X2 + ⋯ + βkXk

4. The variance of Y is the same for any fixed combination of the Xs.

For hypothesis testing, we need one more assumption:

5. Y is normally distributed for each specific combination of the Xs.

Estimating with Least Squares


Basic Idea: Find estimates of the βs which minimize the sum of the squared distances between the observed and corresponding predicted values.

Let the predicted value be

Ŷ = β̂0 + β̂1X1 + β̂2X2 + ⋯ + β̂kXk

Find the estimated parameters β̂0, β̂1, …, β̂k which minimize

Σ(i=1 to n) (Yi − Ŷi)² = Σ(i=1 to n) (Yi − β̂0 − β̂1X1i − ⋯ − β̂kXki)²

This quantity defines the error sum of squares, denoted by SSE. It is also called the residual sum of squares.

The difference between the observed and predicted value of Y yields an estimate for E:

Ê = Y − Ŷ

is called the residual.
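As an illustration (not from the lecture), the least-squares idea can be sketched in Python with NumPy on made-up data; all variable names and values here are hypothetical:

```python
import numpy as np

# Hypothetical data (not the SBP data): n = 8 observations, k = 2 predictors.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]   # exact linear relationship, no error

# Add a column of ones so beta_hat[0] plays the role of the intercept.
X1 = np.column_stack([np.ones(len(y)), X])

# Least squares chooses beta_hat to minimize SSE = sum((y - X1 @ beta)^2).
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat          # predicted values
residuals = y - y_hat          # estimates of the error term E
sse = float(np.sum(residuals ** 2))
```

With error-free data the fit is exact, so the recovered coefficients are (1, 2, 3) and SSE is essentially zero.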

Computing the Parameter Estimates

Details for computing the β̂s will NOT be discussed.

Requires matrix algebra and calculus.

ANOVA Table for MLR


SSY = Σ(i=1 to n) (Yi − Ȳ)²        (total sum of squares)
SSE = Σ(i=1 to n) (Yi − Ŷi)²       (residual sum of squares)
SSReg = SSY − SSE                  (regression sum of squares)

Source       df         SS           MS                           F
Regression   k          SSY − SSE    MSReg = (SSY − SSE)/k        F = MSReg/MSRes
Residual     n − k − 1  SSE          MSRes = SSE/(n − k − 1)
Total        n − 1      SSY
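A minimal Python sketch of this decomposition, on simulated data (not the lecture's example; the coefficients and seed are arbitrary):

```python
import numpy as np

# Simulated data: n = 30 observations, k = 3 predictors (purely illustrative).
rng = np.random.default_rng(0)
n, k = 30, 3
X = rng.normal(size=(n, k))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

ssy = float(np.sum((y - y.mean()) ** 2))   # total SS
sse = float(np.sum((y - y_hat) ** 2))      # residual SS
ss_reg = ssy - sse                          # regression SS

ms_reg = ss_reg / k                         # regression mean square
ms_res = sse / (n - k - 1)                  # residual mean square
F = ms_reg / ms_res                         # overall F statistic
```

Note that SSReg + SSE reproduces SSY exactly, mirroring the table's partition.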

Coefficient of Determination
Proportion of variability of Y that can be explained by the model
R² = (SSY − SSE) / SSY,   0 ≤ R² ≤ 1

If R² = 1, we have a perfect fit: the model explains all of the variability.

If R² = 0, the model explains none of the variability.
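A tiny helper (illustrative, not part of the lecture) makes the two boundary cases concrete:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = (SSY - SSE) / SSY."""
    ssy = np.sum((y - np.mean(y)) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return float((ssy - sse) / ssy)

x = np.arange(10.0)
y = 3.0 + 2.0 * x

print(r_squared(y, y))                          # perfect fit: 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # mean-only "model": 0.0
```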

Testing Hypotheses in MLR


Three test types:
1. Overall test: Does the set of independent variables taken together explain a
significant amount of the variability in Y?

Taken together, does the set of BMI, Age and Smoking History explain a
significant amount of the variability of SBP?

2. Test for addition of a single variable: Given a set of independent variables in the model, does the addition of one variable explain a significant amount of the variability of Y?

Evaluate the relationship between one independent variable and Y after controlling (adjusting) for the other variables in the model.
Given that Age and Smoking History are in the model, what is the relationship between BMI and SBP?

3. Test for the addition of a group of variables: Given a set of independent variables in the model, does the addition of another set of variables explain a significant amount of the variability of Y?

Assess the relationship of a set of behavioral variables measuring stress with DBP, adjusting for known factors related to DBP such as Age and BMI.

Nested Models: The Full Model

A full model contains all of the variables of interest.

For example:

Suppose we test the association between BMI and SBP after adjusting for
Age and Smoking History

SBP = β0 + β1(BMI) + β2(Age) + β3(Smoking History) + E

H0: β1 = 0

Nested Models: The Reduced Model

If H0 is true, then the most appropriate model is

SBP = β0 + β2(Age) + β3(Smoking History) + E

This is the reduced model when H0 is true.

Testing H0 is equivalent to testing which of the two models is more appropriate.

Note that no new variables are introduced in the reduced model.

The concepts of nested (full and reduced) models will apply to all of the tests that we discuss.

Test for Overall Regression

The full model: Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + E

We have k independent variables.

Three ways of stating the same null hypothesis:

1. H0: The k independent variables taken together do not explain a significant amount of the variability in Y.
2. H0: The overall regression using the k independent variables is not statistically significant.
3. H0: β1 = β2 = ⋯ = βk = 0

The reduced model: Y = β0 + E

The Test Statistic

Use the F statistic from the ANOVA table:

F = MSReg / MSRes

When H0 is true, F ~ F-dist with k and n−k−1 degrees of freedom.

Reject H0 for large values of F.

F(k, n−k−1, 1−α): the 100(1−α) percentile from the F-dist with k and n−k−1 degrees of freedom, where α is our chosen level of significance.

Decision rule: Reject H0 if F > F(k, n−k−1, 1−α)

The percentile is the critical value or critical point.
Alternatively, compute the p-value and compare to the α level.
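As a sketch of the decision rule in Python (assuming SciPy is available), using the dimensions and observed F from the SBP example on the following slides:

```python
from scipy.stats import f

k, n = 3, 32            # dimensions from the SBP example
alpha = 0.05
F_obs = 29.71           # observed F from the SAS ANOVA table

crit = f.ppf(1 - alpha, k, n - k - 1)   # F_{k, n-k-1, 1-alpha}, about 2.95
p_value = f.sf(F_obs, k, n - k - 1)     # P(F > F_obs) under H0

reject = F_obs > crit
print(round(crit, 2), reject)   # 2.95 True
```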

SBP Example

Determine whether BMI, Age, and Smoking History taken together account for a significant amount of the variability of SBP.

Y: SBP, X1: BMI, X2: Age, X3: Smoking History
n = 32 subjects, k = 3

Full model: Y = β0 + β1X1 + β2X2 + β3X3 + E

H0: β1 = β2 = β3 = 0

Reduced model: Y = β0 + E

Under H0, F follows an F-dist with k = 3 and n−k−1 = 28 df.

SAS Output
The REG Procedure
Model: MODEL1
Dependent Variable: SBP Systolic Blood Pressure (mmHg)

Number of Observations Read    32
Number of Observations Used    32

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              3        4889.82570     1629.94190      29.71    <.0001
Error             28        1536.14305       54.86225
Corrected Total   31        6425.96875

Root MSE            7.40691    R-Square    0.7609
Dependent Mean    144.53125    Adj R-Sq    0.7353
Coeff Var           5.12478

Critical values:
At α = 0.05,  F(3, 28, 0.95)  = 2.95
At α = 0.01,  F(3, 28, 0.99)  = 4.57
At α = 0.001, F(3, 28, 0.999) = 7.19

Since F = 29.71 exceeds all of these, reject H0 and conclude that taken together the 3 variables account for a significant amount of the variability of SBP.

The Partial F Test

The regression sum of squares must be partitioned into components that can be used to test hypotheses about individual variables.

One type of breakdown is sequential, variables-added-in-order (called Type I in SAS).

X1: BMI, X2: Age, X3: Smoking History

Source                       df        SS
Regression   X1               1   3537.95
             X2 | X1          1    582.65
             X3 | X1, X2      1    769.23
Residual                     28   1536.14

SS(X1)

The sum of squares explained by using only X1 in the model.

This may be used to test whether BMI is linearly related to SBP without adjusting for any other variables.

Since, technically, X2 and X3 are not in the model, we pool their terms with the residual:
SSRes = 1536.14 + 582.65 + 769.23 = 2888.02
dfRes = 28 + 1 + 1 = 30

So, given the full model for this test,

Y = β0 + β1X1 + E

test H0: β1 = 0 using

F = (3537.95 / 1) / (2888.02 / 30) = 3537.95 / 96.28 = 36.75
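A quick Python check of the pooling arithmetic, using the numbers from the sequential table:

```python
ss_x1 = 3537.95
# X2 and X3 are not in this model, so pool their SS and df with the residual:
ss_res_pooled = 1536.14 + 582.65 + 769.23   # 2888.02
df_res_pooled = 28 + 1 + 1                  # 30

F = (ss_x1 / 1) / (ss_res_pooled / df_res_pooled)
print(round(F, 2))   # 36.75
```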

SS(X2|X1)

The extra sum of squares explained by adding Age to the model given BMI is already in the model.

Pooled error term:
SSRes = 1536.14 + 769.23 = 2305.37
dfRes = 28 + 1 = 29

Full:    Y = β0 + β1X1 + β2X2 + E
Reduced: Y = β0 + β1X1 + E

H0: β2 = 0  [Age is not related to SBP after adjusting for BMI.]

F = (582.65 / 1) / (2305.37 / 29) = 582.65 / 79.50 = 7.33

SS(X3|X1, X2)

The extra sum of squares explained by adding Smoking History to the model given BMI and Age are already in the model.

Full:    Y = β0 + β1X1 + β2X2 + β3X3 + E
Reduced: Y = β0 + β1X1 + β2X2 + E

H0: β3 = 0  [Smoking History is not associated with SBP after adjusting for BMI and Age.]

F = (769.23 / 1) / (1536.14 / 28) = 769.23 / 54.86 = 14.02

General Partial F Test


Full model: Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + β*X* + E

H0: The addition of X* to the model does not explain a significant amount of the variability of Y in the presence of X1, X2, …, Xp.

H0: X* is not significantly related to Y controlling for X1, X2, …, Xp.

H0: β* = 0

Reduced model: Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + E

Construction of the Test

To construct the partial F test, you need the extra sum of squares for X*. Denote:

SS(X* | X1, X2, …, Xp) = RegSS(X1, X2, …, Xp, X*) − RegSS(X1, X2, …, Xp)
                       = RegSS(Full) − RegSS(Reduced)

We also need the MSRes for the full model:

MSRes(Full) = SSRes(Full) / (n − p − 2)

So,

F(X* | X1, …, Xp) = SS(X* | X1, …, Xp) / MSRes(Full)

The statistic follows an F-dist with 1 and n−p−2 df.

Reject H0 if F(X* | X1, X2, …, Xp) > F(1, n−p−2, 1−α)
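This construction can be sketched as a general-purpose Python helper (illustrative function names, simulated data):

```python
import numpy as np

def reg_ss(X, y):
    """Regression SS for an intercept-plus-columns-of-X model."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ beta
    return float(np.sum((y - y.mean()) ** 2) - np.sum((y - y_hat) ** 2))

def partial_F(X_reduced, X_full, y):
    """F statistic for adding the extra column(s) of X_full to X_reduced."""
    n, p_star = X_full.shape                  # p_star = p + 1 when one X* is added
    extra_ss = reg_ss(X_full, y) - reg_ss(X_reduced, y)
    ssy = float(np.sum((y - y.mean()) ** 2))
    ms_res_full = (ssy - reg_ss(X_full, y)) / (n - p_star - 1)
    extra_df = p_star - X_reduced.shape[1]
    return (extra_ss / extra_df) / ms_res_full

# Simulated check: X* genuinely matters here, so F should be large.
rng = np.random.default_rng(1)
n = 40
X_red = rng.normal(size=(n, 2))
x_star = rng.normal(size=n)
y = 1.0 + X_red @ np.array([0.5, -0.5]) + 2.0 * x_star + rng.normal(scale=0.5, size=n)
X_full = np.column_stack([X_red, x_star])

F = partial_F(X_red, X_full, y)
```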

Example 1

Test whether Smoking History is related to SBP after controlling for Age and BMI.

Full model: Y = β0 + β1X1 + β2X2 + β3X3 + E
H0: β3 = 0

From the table:
SS(X3 | X1, X2) = 769.23
MSRes(X1, X2, X3) = 1536.14 / 28 = 54.86
F = 769.23 / 54.86 = 14.02

Since F(1, 28, 0.999) = 13.5 and 14.02 > 13.5, the p-value < 0.001.

Reject H0 and conclude Smoking History is significantly related to SBP after adjusting for BMI and Age.

Example 2

Test the relationship of BMI to SBP controlling for Age and Smoking History.

H0: β1 = 0

Full:    Y = β0 + β1X1 + β2X2 + β3X3 + E
Reduced: Y = β0 + β2X2 + β3X3 + E

We need SS(X1 | X2, X3), but it is not available in the table.

Note that SS(X1 | X2, X3) = SS(X1, X2, X3) − SS(X2, X3).

We know that SS(X1, X2, X3) = 4889.83 from the SAS output.

However, we would have to find SS(X2, X3) by fitting a model with only X2 and X3 in it.

It turns out SS(X2, X3) = 4689.69.

Example 2 (cont.)

SS(X1 | X2, X3) = 4889.83 − 4689.69 = 200.14
This is the marginal sum of squares; SAS can provide this information.

F(X1 | X2, X3) = 200.14 / 54.86 = 3.65
F(1, 28, 0.90) = 2.89
F(1, 28, 0.95) = 4.20

So 0.05 < p-value < 0.10.

Fail to reject H0 at α = 0.05.

There is no evidence to suggest a significant relationship between SBP and BMI adjusting for Age and Smoking History.
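The arithmetic above, checked directly with the slide's numbers:

```python
ss_model_full = 4889.83   # SS(X1, X2, X3) from the SAS output
ss_model_red = 4689.69    # SS(X2, X3) from refitting the reduced model
ms_res_full = 54.86

ss_extra = ss_model_full - ss_model_red   # SS(X1 | X2, X3)
F = ss_extra / ms_res_full
print(round(ss_extra, 2), round(F, 2))   # 200.14 3.65
```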

A T-test Equivalent

An equivalent test to the partial F test.

Full model: Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + β*X* + E

Test: H0: β* = 0

Could use F(X* | X1, X2, …, Xp) or, equivalently,

T = β̂* / s(β̂*)

where β̂* is the estimated regression parameter and s(β̂*) is its estimated standard error.

For a two-sided test:

Reject H0 if |T| > t(n−p−2, 1−α/2)

Example 2 (again)

Relationship of BMI to SBP adjusting for Age and Smoking History.

Parameter Estimates

Variable    Label              DF    Parameter    Standard    t Value    Pr > |t|
                                     Estimate     Error
Intercept   Intercept           1    45.10319     10.76488      4.19      0.0003
BMI         Body Mass Index     1     1.22225      0.63993      1.91      0.0664
AGE         Age (years)         1     1.21271      0.32382      3.75      0.0008
SMK         Smoking History     1     9.94557      2.65606      3.74      0.0008

T = 1.2223 / 0.6399 = 1.91,   p-value = 0.066

F = T² = (1.91)² = 3.65
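The F = T² equivalence can be checked with the BMI row of the table:

```python
est, se = 1.22225, 0.63993   # BMI estimate and standard error from the output

T = est / se                 # t statistic for H0: beta_BMI = 0
F = T ** 2                   # matches the partial F statistic for BMI

print(round(T, 2), round(F, 2))   # 1.91 3.65
```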

Partitioning the RegSS

1. SS(X1)
   SS(X2 | X1)
   SS(X3 | X1, X2)

   Leads to variables-added-in-order or sequential testing.
   This is SAS Type I SS.
   Useful if there is an ordering to the independent variables.

2. SS(X1 | X2, X3)
   SS(X2 | X1, X3)
   SS(X3 | X1, X2)

   Leads to variables-added-last or marginal testing.
   Each test adjusts for all other variables in the model.
   This is SAS Type II SS.

With the exception of the last test, these tests are not equivalent.

SAS Code and Output

proc reg data=sbp_data;
  model sbp = bmi age smk / ss1 ss2;
run; quit;

Parameter Estimates

Variable    Label              DF    Parameter    Standard    t Value    Pr > |t|
                                     Estimate     Error
Intercept   Intercept           1    45.10319     10.76488      4.19      0.0003
BMI         Body Mass Index     1     1.22225      0.63993      1.91      0.0664
AGE         Age (years)         1     1.21271      0.32382      3.75      0.0008
SMK         Smoking History     1     9.94557      2.65606      3.74      0.0008

Parameter Estimates

Variable    Label              DF     Type I SS     Type II SS
Intercept   Intercept           1     668457        963.09739
BMI         Body Mass Index     1     3537.94574    200.14147
AGE         Age (years)         1      582.64651    769.45920
SMK         Smoking History     1      769.23345    769.23345

MLR Table

Multiple Linear Regression of Systolic Blood Pressure versus selected characteristics (n = 32)

Characteristic         Estimated Coefficient    95% Confidence Interval    p-value
BMI (kg/m²)                    1.2                    -0.1, 2.5             0.066
Age (5 yr interval)            6.1                     2.7, 9.4            <0.001
Smoking History                9.9                     4.5, 15.4           <0.001

R² = 0.76

Multiple Partial F Test


Given that a set of independent variables is in the model, test for the addition of another set.

Uses:
1. The additional set represents a related group of variables; test a set of behavioral variables controlling for a set of demographic variables.
2. Test a set of interactions.
3. Assess the relationship of a categorical variable with 3 or more categories.

Generalization of Partial F Test


Full model: Y = β0 + β1X1 + ⋯ + βpXp + β*_{p+1}X*_{p+1} + ⋯ + β*_k X*_k + E

H0: The addition of X*_{p+1}, …, X*_k to the model does not explain a significant amount of the variability of Y in the presence of X1, X2, …, Xp.

H0: The set X*_{p+1}, …, X*_k is not significantly related to Y controlling for X1, X2, …, Xp.

H0: β*_{p+1} = ⋯ = β*_k = 0

Reduced model: Y = β0 + β1X1 + ⋯ + βpXp + E

Construction of the Test

Need the extra sum of squares from adding X*_{p+1}, …, X*_k to the model. Denote:

SS(X*_{p+1}, …, X*_k | X1, X2, …, Xp) = RegSS(Full) − RegSS(Reduced)

So,

F(X*_{p+1}, …, X*_k | X1, …, Xp) = [SS(X*_{p+1}, …, X*_k | X1, …, Xp) / (k − p)] / MSRes(Full)

The statistic follows an F-dist with k−p and n−k−1 df.

Reject H0 if F(X*_{p+1}, …, X*_k | X1, X2, …, Xp) > F(k−p, n−k−1, 1−α)

Example: SBP Data


Test for a set of interactions.

Let X1 = BMI
    X2 = Age
    X3 = Smoking History
    X4 = BMI × Age interaction
    X5 = BMI × Smoking History interaction
    X6 = Age × Smoking History interaction

Full model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + E

H0: β4 = β5 = β6 = 0

Reduced model: Y = β0 + β1X1 + β2X2 + β3X3 + E

Example: SBP Data (cont.)

ANOVA for the full model:

Source        SS         df    MS
Regression    5092.83     6    848.80
Residual      1333.14    25     53.33

ANOVA for the reduced model:

Source        SS         df    MS
Regression    4889.83     3    1629.94
Residual      1536.14    28      54.86

SS(X4, X5, X6 | X1, X2, X3) = 5092.83 − 4889.83 = 203.00

F(X4, X5, X6 | X1, X2, X3) = (203.00 / 3) / 53.33 = 1.27

Under H0, F follows an F-dist with 3, 25 df.

p-value > 0.25, so fail to reject H0.
The interactions taken together do not explain a significant amount of the variability of SBP.
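A quick check of this multiple partial F arithmetic, using the two ANOVA tables' numbers:

```python
ss_extra = 5092.83 - 4889.83   # SS(X4, X5, X6 | X1, X2, X3) = 203.00
df_extra = 3                   # three interaction terms added
ms_res_full = 53.33            # residual MS from the full-model ANOVA

F = (ss_extra / df_extra) / ms_res_full
print(round(F, 2))   # 1.27
```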

Constructing Extra SS

Suppose we have:
SS(X1)
SS(X2 | X1)
SS(X3 | X1, X2)

Suppose we have the full model Y = β0 + β1X1 + β2X2 + β3X3 + E and we want to test H0: β2 = β3 = 0.

So we need SS(X2, X3 | X1), which does not appear in the table.

SS(X2, X3 | X1) is the extra sum of squares explained by adding X2 and X3 to the model given X1 is already in the model.

SS(X2, X3 | X1) = SS(X1, X2, X3) − SS(X1)

Rewriting the Extra SS

SS(X2 | X1) = SS(X1, X2) − SS(X1)
SS(X3 | X1, X2) = SS(X1, X2, X3) − SS(X1, X2)

Therefore,

SS(X2 | X1) + SS(X3 | X1, X2) = SS(X1, X2) − SS(X1) + SS(X1, X2, X3) − SS(X1, X2)
                              = SS(X1, X2, X3) − SS(X1)
                              = SS(X2, X3 | X1)
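This identity can be verified numerically with the SBP example's Type I sums of squares:

```python
# Type I (sequential) SS from the slides
ss_x2_given_x1 = 582.65    # SS(X2 | X1)
ss_x3_given_x1x2 = 769.23  # SS(X3 | X1, X2)
ss_x1 = 3537.95            # SS(X1)
ss_full = 4889.83          # SS(X1, X2, X3)

lhs = ss_x2_given_x1 + ss_x3_given_x1x2   # SS(X2 | X1) + SS(X3 | X1, X2)
rhs = ss_full - ss_x1                     # SS(X2, X3 | X1)
print(round(lhs, 2), round(rhs, 2))   # 1351.88 1351.88
```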