Data Mining

Chapter 2: Multiple Linear Regression
Contents
2.1. A Review of Multiple Linear Regression
2.2. Illustration of the Regression Process
2.3. Subset Selection in Linear Regression
2.1 A Review of Multiple Linear Regression

In this section, we briefly review the multiple regression model that you encountered in the DMD course. There is a continuous random variable called the dependent variable, $Y$, and a number of independent variables, $x_1, x_2, \ldots, x_p$. Our purpose is to predict the value of the dependent variable (also referred to as the response variable) using a linear function of the independent variables. The values of the independent variables (also referred to as predictor variables, regressors, or covariates) are known quantities for purposes of prediction. The model is

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \tag{2.1} \]
where $\beta_0, \beta_1, \ldots, \beta_p$ are unknown coefficients and $\varepsilon$ is an unobserved random variable (the noise) with mean zero and variance $\sigma^2$.

The coefficients are estimated from a sample of $n$ observations by choosing the values that minimize the sum of squared differences between the observed values of $Y$ and the fitted (predicted) values at the observed values in the data. The sum of squared differences is given by

\[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 . \]
Let us denote the values of the coefficients that minimize this expression by $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$. These are our estimates for the unknown values and are called OLS (ordinary least squares) estimates in the literature. Once we have computed the estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$, we can calculate an unbiased estimate $\hat{\sigma}^2$ for $\sigma^2$ using the formula:

\[ \hat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip})^2 . \]
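To make the computation concrete, here is a minimal sketch in Python (using numpy; the function and variable names are our own, not part of the original text) that computes the OLS estimates and the unbiased estimate of $\sigma^2$ from the formula above:

    import numpy as np

    def ols_fit(X, y):
        """OLS estimates and the unbiased sigma^2 estimate.
        X: (n, p) array of independent variables; y: (n,) dependent variable."""
        n, p = X.shape
        # Prepend a column of ones so that beta_hat[0] is the constant term.
        X1 = np.column_stack([np.ones(n), X])
        # Least squares: minimizes the sum of squared residuals.
        beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
        residuals = y - X1 @ beta_hat
        # Unbiased estimate of sigma^2 divides by n - p - 1, as in the text.
        sigma2_hat = residuals @ residuals / (n - p - 1)
        return beta_hat, sigma2_hat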
2.2 Illustration of the Regression Process

We illustrate the regression process with an example. The data consist of ratings of 30 departments in a large organization on seven variables; the minimum value of each variable is 25 and the maximum value is 125. These ratings are answers to survey questions given to a sample of 25 clerks in each of 30 departments. The purpose of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness of departments, thus saving the considerable effort required to measure effectiveness directly. The variables are answers to questions on the survey and are described below.
Y   Measure of effectiveness of supervisor.
X1  Handles employee complaints.
X2  Does not allow special privileges.
X3  Opportunity to learn new things.
X4  Raises based on performance.
X5  Too critical of poor performance.
X6  Rate of advancing to better jobs.
The multiple linear regression estimates as computed by the StatCalc add-in to Excel are reported in Table 2.2. The equation to predict performance is

\[ \hat{Y} = 13.182 + 0.583 X_1 - 0.044 X_2 + 0.329 X_3 - 0.057 X_4 + 0.112 X_5 - 0.197 X_6 . \]
In Table 2.3 we use ten more cases as the validation data. Applying the previous equation to the validation data gives the predictions and errors shown in Table 2.3. The last column, entitled "error," is simply the difference of the predicted minus the actual rating. For example, for Case 21 the error is 44.46 - 50 = -5.54.

We note that the average error in the predictions is small (-0.52), and so the predictions are unbiased. Further, the errors are roughly normally distributed, so this model gives prediction errors that are within 14.34 (two standard deviations) of the true value approximately 95% of the time.
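As a concrete check of the arithmetic, the following sketch (plain Python; the rounded coefficients are taken from Table 2.2) reproduces the prediction and error for Case 21:

    # Rounded coefficients from Table 2.2: constant, then X1..X6.
    coef = [13.182, 0.583, -0.044, 0.329, -0.057, 0.112, -0.197]

    def predict(x):
        """x is a list of the six ratings [X1, ..., X6]."""
        return coef[0] + sum(b * v for b, v in zip(coef[1:], x))

    # Case 21 from Table 2.3: X1..X6 = 40, 33, 34, 43, 64, 33; actual Y = 50.
    y_hat = predict([40, 33, 34, 43, 64, 33])
    print(round(y_hat, 2))       # 44.45 (Table 2.3 reports 44.46 from unrounded coefficients)
    print(round(y_hat - 50, 2))  # -5.55, i.e. the error of about -5.54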
Table 2.1: Training data (Cases 1-20)

Case   Y   X1  X2  X3  X4  X5  X6
 1    43   51  30  39  61  92  45
 2    63   64  51  54  63  73  47
 3    71   70  68  69  76  86  48
 4    61   63  45  47  54  84  35
 5    81   78  56  66  71  83  47
 6    43   55  49  44  54  49  34
 7    58   67  42  56  66  68  35
 8    71   75  50  55  70  66  41
 9    72   82  72  67  71  83  31
10    67   61  45  47  62  80  41
11    64   53  53  58  58  67  34
12    67   60  47  39  59  74  41
13    69   62  57  42  55  63  25
14    68   83  83  45  59  77  35
15    77   77  54  72  79  77  46
16    81   90  50  72  60  54  36
17    74   85  64  69  79  79  63
18    65   60  65  75  55  80  60
19    65   70  46  57  75  85  46
20    50   58  68  54  64  78  52
Table 2.2: Multiple linear regression output for the training data

            Coefficient  StdError  t-statistic  p-value
Constant     13.182      16.746     0.787       0.445
X1            0.583       0.232     2.513       0.026
X2           -0.044       0.167    -0.263       0.797
X3            0.329       0.219     1.501       0.157
X4           -0.057       0.317    -0.180       0.860
X5            0.112       0.196     0.570       0.578
X6           -0.197       0.247    -0.798       0.439

Multiple R-squared:   0.656
Residual SS:        738.900
Std. Dev. Estimate:   7.539
2.3 Subset Selection in Linear Regression

There are several reasons for preferring models that use only a subset of the independent variables:

- Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. We get better insights into the influence of regressors from models with fewer variables, as the coefficients are more stable for parsimonious models.

- It can be shown that using independent variables that are uncorrelated with the dependent variable will increase the variance of predictions.

- It can be shown that dropping independent variables that have small (non-zero) coefficients can reduce the average error of predictions.

Let us illustrate the last two points using the simple case of two independent variables. The reasoning remains valid in the general situation of more than two independent variables.
2.3.1 Dropping Irrelevant Variables
Suppose that the true equation for $Y$, the dependent variable, is

\[ Y = \beta_1 X_1 + \varepsilon . \tag{2.2} \]
Table 2.3: Validation data (Cases 21-30), predictions, and errors

Case   Y   X1  X2  X3  X4  X5  X6  Prediction  Error
21    50   40  33  34  43  64  33    44.46     -5.54
22    64   61  52  62  66  80  41    63.98     -0.02
23    53   66  52  50  63  80  37    63.91     10.91
24    40   37  42  58  50  57  49    45.87      5.87
25    63   54  42  48  66  75  33    56.75     -6.25
26    66   77  66  63  88  76  72    65.22     -0.78
27    78   75  58  74  80  78  49    73.23     -4.77
28    48   57  44  45  51  83  38    58.19     10.19
29    85   85  71  71  77  74  55    76.05     -8.95
30    82   82  39  59  64  78  39    76.10     -5.90
Averages:                            62.38     -0.52
Std Devs:                            11.30      7.17
Suppose, however, that we estimate a model that also includes a second, irrelevant variable $X_2$:

\[ Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon . \tag{2.3} \]

If we fit Model (2.3) to the data, the least squares estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ have the following expected values and variances:

\[ E(\hat{\beta}_1) = \beta_1, \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{(1 - R_{12}^2) \sum_{i=1}^{n} x_{i1}^2}, \]

\[ E(\hat{\beta}_2) = 0, \qquad Var(\hat{\beta}_2) = \frac{\sigma^2}{(1 - R_{12}^2) \sum_{i=1}^{n} x_{i2}^2}, \]

where $R_{12}$ is the correlation between $X_1$ and $X_2$.
The variance is the expected value of the squared error for an unbiased estimator, so we are worse off using the irrelevant estimator in making predictions. Even if $X_2$ happens to be uncorrelated with $X_1$, so that $R_{12}^2 = 0$ and the variance of $\hat{\beta}_1$ is the same in both models, we can show that the variance of a prediction based on Model (2.3) will be worse than a prediction based on Model (2.2), due to the added variability introduced by the estimation of $\beta_2$.
Although our analysis has been based on one useful independent variable and one irrelevant independent variable, the result holds true in general: it is always better to make predictions with models that do not include irrelevant variables.
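This effect can be checked with a small Monte Carlo simulation. The sketch below (Python with numpy; the sample size, coefficient, noise level, and prediction point are arbitrary illustrative choices) generates data from Model (2.2), fits both models, and compares the variance of the two predictions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta1, sigma = 30, 1.0, 1.0
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)   # irrelevant: it does not enter the true model
    u1, u2 = 1.0, 0.5         # point at which we predict

    preds2, preds3 = [], []
    for _ in range(10000):
        y = beta1 * x1 + sigma * rng.normal(size=n)   # true Model (2.2)
        # Model (2.2): regress y on x1 alone (no constant, as in the text).
        b1 = (x1 @ y) / (x1 @ x1)
        preds2.append(b1 * u1)
        # Model (2.3): regress y on x1 and the irrelevant x2.
        X = np.column_stack([x1, x2])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds3.append(b[0] * u1 + b[1] * u2)

    # The prediction from the larger model has the higher variance.
    print(np.var(preds2), np.var(preds3))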
2.3.2 Dropping Variables with Small Coefficients
Suppose now that the situation is the reverse of what we have discussed above, namely that Model (2.3) is the correct equation, but we use Model (2.2) for our estimates and predictions, ignoring the variable $X_2$. To keep the results simple, let us suppose that we have scaled the values of $X_1$, $X_2$, and $Y$ so that their variances are equal to 1. In this case the least squares estimate $\hat{\beta}_1$ has the following expected value and variance:

\[ E(\hat{\beta}_1) = \beta_1 + R_{12} \beta_2, \qquad Var(\hat{\beta}_1) = \sigma^2 . \]
Notice that $\hat{\beta}_1$ is now a biased estimator of $\beta_1$, with bias equal to $R_{12}\beta_2$, and its mean squared error is given by:

\[
\begin{aligned}
MSE(\hat{\beta}_1) &= E[(\hat{\beta}_1 - \beta_1)^2] \\
 &= E[\{\hat{\beta}_1 - E(\hat{\beta}_1) + E(\hat{\beta}_1) - \beta_1\}^2] \\
 &= [Bias(\hat{\beta}_1)]^2 + Var(\hat{\beta}_1) \\
 &= (R_{12}\beta_2)^2 + \sigma^2 .
\end{aligned}
\]
If we use Model (2.3), the least squares estimates have the following expected values and variances:

\[ E(\hat{\beta}_1) = \beta_1, \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{1 - R_{12}^2}, \]

\[ E(\hat{\beta}_2) = \beta_2, \qquad Var(\hat{\beta}_2) = \frac{\sigma^2}{1 - R_{12}^2} . \]
Now consider making a prediction at the point $x_1 = u_1$, $x_2 = u_2$. Using Model (2.3), the prediction is unbiased and its mean squared error is

\[
\begin{aligned}
MSE_3(\hat{Y}) &= Var(u_1\hat{\beta}_1 + u_2\hat{\beta}_2) + \sigma^2, \quad \text{because now } \hat{Y} \text{ is unbiased,} \\
 &= u_1^2 \, Var(\hat{\beta}_1) + u_2^2 \, Var(\hat{\beta}_2) + 2 u_1 u_2 \, Covar(\hat{\beta}_1, \hat{\beta}_2) + \sigma^2 .
\end{aligned}
\]
Model (2.2) can lead to a lower mean squared error for many combinations of values of $u_1$, $u_2$, $R_{12}$, and $(\beta_2/\sigma)^2$. For example, if $u_1 = 1$ and $u_2 = 0$, then $MSE_2(\hat{Y}) < MSE_3(\hat{Y})$ when

\[ (R_{12}\beta_2)^2 + \sigma^2 < \frac{\sigma^2}{1 - R_{12}^2}, \]

i.e., when

\[ \frac{|\beta_2|}{\sigma} < \frac{1}{\sqrt{1 - R_{12}^2}} . \]
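As a quick numeric illustration of this condition (with hypothetical values $R_{12} = 0.5$, $\beta_2 = 0.8$, and $\sigma = 1$), the sketch below evaluates the two sides of the comparison at $u_1 = 1$, $u_2 = 0$:

    import math

    R12, beta2, sigma = 0.5, 0.8, 1.0   # hypothetical values

    mse2 = (R12 * beta2) ** 2 + sigma ** 2   # biased Model (2.2) side
    mse3 = sigma ** 2 / (1 - R12 ** 2)       # unbiased Model (2.3) side
    print(mse2 < mse3)                       # True: 1.16 < 1.333...

    # Equivalent check via |beta2|/sigma < 1/sqrt(1 - R12^2):
    print(abs(beta2) / sigma < 1 / math.sqrt(1 - R12 ** 2))   # True: 0.8 < 1.155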
2.3.3 Algorithms for Subset Selection

Forward Selection

1. Start with the empty set of variables, S.

2. For each variable $i$ not in S, compute the partial $F$ statistic $F_i$ that measures the improvement in fit from adding $i$ to S. If the largest such $F_i$ exceeds a threshold $F_{in}$ (typically between 2 and 4), add the corresponding variable to S.

3. Repeat 2 until no variable can be added.

Backward Elimination

1. Start with the set S containing all the independent variables.

2. For each variable $i$ in S, compute the partial $F$ statistic

\[ F_i = \frac{SSR(S - \{i\}) - SSR(S)}{\hat{\sigma}^2(S)}, \]

where $SSR(S)$ is the residual sum of squares of the model based on S and $\hat{\sigma}^2(S)$ is the corresponding unbiased estimate of $\sigma^2$. If $F_i < F_{out}$, where $F_{out}$ is a threshold (typically between 2 and 4), then drop $i$ from S.

3. Repeat 2 until no variable can be dropped.
Backward Elimination has the advantage that all variables are included in S at some stage. This addresses a problem with Forward Selection, which will never select a variable that is better than a previously selected variable that is strongly correlated with it. The disadvantage is that the full model with all variables is required at the start, and this can be time-consuming and numerically unstable.
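A minimal sketch of Backward Elimination in Python follows (numpy; the helper names are our own, and the partial-$F$ form shown in step 2 above is assumed):

    import numpy as np

    def ssr(X, y, cols):
        """Residual sum of squares for the model: constant + the columns in cols."""
        Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        return r @ r

    def backward_elimination(X, y, f_out=2.71):
        n = len(y)
        S = list(range(X.shape[1]))              # start with all variables
        while S:
            sigma2 = ssr(X, y, S) / (n - len(S) - 1)
            # Partial F statistic for dropping each variable i from S.
            F = {i: (ssr(X, y, [j for j in S if j != i]) - ssr(X, y, S)) / sigma2
                 for i in S}
            i_min = min(F, key=F.get)
            if F[i_min] >= f_out:                # no variable can be dropped
                break
            S.remove(i_min)
        return S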
Step-wise Regression
This procedure is like Forward Selection except that at each step we consider
dropping variables as in Backward Elimination.
Convergence is guaranteed if the thresholds $F_{out}$ and $F_{in}$ satisfy $F_{out} < F_{in}$. It is possible, however, for a variable to enter S and then leave S at a subsequent step, and even rejoin S at a yet later step.
As stated above, these methods pick one best subset. There are straightforward variations of the methods that identify several close-to-best choices for different sizes of independent variable subsets.

None of the above methods guarantees finding the best subset for a criterion such as adjusted $R^2$ (defined later in this note). They are reasonable methods for situations with large numbers of independent variables, but for moderate numbers of independent variables the method discussed next is preferable.
All Subsets Regression

The idea here is to evaluate all subsets. Efficient implementations use branch-and-bound algorithms of the type you have seen in DMD for integer programming to avoid explicitly enumerating all subsets. (In fact, the subset selection problem can be set up as a quadratic integer program.) We compute a criterion such as $R^2_{adj}$, the adjusted $R^2$, for all subsets and choose the best one. (This is only feasible if $p$ is less than about 20.)
2.3.4 Criteria for Comparing Subsets
All Subsets Regression (as well as modifications of the heuristic algorithms) will produce a number of candidate subsets. Since the number of subsets for even moderate values of $p$ is very large, we need some way to examine the most promising subsets and to select from them. An intuitive metric for comparing subsets is $R^2$.
However, since $R^2 = 1 - SSR/SST$, where $SST$, the Total Sum of Squares, is the sum of squared residuals for the model with just the constant term, if we use it as a criterion we will always pick the full model with all $p$ variables. One approach is therefore to select the subset with the largest $R^2$ for each possible number of independent variables, $k$, and then compare subsets of different sizes using a criterion that penalizes model size, such as the adjusted $R^2$:

\[ R^2_{adj} = 1 - \frac{n - 1}{n - k - 1}(1 - R^2) . \]
Another popular criterion is Mallows' $C_p$:

\[ C_p = \frac{SSR}{\hat{\sigma}^2_{Full}} + 2k - n, \]

where $\hat{\sigma}^2_{Full}$ is the estimated value of $\sigma^2$ in the full model that includes all the variables, and $k$ here counts all the coefficients in the subset model, including the constant term (the size reported in the first column of Table 2.4).
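For small $p$ the exhaustive evaluation is straightforward. The sketch below (Python; it reuses the `ssr` helper from the Backward Elimination sketch above, and uses plain enumeration in place of branch and bound) scores every subset on both criteria:

    from itertools import combinations

    def score_all_subsets(X, y):
        n, p = X.shape
        sst = ssr(X, y, [])                    # SSR of the constant-only model
        sigma2_full = ssr(X, y, list(range(p))) / (n - p - 1)
        results = []
        for k in range(1, p + 1):              # k = number of independent variables
            for cols in combinations(range(p), k):
                s = ssr(X, y, list(cols))
                r2 = 1 - s / sst
                r2_adj = 1 - (n - 1) / (n - k - 1) * (1 - r2)
                cp = s / sigma2_full + 2 * (k + 1) - n   # k + 1 coefficients incl. constant
                results.append((cols, r2_adj, cp))
        # Sort by adjusted R^2, best first.
        return sorted(results, key=lambda t: -t[1])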
Table 2.4: Subset selection output for the supervisor data (SST = 2149.000, Fin = 3.840, Fout = 2.710). The Model column lists the best subset found for each size, where size is the number of coefficients including the constant.

Size  SSR      RSq    RSq(adj)  Cp      Model
2     874.467  0.593  0.570     -0.615  Constant X1
3     786.601  0.634  0.591     -0.161  Constant X1 X3
4     783.970  0.635  0.567      1.793  Constant X1 X3 X6
5     781.089  0.637  0.540      3.742  Constant X1 X3 X5 X6
6     775.094  0.639  0.511      5.637  Constant X1 X2 X3 X5 X6
7     738.900  0.656  0.497      7.000  Constant X1 X2 X3 X4 X5 X6
It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of $\sigma^2$ for the full model. This requires that the training set contain a large number of observations relative to the number of variables. We note that for our example only the subsets of size 6 and 7 seem to be unbiased, as for the other models $C_p$ differs substantially from $k$. This is a consequence of having too few observations to estimate $\sigma^2$ accurately in the full model.