Lecture 17

36-350, Data Mining
21 October 2009
Contents
1.1 Collinearity
1.2 Estimating the Optimal Linear Predictor
2.1 Changing Slopes
2.1.1 R^2: Distraction or Nuisance?
2.2 Omitted Variables and Shifting Distributions
2.3 Errors in Variables
2.4 Transformation
3.1 Examine the Residuals
We need to say some more about linear regression, and especially about how it really works and how it can fail. Linear regression is important because

1. it's a fairly straightforward technique which often works reasonably well;
2. it's a simple foundation for some more sophisticated techniques;
3. it's a standard method, so people use it to communicate; and
4. it's a standard method, so people have come to confuse it with prediction and even with causal inference as such.

We need to go over (1)-(3), and provide prophylaxis against (4).

There are many good books available about linear regression. Faraway (2004) is practical and has a lot of useful R stuff. Berk (2004) omits technical details, but is superb on the high-level picture, and especially on what must be assumed in order to do certain things with regression, and what cannot be done under any assumption.
1 Optimal Linear Prediction
We have a response variable Y and a p-dimensional vector of predictor variables or features X. To simplify the book-keeping, we'll take these to be centered — we can always un-center them later. We would like to predict Y using X. We saw last time that the best predictor we could use, at least in a mean-squared sense, is the conditional expectation,

r(x) = E[Y | X = x]    (1)

Let's approximate r(x) by a linear function of x, say x · β. This is not an assumption about the world, but rather a decision on our part; a choice, not a hypothesis. This decision can be good — the approximation can be accurate — even if the linear hypothesis is wrong.

One reason to think it's not a crazy decision is that we may hope r is a smooth function. If it is, then we can Taylor expand it about our favorite point, say u:
r(x) = r(u) + \sum_{i=1}^{p} \left.\frac{\partial r}{\partial x_i}\right|_{u} (x_i - u_i) + O(\|x - u\|^2)    (2)

or, in the more compact vector calculus notation,

r(x) = r(u) + (x - u) \cdot \nabla r(u) + O(\|x - u\|^2)    (3)

If we only look at points x which are close to u, then the remainder terms O(‖x − u‖²) are small, and a linear approximation is a good one.
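As a concrete illustration of this argument (my own sketch, not part of the handout), we can check in R that the error of the linear approximation to an arbitrary smooth function shrinks like the squared distance from the expansion point:

```r
r  <- function(x) sin(x) + 0.5 * x^2    # an arbitrary smooth function
dr <- function(x) cos(x) + x            # its derivative
u  <- 1                                 # the expansion point
linear.approx <- function(x) r(u) + (x - u) * dr(u)

h <- c(0.5, 0.1, 0.02)                                   # distances from u
err <- abs(r(u + h) - linear.approx(u + h))
cbind(h = h, error = err, error.over.h.squared = err / h^2)
# The last column is roughly constant: the error behaves like O(h^2),
# so close to u the linear approximation is accurate.
```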
Of course there are lots of linear functions, so we need to pick one, and we may as well do that by minimizing mean-squared error again:

MSE(\beta) = E[(Y - X \cdot \beta)^2]    (4)

Going through the optimization is parallel to the one-dimensional case (see the last handout), with the conclusion that the optimal β is

\beta = V^{-1} \mathrm{Cov}[X, Y]    (5)

where V is the covariance matrix of X, i.e., V_{ij} = Cov[X_i, X_j], and Cov[X, Y] is the vector of covariances between the predictor variables and Y, i.e., Cov[X, Y]_i = Cov[X_i, Y].
If Z is a random vector with covariance matrix I, then V^{1/2} Z is a random vector with covariance matrix V; conversely, V^{-1/2} X will be a random vector with covariance matrix I. We are using the covariance matrix here to transform the input features into uncorrelated variables, and then taking their correlations with the response — we are decorrelating them.
Notice: β depends on the marginal distribution of X (through the covariance matrix V). If that shifts, the optimal coefficients β will shift, unless the real regression function is linear.
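As a quick numerical check on Eq. 5 (a sketch of mine, not code from the handout), we can simulate correlated, centered predictors and a nonlinear regression function, then recover the optimal linear coefficients from the sample covariances; with a large sample these are essentially the population quantities:

```r
set.seed(17)
n <- 1e5
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X[, 2] <- 0.7 * X[, 1] + sqrt(1 - 0.7^2) * X[, 2]   # make the predictors correlated
X <- scale(X, center = TRUE, scale = FALSE)          # center, as in the text
Y <- exp(X[, 1]) - X[, 2]^3 + rnorm(n)               # a decidedly nonlinear r(x)
Y <- Y - mean(Y)                                     # center the response too

V <- cov(X)                    # covariance matrix of the predictors
covXY <- cov(X, Y)             # covariances of each predictor with the response
beta <- solve(V, covXY)        # Eq. 5: beta = V^{-1} Cov[X, Y]
beta
# Least squares without an intercept on the centered data gives essentially
# the same answer:
coef(lm(Y ~ X - 1))
```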
1.1 Collinearity
The formula β = V^{-1} Cov[X, Y] makes no sense if V has no inverse. This will happen if, and only if, the predictor variables are linearly dependent on each other — if one of the predictors is really a linear combination of the others. Then (think back to what we did with factor analysis) the covariance matrix is of less than full rank (i.e., rank deficient) and it doesn't have an inverse.

So much for the algebra; what does that mean statistically? Let's take an easy case where one of the predictors is just a multiple of another — say you've included people's weight in pounds (X_1) and mass in kilograms (X_2), so X_1 = 2.2 X_2. Then if we try to predict Y, we'd have

\hat{Y} = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p
        = 0 \cdot X_1 + (2.2\beta_1 + \beta_2) X_2 + \sum_{i=3}^{p} \beta_i X_i
        = (\beta_1 + \beta_2/2.2) X_1 + 0 \cdot X_2 + \sum_{i=3}^{p} \beta_i X_i
        = -2200\, X_1 + (4840 + 2.2\beta_1 + \beta_2) X_2 + \sum_{i=3}^{p} \beta_i X_i
In other words, because there's a linear relationship between X_1 and X_2, we can make the coefficient for X_1 whatever we like, provided we make a corresponding adjustment to the coefficient for X_2, and it has no effect at all on our prediction. So rather than having one optimal linear predictor, we have infinitely many of them.

There are two ways of dealing with collinearity. One is to get a different data set where the predictor variables are no longer collinear. The other is to identify one of the collinear variables (it doesn't matter which) and drop it from the data set. This can get complicated; PCA can be helpful here.
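Here is a small R sketch of the pounds-and-kilograms situation (my example, not the handout's): lm() responds to the exact collinearity by reporting NA for one of the coefficients, and shifting weight between the two collinear coefficients leaves the fitted values unchanged:

```r
set.seed(17)
n <- 100
kilos  <- rnorm(n, mean = 70, sd = 10)   # mass in kilograms (X2)
pounds <- 2.2 * kilos                    # weight in pounds (X1) = 2.2 * X2 exactly
y <- 0.5 * kilos + rnorm(n)              # some response

# lm() detects the collinearity and reports NA for one of the coefficients:
coef(lm(y ~ pounds + kilos))

# Trading off the two collinear coefficients does not change the predictions:
# using pounds alone and using kilos alone give identical linear combinations.
X <- cbind(pounds, kilos)
beta.a <- c(1, 0)      # predict with pounds only
beta.b <- c(0, 2.2)    # predict with kilos only; the same linear function
max(abs(X %*% beta.a - X %*% beta.b))    # zero (up to rounding)
```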
1.2 Estimating the Optimal Linear Predictor
To actually estimate β, we need to make some probabilistic assumptions about where the data comes from. The minimal one we can get away with is that the observations (X_i, Y_i) are independent for different values of i, with unchanging covariances. Then if we look at the sample covariances, they will converge on the true covariances.
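The plug-in estimator this suggests — replace V and Cov[X, Y] by their sample versions — can be written in a few lines of R. This is a sketch under the independence assumption just stated, not code from the handout, and estimate.beta is my own name for it:

```r
# Plug sample covariances into beta = V^{-1} Cov[X, Y].
estimate.beta <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)   # center the predictors, as in the text
  Y <- Y - mean(Y)                              # center the response
  V.hat <- cov(X)                               # sample covariance of the predictors
  covXY.hat <- cov(X, Y)                        # sample covariances with the response
  solve(V.hat, covXY.hat)                       # beta.hat = V.hat^{-1} covXY.hat
}

# Example with independent draws, as assumed above:
set.seed(17)
n <- 1000
X <- matrix(rnorm(2 * n), ncol = 2)
Y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
estimate.beta(X, Y)   # close to (3, -2), and gets closer as n grows
```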
