Lecture 18
Extending Linear Regression: Weighted Least Squares, Heteroskedasticity, Local Polynomial Regression
36-350, Data Mining, 23 October 2009
1 Weighted Least Squares
Instead of minimizing the residual sum of squares,

\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i \cdot \beta)^2 \tag{1}
\]

we could minimize the weighted sum of squares,
\[
\mathrm{WSS}(\beta, \vec{w}) = \sum_{i=1}^{n} w_i (y_i - x_i \cdot \beta)^2 \tag{2}
\]

This includes ordinary least squares as the special case where all the weights w_i = 1. We can solve it by the same kind of algebra we used to solve the ordinary linear least squares problem. But why would we want to solve it? For three reasons.
1. Focusing accuracy. We may care more strongly about predicting the response for certain values of the input (ones we expect to see often again, ones where mistakes are especially costly or embarrassing or painful, etc.) than others. If we give the points x_i near that region big weights w_i, and points elsewhere smaller weights, the regression will be pulled towards matching the data in that region.
2. Discounting imprecision. Ordinary least squares is the maximum likelihood estimate when the noise ε in Y = x · β + ε is IID Gaussian white noise. This means that the variance of ε has to be constant, and we measure the regression curve with the same precision everywhere. This situation, of constant noise variance, is called homoskedasticity. Often, however, the magnitude of the noise is not constant, and the data are heteroskedastic. When we have heteroskedasticity, even if each noise term is still Gaussian, ordinary least squares is no longer the maximum likelihood estimate, and so no longer efficient. If, however, we know the noise variance σ_i^2 at each measurement i, and set w_i = 1/σ_i^2, we get the heteroskedastic MLE, and recover efficiency.

To say the same thing slightly differently, there's just no way that we can estimate the regression function as accurately where the noise is large as we can where the noise is small. Trying to give equal attention to all parts of the input space is a waste of time; we should be more concerned about fitting well where the noise is small, and expect to fit poorly where the noise is big.
3. Doing something else. There are a number of other optimization problems which can be transformed into, or approximated by, weighted least squares. The most important of these arises from generalized linear models, where the mean response is some nonlinear function of a linear predictor. (Logistic regression is an example.)

In the first case, we decide on the weights to reflect our priorities. In the third case, the weights come from the optimization problem we'd really rather be solving. What about the second case, of heteroskedasticity?
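To make the algebra concrete, here is a minimal sketch of solving the weighted least squares problem through its normal equations, (X^T W X) β = X^T W y, with W the diagonal matrix of the weights. It is written in Python with NumPy rather than the R these notes use, and the function name `wls` and the simulated data are illustrative choices, not part of the course material. Setting every w_i = 1 recovers ordinary least squares, and w_i = 1/σ_i^2 gives the heteroskedastic MLE discussed above.

```python
import numpy as np

def wls(X, y, w):
    # Minimize sum_i w_i * (y_i - x_i . beta)^2 by solving the
    # weighted normal equations (X^T W X) beta = X^T W y.
    XtW = X.T * w            # scales column i of X.T (observation i) by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-4, 4, n)
X = np.column_stack([np.ones(n), x])       # design matrix: intercept and slope
sigma = 1 + np.abs(x)                      # hypothetical known noise level per point
y = 3 - 2 * x + sigma * rng.standard_normal(n)

beta_ols = wls(X, y, np.ones(n))           # all weights equal: ordinary least squares
beta_wls = wls(X, y, 1 / sigma**2)         # inverse-variance weights: heteroskedastic MLE
```

With all weights equal to one, `wls` agrees with `np.linalg.lstsq`; with inverse-variance weights, the noisier points at large |x| are down-weighted, so the fit leans on the precisely measured points near x = 0.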

[Figure 1: Black line: linear response function (y = 3 - 2x). Grey curve: standard deviation as a function of x (σ(x) = 1 + x^2/2).]
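The setup in Figure 1 is easy to simulate. The following sketch (Python/NumPy standing in for the R used in these notes, with a seed of my own choosing, so it will not reproduce the exact coefficients quoted in the text) draws one such sample, fits it by ordinary least squares, and computes the usual homoskedastic standard errors:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(-5, 5, n)
sigma = 1 + x**2 / 2                            # Figure 1's noise level, sigma(x) = 1 + x^2/2
y = 3 - 2 * x + sigma * rng.standard_normal(n)  # linear response y = 3 - 2x, heteroskedastic noise

# Ordinary least squares fit, ignoring the heteroskedasticity
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors computed as if the noise were homoskedastic;
# these are the kind of numbers a routine regression summary reports.
resid = y - X @ beta_hat
s2 = resid @ resid / (n - 2)                    # estimated common noise variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
```

OLS remains unbiased under heteroskedasticity, so the point estimates are fine on average; it is these homoskedastic standard errors that cannot be trusted.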
2 Heteroskedasticity
Suppose the noise variance is itself variable. For example, the figure shows a simple linear relationship between the input X and the response Y, but also a nonlinear relationship between X and Var[Y].

In this particular case, the ordinary least squares estimate of the regression line is 2.72 - 1.30x, with R reporting standard errors in the coefficients of ±0.52 and ±0.20, respectively. Those are, however, calculated under the assumption that the noise is homoskedastic, which it isn't. And in fact we can see, pretty much, that there is heteroskedasticity; if looking at the scatter-plot didn't convince us, we could always plot the residuals against x, which we should do anyway. To see whether that makes a difference, let's re-do this many times with