

Published on STAT 501 (https://onlinecourses.science.psu.edu/stat501)

Lesson 13: Weighted Least Squares &
Robust Regression
So far we have utilized ordinary least squares for estimating the regression line. However,
aspects of the data (such as nonconstant variance or outliers) may require a different method
for estimating the regression line. This lesson provides an introduction to some of the other
available methods for estimating regression lines. To help with the discussions in this lesson,
recall that the ordinary least squares estimate is

\(b_{OLS} = (X^{T}X)^{-1}X^{T}y\)

Because of the alternative estimates to be introduced, the ordinary least squares estimate is written here as \(b_{OLS}\) instead of b.

13.1 - Weighted Least Squares
The method of ordinary least squares assumes that there is constant variance in the errors
(which is called homoscedasticity). The method of weighted least squares can be used
when the ordinary least squares assumption of constant variance in the errors is violated
(which is called heteroscedasticity). The model under consideration is

\(Y = X\beta + \epsilon\),

where now \(\epsilon\) is assumed to be (multivariate) normally distributed with mean vector 0 and nonconstant variance-covariance matrix

\(\begin{pmatrix} \sigma^2_1 & 0 & \cdots & 0 \\ 0 & \sigma^2_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_n \end{pmatrix}\)

If we define the reciprocal of each variance, \(\sigma^2_i\), as the weight \(w_i = 1/\sigma^2_i\), then let matrix W be a diagonal matrix containing these weights:

\(W = \mathrm{diag}(w_1, w_2, \ldots, w_n)\)

1 of 18 11-02-2018, 02:53

The weighted least squares estimate is then

\(b_{WLS} = (X^{T}WX)^{-1}X^{T}Wy\)

With this setting, we can make a few observations:

- Since each weight is inversely proportional to the error variance, it reflects the information in that observation. So, an observation with small error variance has a large weight since it contains relatively more information than an observation with large error variance (small weight).
- The weights have to be known (or, more usually, estimated) up to a proportionality constant.

To illustrate, consider the famous 1877 galton.txt [1] dataset, consisting of 7 measurements each of X = Parent (pea diameter in inches of parent plant) and Y = Progeny (average pea diameter in inches of up to 10 plants grown from seeds of the parent plant). Also included in the dataset are standard deviations, SD, of the offspring peas grown from each parent. These standard deviations reflect the information in the response Y values (remember these are averages), so in estimating a regression model we should downweight the observations with a large standard deviation and upweight the observations with a small standard deviation. In other words, we should use weighted least squares with weights equal to \(1/SD^{2}\).

The resulting fitted equation from Minitab for this model is:

Progeny = 0.12796 + 0.2048 Parent

Compare this with the fitted equation for the ordinary least squares model:

Progeny = 0.12703 + 0.2100 Parent

The equations aren't very different, but we can gain some intuition into the effects of using weighted least squares by looking at a scatterplot of the data with the two regression lines superimposed:

[Scatterplot of Progeny versus Parent with the OLS and WLS fitted lines superimposed]
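As a small numerical illustration of the weighted least squares estimate \(b_{WLS} = (X^{T}WX)^{-1}X^{T}Wy\), here is a minimal numpy sketch. The data are synthetic heteroscedastic observations with known per-observation SDs (illustrative values only, not the galton.txt data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heteroscedastic data: the error SD grows with x,
# so observations at larger x carry less information.
n = 50
x = np.linspace(1.0, 10.0, n)
sd = 0.2 * x                          # known per-observation error SDs
y = 1.0 + 2.0 * x + rng.normal(0.0, sd)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept column
W = np.diag(1.0 / sd**2)              # weights w_i = 1 / sigma_i^2

# Weighted least squares: b_WLS = (X^T W X)^{-1} X^T W y
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Ordinary least squares for comparison: b_OLS = (X^T X)^{-1} X^T y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS:", b_ols, " WLS:", b_wls)
```

Both fits recover the underlying line here; they differ in how much each observation is allowed to influence the estimate.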

The black line represents the OLS fit, while the red line represents the WLS fit. The standard deviations tend to increase as the value of Parent increases, so the weights tend to decrease as the value of Parent increases. Thus, on the left of the graph, where the observations are upweighted, the red fitted line is pulled slightly closer to the data points, whereas on the right of the graph, where the observations are downweighted, the red fitted line is slightly further from the data points.

For this example the weights were known. There are other circumstances where the weights are known:

- If the i-th response is an average of \(n_i\) equally variable observations, then \(Var(y_i) = \sigma^2/n_i\) and \(w_i = n_i\).
- If the i-th response is a total of \(n_i\) observations, then \(Var(y_i) = n_i\sigma^2\) and \(w_i = 1/n_i\).
- If variance is proportional to some predictor \(x_i\), then \(Var(y_i) = x_i\sigma^2\) and \(w_i = 1/x_i\).

In practice, for other types of dataset, the structure of W is usually unknown, so we have to perform an ordinary least squares (OLS) regression first. Provided the regression function is appropriate, the i-th squared residual from the OLS fit is an estimate of \(\sigma_i^2\) and the i-th absolute residual is an estimate of \(\sigma_i\) (which tends to be a more useful estimator in the presence of outliers). The residuals are much too variable to be used directly in estimating the weights, so instead we use either the squared residuals to estimate a variance function or the absolute residuals to estimate a standard deviation function. We then use this variance or standard deviation function to estimate the weights.

Some possible variance and standard deviation function estimates include:

- If a residual plot against a predictor exhibits a megaphone shape, then regress the absolute values of the residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_i\). (And remember \(w_i = 1/\sigma_i^2\).)
- If a residual plot against the fitted values exhibits a megaphone shape, then regress the absolute values of the residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_i\).
- If a residual plot of the squared residuals against a predictor exhibits an upward trend, then regress the squared residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_i^2\).
- If a residual plot of the squared residuals against the fitted values exhibits an upward trend, then regress the squared residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_i^2\).
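To make the first of these recipes concrete (a megaphone pattern against a predictor), here is a Python sketch on made-up data whose error SD grows with the predictor. It regresses the absolute OLS residuals on the predictor and converts the fitted SDs into weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with error SD proportional to the predictor
# (the classic "megaphone" residual pattern).
n = 200
x = rng.uniform(1.0, 10.0, n)
y = 3.0 + 1.5 * x + rng.normal(0.0, 0.4 * x)

X = np.column_stack([np.ones(n), x])

# Step 1: ordinary least squares fit.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
abs_resid = np.abs(y - X @ b_ols)

# Step 2: regress the absolute residuals on the predictor;
# the fitted values estimate the error SDs sigma_i.
c, *_ = np.linalg.lstsq(X, abs_resid, rcond=None)
sd_hat = np.maximum(X @ c, 1e-6)   # clamp so no weight blows up or goes negative

# Step 3: weights w_i = 1 / sigma_hat_i^2, then refit by WLS.
w = 1.0 / sd_hat**2
b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print("OLS:", b_ols, " WLS:", b_wls)
```

The clamp on `sd_hat` is a practical safeguard added for this sketch; the course text does not discuss it.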

After using one of these methods to estimate the weights, \(w_i\), we then use these weights in estimating a weighted least squares regression model. We consider some examples of this approach in the next section.

Some key points regarding weighted least squares are:

1. The difficulty, in practice, is determining estimates of the error variances (or standard deviations).
2. Weighted least squares estimates of the coefficients will usually be nearly the same as the "ordinary" unweighted estimates. In cases where they differ substantially, the procedure can be iterated until estimated coefficients stabilize (often in no more than one or two iterations); this is called iteratively reweighted least squares.
3. In some cases, the values of the weights may be based on theory or prior research.
4. In designed experiments with large numbers of replicates, weights can be estimated directly from sample variances of the response variable at each combination of predictor variables.
5. Use of weights will (legitimately) impact the widths of statistical intervals.

13.2 - Weighted Least Squares Examples

Example 1: Computer-Assisted Learning Dataset

The data below (ca_learning_new.txt [2]) were collected from a study of computer-assisted learning by n = 12 students.

i    Responses    Cost
1    16           77
2    14           70
3    22           85
4    10           50
5    14           62
6    17           70
7    10           55
8    13           63
9    19           88
10   12           57

11   18           81
12   11           51

The response is the cost of the computer time (Y) and the predictor is the total number of responses in completing a lesson (X).

A scatterplot of the data is given below. From this scatterplot, a simple linear regression seems appropriate for explaining this relationship.

[Scatterplot of Cost versus Responses]

First an ordinary least squares line is fit to these data. Below is the summary of the simple linear regression fit:

[Minitab summary output for the OLS fit of Cost on Responses]

A plot of the residuals versus the predictor values indicates possible nonconstant variance since there is a very slight "megaphone" pattern:

[Plot of the OLS residuals versus Responses]

We will turn to weighted least squares to address this possibility. The weights we will use will be based on regressing the absolute residuals versus the predictor. In Minitab we can use the Storage button in the Regression Dialog to store the residuals, and then use Calc > Calculator to calculate the absolute residuals. A plot of the absolute residuals versus the predictor values is as follows:

[Plot of the absolute residuals versus Responses]

Specifically, we fit this absolute-residual regression, use the Storage button to store its fitted values, and then use Calc > Calculator to define the weights as 1 over the squared fitted values. Then we fit a weighted least squares regression model by fitting a linear regression model in the usual way, but clicking "Options" in the Regression Dialog and selecting the just-created weights as "Weights." The summary of this weighted least squares fit is as follows:

[Minitab summary output for the WLS fit]

Notice that the regression estimates have not changed much from the ordinary least squares method. The following plot shows both the OLS fitted line (black) and WLS fitted line (red) overlaid on the same scatterplot.

[Scatterplot of Cost versus Responses with the OLS (black) and WLS (red) fitted lines]

A plot of the studentized residuals (remember Minitab calls these "standardized" residuals) versus the predictor values when using the weighted least squares method shows how we have corrected for the megaphone shape, since the studentized residuals appear to be more randomly scattered about 0:

[Plot of the studentized residuals versus Responses for the WLS fit]

With weighted least squares, it is crucial that we use studentized residuals to evaluate the aptness of the model, since these take into account the weights that are used to model the changing variance. The usual residuals don't do this and will maintain the same nonconstant variance pattern no matter what weights have been used in the analysis.

Example 2: Market Share Data

Here we have market share data for n = 36 consecutive months (market_share.txt [3]). Let Y = market share of the product, X1 = price, X3 = 1 if discount promotion in effect and 0 otherwise, and X3X4 = 1 if both discount and package promotions in effect and 0 otherwise.

The regression results below are for a useful model in this situation:

[Minitab regression output for the model with X1, X3, and X3X4]

A residual plot suggests nonconstant variance related to the value of X3:

[Plot of the residuals versus X3]

From this plot, it is apparent that the values coded as 0 have a smaller variance than the values coded as 1. The residual variances for the two separate groups defined by the discount pricing variable are 0.011 (for X3 = 0) and 0.027 (for X3 = 1).

Because of this nonconstant variance, we will perform a weighted least squares analysis. For the weights, we use \(w_i = 1/\hat{\sigma}_i^2\) for i = 1, 2 (in Minitab use Calc > Calculator and define "weight" as 'x3'/0.027 + (1-'x3')/0.011). The results of the weighted least squares analysis (set the just-defined "weight" variable as "weights" under Options in the Regression dialog) are as follows:

[Minitab output for the WLS fit]

An important note is that Minitab's ANOVA will be in terms of the weighted SS. When doing a weighted least squares analysis, you should note how different the SS values of the weighted case are from the SS values for the unweighted case. Also, note how the regression coefficients of the weighted case are not much different from those in the unweighted case. Thus, there may not be much of an obvious benefit to using the weighted analysis (although intervals are going to be more reflective of the data).
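The Minitab calculator expression translates directly into code. Below is a small Python sketch using simulated data, with a 0/1 grouping variable standing in for the discount indicator (these are not the market_share.txt values): the per-group residual variances become weights \(w_i = 1/\hat{\sigma}^2_{\text{group}(i)}\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative stand-in for the market-share setting: a 0/1 indicator
# (like X3, the discount flag) defines two groups with different error SDs.
n = 120
x3 = rng.integers(0, 2, n).astype(float)
y = 2.5 + 0.8 * x3 + rng.normal(0.0, np.where(x3 == 1, 0.16, 0.10))

X = np.column_stack([np.ones(n), x3])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_ols

# Per-group residual variances, then weights w_i = 1 / var(group of i):
# the same recipe as Minitab's  'x3'/var1 + (1-'x3')/var0  expression.
var0 = resid[x3 == 0].var()
var1 = resid[x3 == 1].var()
w = x3 / var1 + (1.0 - x3) / var0

b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print("OLS:", b_ols, " WLS:", b_wls)
```

As the lesson notes, the coefficients change little; the main effect of the weights is on the standard errors and intervals.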

If you proceed with a weighted least squares analysis, you should check a plot of the residuals again. Remember to use the studentized residuals when doing so! For this example, the plot of studentized residuals after doing a weighted least squares analysis is given below, and the residuals look okay (remember Minitab calls these standardized residuals).

[Plot of the studentized residuals for the WLS fit]

Example 3: Home Price Dataset

This dataset (home_price.txt [4]) has the following variables:

Y = sale price of home
X1 = square footage of home
X2 = square footage of the lot

Since all the variables are highly skewed, we first transform each variable to its natural logarithm. Then when we perform a regression analysis and look at a plot of the residuals versus the fitted values (see below), we note a slight "megaphone" or "conic" shape of the residuals:

[Minitab OLS output for the regression of logY on logX1 and logX2 (Term, Coef, SE Coef, T-Value, P-Value, VIF) and the residuals-versus-fits plot]

We interpret this plot as having a mild pattern of nonconstant variance in which the amount of variation is related to the size of the mean (which are the fits). So, we use the following procedure to determine appropriate weights:

1. Store the residuals and the fitted values from the ordinary least squares (OLS) regression.
2. Calculate the absolute values of the OLS residuals.
3. Regress the absolute values of the OLS residuals versus the OLS fitted values and store the fitted values from this regression. These fitted values are estimates of the error standard deviations.
4. Calculate weights equal to \(1/fits^2\), where "fits" are the fitted values from the regression in the last step.

We then refit the original regression model, but using these weights in a weighted least squares (WLS) regression. Results and a residual plot for this WLS model:

[Minitab WLS output (Term, Coef, SE Coef, T-Value, P-Value, VIF) and the corresponding residual plot]
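The four steps above can be collected into one small function. This is a sketch in Python on simulated data (not the home_price.txt values), constructed so that the error SD grows with the mean response, as in the megaphone plot:

```python
import numpy as np

def variance_function_weights(X, y):
    """The four-step recipe: OLS fit -> |residuals| -> regress |residuals|
    on the OLS fitted values -> weights = 1 / fits^2."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b_ols
    abs_resid = np.abs(y - fitted)
    # Regress |residuals| on the OLS fitted values (with intercept).
    Z = np.column_stack([np.ones_like(fitted), fitted])
    c, *_ = np.linalg.lstsq(Z, abs_resid, rcond=None)
    sd_fits = np.maximum(Z @ c, 1e-8)   # estimated error SDs, clamped positive
    return 1.0 / sd_fits**2

rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(6.0, 8.0, n)          # e.g. a log-scale predictor
x2 = rng.uniform(7.0, 9.0, n)
mean = 1.0 + 1.2 * x1 + 0.1 * x2
y = mean + rng.normal(0.0, 0.05 * mean)  # variation grows with the mean
X = np.column_stack([np.ones(n), x1, x2])

w = variance_function_weights(X, y)
b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(b_wls)
```

The clamp on `sd_fits` is a practical guard for this sketch, not part of the lesson's procedure.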

13.3 - Robust Regression Methods

Note that the material in Sections 13.3-13.6 is considerably more technical than the preceding lessons. It is offered as an introduction to this advanced topic and, given the technical nature of the material, it could be considered optional in the context of this course.

The ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid. When some of these assumptions are invalid, least squares regression can perform poorly. Residual diagnostics can help guide you to where the breakdown in assumptions occurs, but can be time consuming and sometimes difficult to the untrained eye. Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods attempt to dampen the influence of outlying cases in order to provide a better fit to the majority of the data.

Outliers have a tendency to pull the least squares fit too far in their direction by receiving much more "weight" than they deserve. Typically, you would expect that the weight attached to each observation would be on average 1/n in a data set with n observations. However, outliers may receive considerably more weight, leading to distorted estimates of the regression coefficients. This distortion results in outliers which are difficult to identify, since their residuals are much smaller than they would otherwise be (if the distortion wasn't present). As we have seen, scatterplots may be used to assess outliers when a small number of predictors are present, but the complexity added by additional predictor variables can hide the outliers from view in these scatterplots. Robust regression down-weights the influence of outliers, which makes their residuals larger and easier to identify.

For our first robust regression method, suppose we have a data set of size n such that

\(y_i = x_i^{T}\beta + \epsilon_i(\beta)\), for \(i = 1, \ldots, n\).

Here we have rewritten the error term as \(\epsilon_i(\beta)\) to reflect the error term's dependency on the regression coefficients. Ordinary least squares is sometimes known as \(L_2\)-norm regression since it is minimizing the \(L_2\)-norm of the residuals (i.e., the squares of the residuals).

Thus, observations with high residuals (and high squared residuals) will pull the least squares fit more in that direction. An alternative is to use what is sometimes known as least absolute deviation (or \(L_1\)-norm) regression, which minimizes the \(L_1\)-norm of the residuals (i.e., the absolute values of the residuals). Formally defined, the least absolute deviation estimator is

\(b_{LAD} = \arg\min_{\beta}\sum_{i=1}^{n}|\epsilon_i(\beta)|\),

which in turn minimizes the absolute values of the residuals.

Another quite common robust regression method falls into a class of estimators called M-estimators (and there are also other related classes, such as R-estimators and S-estimators, whose properties we will not explore). M-estimators attempt to minimize the sum of a chosen function \(\rho(\cdot)\) acting on the residuals. Formally defined, M-estimators are given by

\(b_{M} = \arg\min_{\beta}\sum_{i=1}^{n}\rho(\epsilon_i(\beta))\).

The M stands for "maximum likelihood" since \(\rho(\cdot)\) is related to the likelihood function for a suitable assumed residual distribution. Notice that, if assuming normality, then \(\rho(z) = z^2\) results in the ordinary least squares estimate.

Some M-estimators are influenced by the scale of the residuals, so a scale-invariant version of the M-estimator is used:

\(b_{M} = \arg\min_{\beta}\sum_{i=1}^{n}\rho\!\left(\frac{\epsilon_i(\beta)}{\tau}\right)\),

where \(\tau\) is a measure of the scale. An estimate of \(\tau\) is given by

\(\hat{\tau} = \frac{\mathrm{med}_i\,|\epsilon_i(b) - \tilde{\epsilon}(b)|}{0.6745}\),

where \(\tilde{\epsilon}(b)\) is the median of the residuals. Minimization of the above is accomplished primarily in two steps:

1. Set \(\partial\rho/\partial\beta_j = 0\) for each \(j = 0, 1, \ldots, p-1\), resulting in a set of p nonlinear equations \(\sum_{i=1}^{n} x_{i,j}\,\psi(\epsilon_i(\beta)) = 0\), where \(\psi(\cdot)\), the derivative of \(\rho(\cdot)\), is called the influence function.
2. A numerical method called iteratively reweighted least squares (IRLS) (mentioned in Section 13.1) is used to iteratively estimate the weighted least squares estimate until a stopping criterion is met.
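The scale estimate \(\hat{\tau}\) is just the median absolute deviation of the residuals divided by 0.6745. A quick numpy check on simulated residuals shows that it recovers the error SD for normal residuals and is barely moved by gross outliers:

```python
import numpy as np

def scale_estimate(resid):
    """Scale estimate used for M-estimation:
    tau_hat = median_i |e_i - median(e)| / 0.6745  (MAD-based)."""
    return np.median(np.abs(resid - np.median(resid))) / 0.6745

rng = np.random.default_rng(4)
e = rng.normal(0.0, 2.0, 100_000)                   # residuals with true scale 2
e_out = np.concatenate([e, np.full(1000, 100.0)])   # add ~1% gross outliers

print(scale_estimate(e), scale_estimate(e_out))     # both close to 2
```

The divisor 0.6745 (the 0.75 quantile of the standard normal) makes the estimate consistent for the SD under normal errors; the ordinary sample SD of `e_out`, by contrast, would be inflated far above 2.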

Three common functions chosen in M-estimation are given below:

1. Andrew's Sine:

\(\psi(z) = \begin{cases} \sin(z/c), & |z| \le c\pi \\ 0, & |z| > c\pi \end{cases}\)

where \(c \approx 1.339\).

2. Huber's Method:

\(\psi(z) = \begin{cases} z, & |z| \le c \\ c\,\mathrm{sign}(z), & |z| > c \end{cases}\)

where \(c \approx 1.345\).
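To give a concrete sense of the IRLS iteration for an M-estimator, here is a minimal sketch for Huber's method: the influence function implies per-observation weights \(w(z) = \psi(z)/z = \min(1, c/|z|)\) on the scaled residuals, so each iteration is just a weighted least squares fit. The data are simulated with a few gross outliers (an illustrative sketch, not the course's R analysis):

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """IRLS for Huber's M-estimator: weights w(z) = min(1, c/|z|) on
    MAD-scaled residuals, iterated until the coefficients stabilize."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    for _ in range(n_iter):
        r = y - X @ b
        tau = np.median(np.abs(r - np.median(r))) / 0.6745  # robust scale
        z = r / tau
        w = np.minimum(1.0, c / np.maximum(np.abs(z), 1e-12))
        b_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.allclose(b_new, b, atol=1e-10):
            b = b_new
            break
        b = b_new
    return b

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(0.0, 10.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)
y[:5] += 50.0                                         # five gross outliers
X = np.column_stack([np.ones(n), x])

b_ols_demo = np.linalg.lstsq(X, y, rcond=None)[0]
b_huber = huber_irls(X, y)
print("OLS:  ", b_ols_demo)
print("Huber:", b_huber)
```

The outliers receive weights near \(c\tau/|r_i|\) rather than 1, so the Huber fit stays close to the bulk of the data while OLS is pulled toward the contaminated points.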

3. Tukey's Biweight:

\(\psi(z) = \begin{cases} z\left(1 - (z/c)^2\right)^2, & |z| \le c \\ 0, & |z| > c \end{cases}\)

where \(c \approx 4.685\).

13.4 - Resistant Regression Methods

The next class of methods we discuss is often used interchangeably with robust regression methods. However, there is a subtle difference between the two that is not usually outlined in the literature. Whereas robust regression methods attempt only to dampen the influence of outlying cases, resistant regression methods use estimates that are not influenced by any outliers (this comes from the definition of resistant statistics, which are measures of the data that are not influenced by outliers, such as the median). This is best accomplished by trimming the data, which "trims" extreme values from either end (or both ends) of the range of data values.

The resistant regression estimators provided here are all based on the ordered residuals. Suppose we have a data set \(x_1, x_2, \ldots, x_n\). The order statistics are simply defined to be the data values arranged in increasing order and are written as \(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\). For example, the minimum and maximum of this data set are \(x_{(1)}\) and \(x_{(n)}\), respectively.

We present four commonly used resistant regression methods:

1. The least quantile of squares method minimizes the squared order residual (presumably selected as it is most representative of where the data is expected to lie) and is formally defined by

\(b_{LQS} = \arg\min_{\beta}\,\epsilon^2_{(\nu)}(\beta)\),

where \(\nu = P \cdot n\) is the P-th percentile of the empirical data (if \(\nu\) is not an integer, then specify \(\nu\) to be either the next greatest or lowest integer value).

2. A specific case of the least quantile of squares method where P = 0.5 (i.e., the median) is called the least median of squares method (and the estimate is often written as \(b_{LMS}\)).

3. The least trimmed sum of squares method minimizes the sum of the h smallest squared residuals and is formally defined by

\(b_{LTS} = \arg\min_{\beta}\sum_{i=1}^{h}\epsilon^2_{(i)}(\beta)\),

where \(h \le n\). If h = n, then you just obtain \(b_{OLS}\).

4. The least trimmed sum of absolute deviations method minimizes the sum of the h smallest absolute residuals and is formally defined by

\(b_{LTA} = \arg\min_{\beta}\sum_{i=1}^{h}|\epsilon(\beta)|_{(i)}\),

where again \(h \le n\). If h = n, then you just obtain \(b_{LAD}\).

The theoretical aspects of these methods that are often cited include their breakdown values and overall efficiency. Breakdown values are a measure of the proportion of contamination (due to outlying observations) that an estimation method can withstand and still remain robust against the outliers. The least quantile of squares method and least trimmed sum of squares method both have the same maximal breakdown value for certain choices of P and h. There is also one other relevant term when discussing resistant regression methods: efficiency, a measure of an estimator's variance relative to another estimator (when it is the smallest it can possibly be, the estimator is said to be "best"). As we will see, the least median of squares method is of low efficiency, whereas the least trimmed sum of squares method has the same efficiency (asymptotically) as certain M-estimators.
So which method from robust or resistant regression do we use? In order to guide you in the decision-making process, you will want to consider both the theoretical benefits of a certain method as well as the type of data you have. The theoretical aspects to weigh are the breakdown value and efficiency just described; as for your data, if there appear to be many outliers, then a method with a high breakdown value should be used. A preferred solution is to calculate many of these estimates for your data and compare their overall fits, but this will likely be computationally expensive.
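As an illustration of the least median of squares idea, the sketch below uses the common elemental-fit approximation for simple linear regression — evaluate the line through every pair of points and keep the one with the smallest median squared residual. This is an approximation to the exact LMS estimator, shown here on simulated data where more than a third of the points are contaminated:

```python
import numpy as np
from itertools import combinations

def lms_line(x, y):
    """Approximate least median of squares fit by elemental fits:
    try the line through each pair of points and keep the one
    minimizing the median squared residual."""
    best, best_med = None, np.inf
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = np.median((y - intercept - slope * x) ** 2)
        if med < best_med:
            best, best_med = (intercept, slope), med
    return best

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)
y[:15] += 30.0                      # contaminate over a third of the points
fit = lms_line(x, y)
print(fit)
```

Because the criterion only looks at the median squared residual, the 15 contaminated points never enter the objective for a line tracking the clean majority — a concrete view of the roughly 50% breakdown value discussed above.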

13.5 - Regression Depth

We have discussed the notion of ordering data (e.g., ordering the residuals). The applications we have presented with ordered data have all concerned univariate data sets. However, there are also techniques for ordering multivariate data sets. Statistical depth functions provide a center-outward ordering of multivariate observations, which allows one to define reasonable analogues of univariate order statistics. There are numerous depth functions, which we do not discuss here.

However, the notion of statistical depth is also used in the regression setting. Specifically, there is the notion of regression depth, which is a quality measure for robust linear regression. A regression hyperplane is called a nonfit if it can be rotated to horizontal (i.e., parallel to the axis of any of the predictor variables) without passing through any data points.1 (We count the points exactly on the hyperplane as "passed through.") A nonfit is a very poor regression hyperplane, because it is combinatorially equivalent to a horizontal hyperplane, which posits no relationship between predictor and response variables. The regression depth of a hyperplane (say, \(\mathcal{L}\)) is the minimum number of points whose removal makes \(\mathcal{L}\) into a nonfit. In other words, the regression depth of a hyperplane \(\mathcal{L}\) is the smallest number of residuals that need to change sign to make \(\mathcal{L}\) a nonfit. Hyperplanes with high regression depth behave well in general error models, including skewed error distributions and distributions with heteroscedastic errors.

For example, consider the data in the figure below. For this simple linear regression example, removing the red circles and rotating the regression line until horizontal (i.e., the dashed blue line) without passing through any remaining points demonstrates that the black line has regression depth 3.

[Scatterplot showing the fitted black line, the three red circled points, and the dashed blue horizontal line]

The regression depth of n points in p dimensions is upper bounded by \(\lceil n/(p+1) \rceil\), where p is the number of variables (i.e., the number of responses plus the number of predictors). For simple linear regression (p = 2), this means there is always a line with regression depth of at least \(\lceil n/3 \rceil\); moreover, there exist point sets for which no hyperplane has regression depth larger than this bound.

13.6 - Robust Regression Examples

Quality Measurements Dataset

Let us look at the three robust procedures discussed earlier for the quality measurements data set (quality_measure.txt [5]). A comparison of the M-estimators with the ordinary least squares estimator for this data set (analysis done in R since Minitab does not include these procedures) is given in the table below:

[Table of intercept and slope estimates with their p-values for OLS, Huber's Method, Tukey's Biweight, and Andrew's Sine]

Some of these regressions may be biased or altered from the traditional ordinary least squares line. While there is not much of a difference between the fits, it appears that Andrew's Sine method is producing the most significant values for the regression estimates. One may wish to then proceed with residual diagnostics and weigh the pros and cons of using this method over ordinary least squares.

When confronted with outliers, you may be confronted with the choice of other regression lines or hyperplanes to consider for your data. In such cases, regression depth can help provide a measure of a fitted line that best captures the effects due to outliers.

------

1 This definition also has convenient statistical properties, such as invariance under affine transformations, which we do not discuss in greater detail.

Links:
[1] https://onlinecourses.science.psu.edu/stat501/... (galton.txt)
[2] https://onlinecourses.science.psu.edu/stat501/files/ch14/ca_learning_new.txt
[3] https://onlinecourses.science.psu.edu/stat501/... (market_share.txt)
[4] https://onlinecourses.science.psu.edu/stat501/files/data/home_price.txt
[5] https://onlinecourses.science.psu.edu/stat501/files/ch13/quality_measure.txt

Source URL: https://onlinecourses.science.psu.edu/stat501/print/book/export/html/351