Regression and Least Squares: A MATLAB Tutorial

Dr. Michael D. Porter
porter@stat.ncsu.edu
Department of Statistics, North Carolina State University and SAMSI

Tuesday May 20, 2008

Introduction to Regression
Goal: Express the relationship between two (or more) variables by a mathematical formula.
x is the predictor (independent) variable
y is the response (dependent) variable

We specifically want to indicate how y varies as a function of x.
y(x) is considered a random variable, so it can never be predicted perfectly.

Example: Relating Shoe Size to Height
The problem

Footwear impressions are commonly observed at crime scenes. Of the many forensic properties that can be obtained from these impressions, one of particular interest is the shoe size. Detectives would like to estimate the height of the impression maker from the shoe size.

Example: Relating Shoe Size to Height
The data

[Figure: scatter plot "Determining Height from Shoe Size"; x-axis: Shoe Size (Mens), 6 to 15; y-axis: Height (in), 60 to 76]

Data taken from: http://staff.imsa.edu/~brazzle/E2Kcurr/Forensic/Tracks/TracksSummary.html

Example: Relating Shoe Size to Height
Your answers

[Figure: scatter plot "Determining Height from Shoe Size"; x-axis: Shoe Size (Mens); y-axis: Height (in)]

1. What is the predictor? What is the response?
2. Can the height of the impression maker be accurately estimated from the shoe size?
3. If a shoe is size 11, what would you advise the police?
4. What if the size is 7? Size 12.5?

General Regression Model
Assume the true model is of the form:
$$y(x) = m(x) + \epsilon(x)$$
The systematic part, $m(x)$, is deterministic
The error, $\epsilon(x)$, is a random variable
  Measurement error
  Natural variations due to exogenous factors
Therefore, $y(x)$ is also a random variable
The error is additive

Example: Sinusoid Function
$$y(x) = A \cdot \sin(\omega x + \phi) + \epsilon(x)$$
Amplitude $A$
Angular frequency $\omega$
Phase $\phi$
Random error $\epsilon(x) \sim N(0, \sigma^2)$
Plotted values: $A = 1$, $\omega = \pi/2$, $\phi = \pi$, $\sigma = 0.5$
[Figure: $y(x)$ and $m(x)$ versus $x$ for $0 \le x \le 10$]
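To make the model concrete, here is a minimal sketch that draws data from the sinusoid model. It uses plain Python rather than MATLAB so that it is self-contained. The function name `simulate_sinusoid`, the grid of x values, and the fixed seed are assumptions made for this illustration, not part of the workshop material.

```python
import math
import random

def simulate_sinusoid(xs, A=1.0, omega=math.pi / 2, phi=math.pi, sigma=0.5, seed=0):
    """Return (m, y): the deterministic mean m(x) and noisy observations y(x)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is reproducible
    m = [A * math.sin(omega * x + phi) for x in xs]
    y = [mi + rng.gauss(0.0, sigma) for mi in m]  # additive N(0, sigma^2) error
    return m, y

xs = [i * 0.1 for i in range(101)]  # hypothetical grid: 0, 0.1, ..., 10
m, y = simulate_sinusoid(xs)
```

Setting `sigma=0.0` returns the systematic part alone, which is a quick way to see the split between $m(x)$ and $\epsilon(x)$.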

Regression Modeling
We want to estimate $m(x)$ and possibly the distribution of $\epsilon(x)$
There are two general situations:
Theoretical Models
  $m(x)$ is of some known (or hypothesized) form but with some parameters unknown (e.g. Sinusoid Function with $A$, $\omega$, $\phi$ unknown)
Empirical Models
  $m(x)$ is constructed from the observed data (e.g. Shoe size and height)
We often end up using both: constructing models from the observed data and prior knowledge.

The Standard Assumptions
$$y(x) = m(x) + \epsilon(x)$$
A1: $E[\epsilon(x)] = 0 \quad \forall x$ (Mean 0)
A2: $\text{Var}[\epsilon(x)] = \sigma^2 \quad \forall x$ (Homoskedastic)
A3: $\text{Cov}[\epsilon(x), \epsilon(x')] = 0 \quad \forall x \ne x'$ (Uncorrelated)
These assumptions are only on the error term, $\epsilon(x) = y(x) - m(x)$

Residuals
The residuals $e(x_i) = y(x_i) - \hat{m}(x_i)$ can be used to check the estimated model, $\hat{m}(x)$. If the model fit is good, the residuals should satisfy our three assumptions.

A1 - Mean 0
[Figure: two residual plots of $e(x)$ versus $x$, one labeled "Violates A1" and one labeled "Satisfies A1"]

A2 - Constant Variance
[Figure: two residual plots of $e(x)$ versus $x$, one labeled "Violates A2" and one labeled "Satisfies A2"]

A3 - Uncorrelated
[Figure: two residual plots of $e(x)$ versus $x$, one labeled "Violates A3" and one labeled "Satisfies A3"]

Back to the Shoes
How can we estimate $m(x)$ for the shoe example?
(Non-parametric): For each shoe size, take the mean of the observed heights.
(Parametric): Assume the trend is linear.
[Figure: "Determining Height from Shoe Size" with Local Mean and Linear Trend estimates; x-axis: Shoe Size (Mens); y-axis: Height (in)]
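The non-parametric option can be sketched in a few lines: group the observations by shoe size and average the heights in each group. This is an illustration in plain Python. The helper name `local_means` and the data values are made up for the example and are not the workshop data set.

```python
from collections import defaultdict

def local_means(sizes, heights):
    """Non-parametric estimate of m(x): mean observed height at each shoe size."""
    groups = defaultdict(list)
    for s, h in zip(sizes, heights):
        groups[s].append(h)
    return {s: sum(hs) / len(hs) for s, hs in sorted(groups.items())}

# Made-up illustrative values, not the workshop data.
sizes = [9, 9, 10, 10, 10, 11]
heights = [67.0, 69.0, 68.0, 70.0, 72.0, 71.0]
m_hat = local_means(sizes, heights)
```

Note that this estimate is only defined at shoe sizes that were actually observed, which is one reason to also consider the parametric (linear) alternative.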

Simple Linear Regression
Simple linear regression assumes that $m(x)$ is of the parametric form
$$m(x) = \beta_0 + \beta_1 x$$
which is the equation for a line.

Simple Linear Regression
Which line is the best estimate?
$$m(x) = \beta_0 + \beta_1 x$$
           $\beta_0$   $\beta_1$
Line #1    48.6        1.9
Line #2    51.5        1.6
Line #3    45.0        2.3
[Figure: "Determining Height from Shoe Size" with Line #1, Line #2 and Line #3; x-axis: Shoe Size (Mens); y-axis: Height (in)]

Estimating Parameters in Linear Regression
Data
Write the observed data:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$$
where
$y_i \equiv y(x_i)$ is the response value for observation $i$
$\beta_0$ and $\beta_1$ are the unknown parameters (regression coefficients)
$x_i$ is the predictor value for observation $i$
$\epsilon_i \equiv \epsilon(x_i)$ is the random error for observation $i$

Estimating Parameters in Linear Regression
Statistical Decision Theory
Let $g(x) \equiv g(x, \beta)$ be an estimator for $y(x)$
Define a Loss Function, $L(y(x), g(x))$, which describes how far $g(x)$ is from $y(x)$
Example: Squared Error Loss
$$L(y(x), g(x)) = (y(x) - g(x))^2$$
The best predictor minimizes the Risk (or expected Loss)
$$R(x) = E[L(y(x), g(x))]$$
$$g^*(x) = \arg\min_{g \in G} E[L(y(x), g(x))]$$

Estimating Parameters in Linear Regression
Method of Least Squares
If we assume a squared error loss function
$$L(y_i, m_i) = (y_i - (\beta_0 + \beta_1 x_i))^2$$
An approximation to the Risk function is the Sum of Squared Errors (SSE):
$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$
Then it makes sense to estimate $(\beta_0, \beta_1)$ as the values that minimize $R(\beta_0, \beta_1)$
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} R(\beta_0, \beta_1)$$

Estimating Parameters in Linear Regression
Derivation of Linear Least Squares Solution
$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$
Differentiate the Risk function with respect to the unknown parameters and equate to 0
$$\frac{\partial R}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i)) = 0$$
$$\frac{\partial R}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - (\beta_0 + \beta_1 x_i)) = 0$$
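As a supplementary worked step (not on the original slide), the two equations can be solved in sequence. The first gives $\beta_0$ in terms of $\beta_1$; substituting that into the second isolates $\beta_1$. Here $\bar{x}$ and $\bar{y}$ denote the sample means, so $\sum x_i = n\bar{x}$ and $\sum y_i = n\bar{y}$.

$$
\begin{aligned}
\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0
  \;&\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x} \\
\sum_{i=1}^{n} x_i y_i - \beta_0\, n\bar{x} - \beta_1 \sum_{i=1}^{n} x_i^2 = 0
  \;&\Rightarrow\; \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \beta_1 \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)
\end{aligned}
$$

Dividing the last line by $\sum x_i^2 - n\bar{x}^2$ gives the slope estimate, provided the $x_i$ are not all equal.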

Estimating Parameters in Linear Regression
Linear Least Squares Solution
$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$
The least squares estimates are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$'s and $y_i$'s.
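The closed-form estimates translate directly into code. The following is a minimal sketch in plain Python (the workshop demos themselves use MATLAB). The function name `fit_line` and the data are assumptions for illustration: the points are made up to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# Made-up points lying exactly on y = 50 + 2x (not the workshop data).
x = [8.0, 9.0, 10.0, 11.0, 12.0]
y = [66.0, 68.0, 70.0, 72.0, 74.0]
b0, b1 = fit_line(x, y)
```

With noise-free data the fit recovers the generating line; with real data the same two formulas give the line that minimizes the SSE.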

And the winner is ... Line #2!
For these data:
$\bar{x} = 11.03$, $\bar{y} = 69.31$
$\hat{\beta}_0 = 51.46$, $\hat{\beta}_1 = 1.62$
[Figure: "Determining Height from Shoe Size" with Line #1, Line #2 and Line #3; x-axis: Shoe Size (Mens); y-axis: Height (in)]

Residuals
The fitted value, $\hat{y}_i$, for the $i$th observation is
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
The residual, $e_i$, is the difference between the observed and fitted value
$$e_i = y_i - \hat{y}_i$$
The residuals are used to check if our three assumptions appear valid
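A short sketch of the residual calculation, in plain Python with made-up data. The helper name `residuals` is hypothetical, and the points were chosen so that the least squares line is exactly 1 + 2x. For a least squares fit with an intercept, setting the two partial derivatives to zero implies that the residuals sum to zero and are orthogonal to the predictor, which the example lets you confirm.

```python
def residuals(x, y, b0, b1):
    """Return e_i = y_i - (b0 + b1 * x_i) for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up data whose least squares line is y_hat = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.5, 2.5, 4.5, 7.5]
e = residuals(x, y, b0=1.0, b1=2.0)

sum_e = sum(e)                                   # first normal equation
sum_xe = sum(xi * ei for xi, ei in zip(x, e))    # second normal equation
```

Both sums being zero only confirms that the line is the least squares fit; whether A1 to A3 look plausible still has to be judged from a residual plot.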

Residuals for shoe size data
[Figure: residual plot for "Determining Height from Shoe Size"; x-axis: Shoe Size (Mens); y-axis: residual, -5 to 5]

Example of poor fit
[Figure: two panels, "Scatter Plot" ($y(x)$ versus $x$) and "Residual Plot" ($e(x)$ versus $x$)]

Adding Polynomial Terms in the Linear Model
Modeling the mean trend as a line does not fit well in the above example. There is a systematic lack of fit. Consider a polynomial form for the mean
$$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_p x^p = \sum_{k=0}^{p} \beta_k x^k$$
This is still considered a linear model
  $m(x)$ is a linear combination of the $\beta_k$
Danger of over-fitting

Quadratic Fit: $y(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon(x)$
[Figure: "Scatter Plot" with 1st Order and Quadratic fits ($y(x)$ versus $x$); "Residual Plot (Quadratic Fit)" ($e(x)$ versus $x$)]

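Matrix Approach to Linear Least Squares
Setup
Previously, we wrote our data as $y_i = \sum_{k=0}^{p} \beta_k x_i^k + \epsilon_i$. In matrix notation this becomes
$$Y = X\beta + \epsilon$$
$$
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^p \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$
How many unknown parameters are in the model?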

Matrix Approach to Linear Least Squares
Solution
To minimize SSE (Sum of Squared Errors), use Risk function
$$R(\beta) = (Y - X\beta)^T (Y - X\beta)$$
Taking derivative w.r.t. $\beta$ gives the Normal Equations
$$X^T X \beta = X^T Y$$
The least squares solution for $\beta$ is
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
Hint: See "Linear Inverse Problems: A MATLAB Tutorial" by Qin Zhang
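The normal equations can be sketched for a polynomial model as follows. This is a plain-Python illustration, not the workshop's MATLAB demo. The helper names `solve` and `polyfit_normal` are hypothetical, the linear system is solved by Gaussian elimination instead of forming the inverse, and the data are made up to lie exactly on a quadratic.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def polyfit_normal(x, y, p):
    """Coefficients [b0, ..., bp] solving the normal equations X'X b = X'Y."""
    X = [[xi ** k for k in range(p + 1)] for xi in x]  # design matrix
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p + 1)]
           for i in range(p + 1)]
    XtY = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p + 1)]
    return solve(XtX, XtY)

# Made-up data lying exactly on y = 1 + 2x + 3x^2.
x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]
beta = polyfit_normal(x, y, p=2)
```

Solving the system directly is generally preferred to computing $(X^T X)^{-1}$ explicitly; in MATLAB the backslash operator (`X\Y`) plays this role.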

STRETCH BREAK!!!

MATLAB Demonstration
Linear Least Squares
MATLAB Demo #1
Open Regression_Intro.m

Model Selection
How can we compare and select a final model?
How many terms should be included in polynomial models?
What is the danger of over-fitting? (Including too many terms)
What is the problem with under-fitting? (Not including enough terms)

Estimating Variance
Recall assumptions A1, A2, and A3.
For our fitted model, the residuals $e_i = y_i - \hat{y}_i$ can be used to estimate $\text{Var}[\epsilon(x)]$.
An estimator for the variance is ...
Hint: See "Basic Statistical Concepts and Some Probability Essentials" by Justin Shows and Betsy Enstrom
The Sample Variance
$$s_z^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (z_i - \bar{z})^2$$

Estimating Variance
Sample Variance for a rv $z$
$$s_z^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (z_i - \bar{z})^2$$
The estimator for the regression problem is similar
$$\hat{\sigma}^2_\epsilon = \frac{1}{n - (p + 1)} \sum_{i=1}^{n} e_i^2 = \frac{SSE}{df}$$
where the degrees of freedom $df = n - (p + 1)$. There are $p + 1$ unknown parameters in the model.
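The variance estimator is a one-liner once the fitted values are available. A minimal sketch in plain Python follows. The function name `sigma2_hat` and the numbers are assumptions for illustration, with fitted values taken from a hypothetical straight-line fit ($p = 1$).

```python
def sigma2_hat(y, y_fit, p):
    """Estimate Var[eps] as SSE / df, with df = n - (p + 1)."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
    df = n - (p + 1)
    if df <= 0:
        raise ValueError("need more observations than parameters")
    return sse / df

# Made-up observations and fitted values from a straight-line fit (p = 1).
y = [1.5, 2.5, 4.5, 7.5]
y_fit = [1.0, 3.0, 5.0, 7.0]
s2 = sigma2_hat(y, y_fit, p=1)
```

Dividing by $n - (p + 1)$ rather than $n$ accounts for the $p + 1$ coefficients that were estimated from the same data.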

Statistical Inference
An additional assumption
In order to calculate confidence intervals (C.I.), we need a distributional assumption on $\epsilon(x)$. Up to now, we haven't needed one
The standard assumption is to assume a Normal or Gaussian distribution
A4: $\epsilon(x) \sim N(0, \sigma^2)$

Statistical Inference
Distributions
Using
$$y(x_0) = x_0^T \beta + \epsilon(x_0)$$
$$y(x_0) \sim N(x_0^T \beta, \sigma^2)$$
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
where $x_0$ is a point in design space, and the 4 assumptions, we find
$$\hat{m}(x_0) \sim N\left(x_0^T \beta,\ \sigma^2 x_0^T (X^T X)^{-1} x_0\right)$$
$$\hat{y}(x_0) \sim N\left(x_0^T \beta,\ \sigma^2 (1 + x_0^T (X^T X)^{-1} x_0)\right)$$
$$\hat{\beta} \sim MVN\left(\beta,\ \sigma^2 (X^T X)^{-1}\right)$$
From these we can find CI's and perform hypothesis tests.
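As a supplement (not taken from the slides), here is how these distributions typically turn into intervals. In practice $\sigma^2$ is unknown, so it is replaced by the estimate $\hat{\sigma}^2_\epsilon$, and the Normal quantile is replaced by a Student $t$ quantile with $n - (p + 1)$ degrees of freedom. Under A1 to A4, the standard $100(1 - \alpha)\%$ confidence interval for the mean $m(x_0)$ and prediction interval for a new observation $y(x_0)$ are:

$$
\begin{aligned}
m(x_0):&\quad x_0^T \hat{\beta} \;\pm\; t_{1-\alpha/2,\; n-(p+1)}\; \hat{\sigma}_\epsilon \sqrt{x_0^T (X^T X)^{-1} x_0} \\
y(x_0):&\quad x_0^T \hat{\beta} \;\pm\; t_{1-\alpha/2,\; n-(p+1)}\; \hat{\sigma}_\epsilon \sqrt{1 + x_0^T (X^T X)^{-1} x_0}
\end{aligned}
$$

The prediction interval is always wider, because it includes the variance of the new error term in addition to the uncertainty in $\hat{\beta}$.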

Model Comparison
$R^2$
Sum of Squares Error
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = e'e$$
Sum of Squares Total
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
This is the model with intercept only, $\hat{y}(x) = \bar{y}$.
Coefficient of Determination
$$R^2 = 1 - \frac{SSE}{SST}$$
$R^2$ is a measure of how much better a regression model is than the intercept only.

Model Comparison
Adjusted $R^2$
What happens to $R^2$ if you add more terms in the model?
$$R^2 = 1 - \frac{SSE}{SST}$$
Adjusted $R^2$ penalizes by the number of terms $(p + 1)$ in the model
$$R^2_{adj} = 1 - \frac{SSE/(n - (p + 1))}{SST/(n - 1)} = 1 - \frac{\hat{\sigma}^2_\epsilon}{SST/(n - 1)}$$
Also see residual plots, Mallows' $C_p$, PRESS (cross-validation), AIC, etc.
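Both quantities can be computed from the observations and fitted values alone. The sketch below is plain Python with made-up numbers. The function name `r_squared` is hypothetical, and the fitted values are assumed to come from a straight-line fit ($p = 1$).

```python
def r_squared(y, y_fit, p):
    """Return (R2, adjusted R2) for a model with p + 1 fitted coefficients."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - (p + 1))) / (sst / (n - 1))
    return r2, r2_adj

# Made-up observations and fitted values from a straight-line fit (p = 1).
y = [1.5, 2.5, 4.5, 7.5]
y_fit = [1.0, 3.0, 5.0, 7.0]
r2, r2_adj = r_squared(y, y_fit, p=1)
```

Adding a term to a least squares fit cannot increase SSE, so $R^2$ cannot decrease, while the adjusted version can go down when the extra term does not reduce SSE enough to offset the lost degree of freedom.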

MATLAB Demonstration
cftool
MATLAB Demo #2
Type cftool

Nonlinear Regression
A linear regression model can be written
$$y(x) = \sum_{k=0}^{p} \beta_k h_k(x) + \epsilon(x)$$
The mean, $m(x)$, is a linear combination of the $\beta$'s
Nonlinear regression takes the general form
$$y(x) = m(x, \beta) + \epsilon(x)$$
for some specified function $m(x, \beta)$ with unknown parameters $\beta$.
Example
The sinusoid we looked at earlier
$$y(x) = A \cdot \sin(\omega x + \phi) + \epsilon(x)$$
with parameters $\beta = (A, \omega, \phi)$ is a nonlinear model.

Nonlinear Regression
Parameter Estimation
Making same assumptions as in linear regression (A1-A3), the least squares solution is still valid
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - m(x_i, \beta))^2$$
Unfortunately, this usually doesn't have a closed form solution (like in the linear case)
Approaches to finding the solution will be discussed later in the workshop
But that won't stop us from using nonlinear (and nonparametric) regression in MATLAB!
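To illustrate what an iterative solution can look like, here is a sketch of Gauss-Newton iterations for a one-parameter model. Gauss-Newton is one common choice, picked here for illustration; the slides do not say which methods the workshop covers. The model $m(x, \beta) = e^{-\beta x}$, the function name `gauss_newton_decay`, the starting value, and the noise-free data are all assumptions.

```python
import math

def gauss_newton_decay(x, y, beta0, n_iter=50, tol=1e-12):
    """Fit m(x, beta) = exp(-beta * x) by Gauss-Newton iterations."""
    beta = beta0
    for _ in range(n_iter):
        r = [yi - math.exp(-beta * xi) for xi, yi in zip(x, y)]  # residuals
        J = [-xi * math.exp(-beta * xi) for xi in x]             # dm/dbeta
        # Linearize m around the current beta and solve the 1-D least squares step.
        step = sum(Ji * ri for Ji, ri in zip(J, r)) / sum(Ji * Ji for Ji in J)
        beta += step
        if abs(step) < tol:
            break
    return beta

# Noise-free made-up data generated with beta = 0.7.
x = [0.5, 1.0, 1.5, 2.0, 3.0]
y = [math.exp(-0.7 * xi) for xi in x]
beta_hat = gauss_newton_decay(x, y, beta0=0.3)
```

Each iteration solves a linear least squares problem for the update, which is why the linear theory remains useful here. Unlike the linear case, the result can depend on the starting value.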

Off again to cftool
MATLAB Demo #3

Weighted Regression
Consider the risk functions we have considered so far
$$R(\beta) = \sum_{i=1}^{n} (y_i - m(x_i, \beta))^2$$
Each observation contributes equally to the risk
Weighted regression uses the risk function
$$R_w(\beta) = \sum_{i=1}^{n} w_i (y_i - m(x_i, \beta))^2$$
so observations with larger weights are more important.
Some examples
Heteroskedastic (Non-constant variance): $w_i = 1/\sigma_i^2$, $w_i = 1/x_i$, $w_i = 1/y_i$
Robust Regression: $w_i = k/|e_i|$
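For a straight line, the weighted risk has a closed-form minimizer that mirrors the unweighted one, with weighted means in place of ordinary means. The following plain-Python sketch uses a hypothetical helper `fit_line_weighted` and made-up data in which the last point is an outlier.

```python
def fit_line_weighted(x, y, w):
    """Minimize sum_i w_i * (y_i - (b0 + b1 * x_i))**2 over (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data: the first three points lie on y = 1 + 2x, the last is an outlier.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 20.0]
b0_eq, b1_eq = fit_line_weighted(x, y, [1.0, 1.0, 1.0, 1.0])  # equal weights
b0_w, b1_w = fit_line_weighted(x, y, [1.0, 1.0, 1.0, 0.0])    # outlier weight 0
```

With equal weights the outlier pulls the slope up to 5.9; giving it zero weight recovers the line through the other three points. Intermediate weights interpolate between these two fits.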

Transformations
Sometimes transformations are used to obtain better models
Transform predictors $x \to x'$
Transform response $y \to y'$
Make sure assumptions A1-A3, A4 are still valid
Standardized: $x' = (x - \bar{x})/s_x$
Log: $y' = \log(y)$
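Both example transformations are simple to apply before fitting. A minimal plain-Python sketch follows, with a hypothetical helper `standardize` and made-up values; $s_x$ is taken to be the sample standard deviation (divisor $n - 1$), which is an assumption about the slide's notation.

```python
import math

def standardize(x):
    """Return (x_i - x_bar) / s_x, with s_x the sample standard deviation."""
    n = len(x)
    x_bar = sum(x) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    return [(xi - x_bar) / s_x for xi in x]

# Made-up values for illustration.
x = [6.0, 8.0, 10.0, 12.0, 14.0]
x_std = standardize(x)                 # mean 0, sample variance 1
y = [1.0, 10.0, 100.0]
y_log = [math.log(yi) for yi in y]     # natural log; requires y > 0
```

After transforming, the residual checks for A1 to A4 should be repeated on the transformed scale, since a transformation of the response changes the error structure.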

The Competition
Contest to see who can construct the best model in cftool
Get into groups
Data can be found in competition data.m
Scoring will be performed on testing set
Want to minimize sum of squared errors
When group is ready, enter model into this computer

MATLAB Help
There is lots of good assistance in the MATLAB help window
Specifically, look at the Demos tab on the help window
The Toolboxes of Statistics (Regression) and Optimization may be particularly useful for this workshop

Have a great workshop!