
REGRESSION MODELLING

David M. Lane et al. Introduction to Statistics, pp. 462–516



Contents

1 What is Linear Regression?

2 Properties of a Regression Line

3 Simple Linear Regression Example

4 Regression models in R





Introduction

Recall that in a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.

Linear regression is used to predict or estimate the value of a dependent variable by modelling it against one or more independent variables.
The variables must be paired and continuous, and they are assumed to have a linear relationship.
This technique is widely used in predictive analysis.

Here, we focus on the case where there is only one independent variable. This is called simple regression (as opposed to multiple regression, which handles two or more independent variables).
Least squares linear regression is a method for predicting the value of a dependent variable Y based on the value of an independent variable X.



An Introductory Example
Example data: a small bivariate data set (the table itself is not reproduced here).

[Figure: a scatter plot of the example data. The black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction (called residuals).]




Prerequisites for Regression

Simple linear regression is appropriate when the following conditions are satisfied (a diagnostic sketch in R follows the list).

1 The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
2 For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
3 For any given value of X,
  - the Y values are independent, as indicated by a random pattern on the residual plot;
  - the Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is acceptable if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
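These checks can be sketched in base R. A minimal sketch, assuming paired numeric vectors x and y are already defined:

Example
> m <- lm(y ~ x)             # fit the line
> plot(x, y); abline(m)      # condition 1: does the scatterplot look linear?
> plot(fitted(m), resid(m))  # conditions 2 and 3: constant spread, random pattern?
> abline(h = 0)
> hist(resid(m))             # condition 3: roughly symmetric and unimodal?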



The Least Squares Regression Line

Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set.
Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

Y = B0 + B1·X

where B0 is a constant, B1 is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:

ŷ = b0 + b1·x

where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.



How to Define a Regression Line
The function that describes x and y is:

yi = α + β·xi + εi

To find the regression estimates b0 and b1, one has to solve the following minimization problem:

find min over (b0, b1) of Q(b0, b1), where Q(b0, b1) = ∑ εi² = ∑ (yi − b0 − b1·xi)², summing over i = 1, …, n.

Using calculus we obtain:

b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = r_xy · (sy/sx)

b0 = ȳ − b1·x̄

where r_xy is the sample correlation coefficient between x and y; and sx and sy are the sample standard deviations of x and y. A horizontal bar over a quantity indicates the average value of that quantity.
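A direct R translation of these estimators; a minimal sketch, assuming paired numeric vectors x and y:

Example
> b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
> b0 <- mean(y) - b1 * mean(x)
> cor(x, y) * sd(y) / sd(x)   # same slope via r * sy/sx (the n-1 factors cancel)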


Properties of the Regression Line

When the regression parameters (b0 and b1) are defined as described above, the regression line has the following properties.
The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
The regression line passes through the mean of the X values (x̄) and through the mean of the Y values (ȳ).
The regression constant (b0) is equal to the y intercept of the regression line.
The regression coefficient (b1) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.
The least squares regression line is the only straight line that has all of these properties.
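These properties are easy to confirm numerically. A small sketch, again assuming vectors x and y and the fitted model m <- lm(y ~ x):

Example
> predict(m, newdata = data.frame(x = mean(x)))  # equals mean(y): the line passes through (x̄, ȳ)
> coef(m)                                        # b0 (the y intercept) and b1 (the slope)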



The Coefficient of Determination

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.



The Coefficient of Determination (2)

Coefficient of determination
The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = [ ∑(xi − x̄)(yi − ȳ) / (N·σx·σy) ]²

where N is the number of observations used to fit the model, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.

If you know the linear correlation (r) between two variables, then the coefficient of determination (R²) is easily computed using the following formula: R² = r².
The standard error about the regression line (often denoted by SE) is a measure of the average amount that the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error; and the more accurate predictions are likely to be.
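In R, both quantities are one-liners. A sketch, assuming data vectors x and y and a fitted model lmMod (as in the final section of this lecture):

Example
> cor(x, y)^2               # R-squared via the correlation: R^2 = r^2
> summary(lmMod)$r.squared  # the same value reported by lm()
> sigma(lmMod)              # the residual standard error about the regression line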





Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
What linear regression equation best predicts statistics performance, based on math aptitude scores?
If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
How well does the regression equation fit the data?



Finding the Regression Equation
In the table below, the xi and yi columns present scores on the aptitude test and statistics grades, respectively (the same five pairs are used in the R examples later in this lecture):

xi: 95  85  80  70  60    (mean x̄ = 78)
yi: 85  95  70  65  70    (mean ȳ = 77)

Then

b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = 470/730 = 0.644
b0 = ȳ − b1·x̄ = 77 − 0.644 · 78 = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x.
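The intermediate sums and estimates can be reproduced in R. (Computing b0 from the unrounded slope gives 26.781, matching the lm() output in the last section; the 26.768 above comes from rounding b1 to 0.644 first.)

Example
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> sum((x - mean(x)) * (y - mean(y)))  # 470
> sum((x - mean(x))^2)                # 730
> b1 <- 470/730
> mean(y) - b1 * mean(x)              # 26.78082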



Usage of the Regression Equation

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

ŷ = 26.768 + 0.644x = 26.768 + 0.644 · 80 = 26.768 + 51.52 = 78.288

Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. Such an extrapolation can produce unreasonable estimates.
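With the model lmMod fitted in the final section, predict() gives the same estimate (the small difference from 78.288 is rounding in the hand calculation):

Example
> predict(lmMod, newdata = data.frame(x = 80))
       1
78.28767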



Finding the Coefficient of Determination
To assess how well the regression equation fits the data, the coefficient of determination can be checked.
For our example:

σx = √( ∑(xi − x̄)² / N ) = √(730/5) = √146 = 12.083
σy = √( ∑(yi − ȳ)² / N ) = √(630/5) = √126 = 11.225

R² = [ ∑(xi − x̄)(yi − ȳ) / (N·σx·σy) ]² = [ 470 / (5 · 12.083 · 11.225) ]² = (94/135.632)² = (0.693)² = 0.480

A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.
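A one-line check, using the x and y score vectors from this example; it agrees with the Multiple R-squared of 0.4803 reported by summary(lmMod) in the next section:

Example
> cor(x, y)^2
[1] 0.4803218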




How To Fit Linear Regression Models?

Use the lm() function to fit linear models

Example
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> lmMod <- lm (y ~ x)
> lmMod

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
26.7808 0.6438



How To Fit Linear Regression Models?
Use the lm() function to fit linear models

Example: complete information


> lmMod <- lm (y ~ x)
> summary(lmMod)

Call:
lm(formula = y ~ x)

Residuals:
1 2 3 4 5
-2.945 13.493 -8.288 -6.849 4.589

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.7808 30.5182 0.878 0.445
x 0.6438 0.3866 1.665 0.194

Residual standard error: 10.45 on 3 degrees of freedom


Multiple R-squared: 0.4803, Adjusted R-squared: 0.3071
F-statistic: 2.773 on 1 and 3 DF, p-value: 0.1945


How To Fit Linear Regression Models?
Use the lm() function to fit linear models
Example: Plotting the results
> plot(y~x, ylim=c(20,100))
> abline(lmMod$coefficients[1], lmMod$coefficients[2], col="red", lwd=3)
or
> plot(y~x, ylim=c(20,100))
> abline(lmMod,col="red", lwd=3)



How To Fit Linear Regression Models?
Use the lm() function to fit linear models
Example: Plotting the residuals
> lmMod$residuals
1 2 3 4 5
-2.945205 13.493151 -8.287671 -6.849315 4.589041
and then
> plot(lmMod$residuals, col="blue", lwd=2)
> abline(h=0)



Residual analysis

The residual plots show three typical patterns: a random scatter around zero, which indicates a good linear fit; a systematic curve (e.g., U-shaped), which indicates a nonlinear relationship; and a funnel shape, which indicates nonconstant variance. Only the first pattern supports using the fitted line as-is.



Residual analysis

There are many ways to transform variables to achieve linearity for regression analysis. Common methods include taking the logarithm, square root, or reciprocal of y or of x, and taking logarithms of both variables to linearize a power relationship.
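For example, an exponential-looking relationship y ≈ a·e^(bx) can be linearized by regressing log(y) on x. A minimal sketch (logMod is an illustrative name; choose the transformation that straightens the residual plot):

Example
> logMod <- lm(log(y) ~ x)         # fit on the transformed scale
> plot(x, log(y)); abline(logMod)  # check that the transformed data look linear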

