
REGRESSION MODELLING

David M. Lane et al. Introduction to Statistics, pp. 462–516



Contents

1 What is Linear Regression?

2 Properties of a Regression Line

3 Simple Linear Regression Example

4 Regression models in R





Introduction

Recall that in a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.

Linear regression is used to predict or estimate the value of a dependent variable by modelling it against one or more independent variables.
The variables must be paired and continuous, and they are assumed to have a linear relationship.
This technique is widely used in predictive analysis.

Here, we focus on the case where there is only one independent variable. This is called simple regression (as opposed to multiple regression, which handles two or more independent variables).
Least squares linear regression is a method for predicting the value of a dependent variable Y based on the value of an independent variable X.



An Introductory Example
Example data: a small bivariate data set (the table itself is not reproduced here).

[Figure: a scatter plot of the example data. The black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction (called residuals).]




Prerequisites for Regression

Simple linear regression is appropriate when the following conditions are satisfied (a diagnostic sketch in R follows the list).

1 The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
2 For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
3 For any given value of X,
  - the Y values are independent, as indicated by a random pattern on the residual plot;
  - the Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is acceptable if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
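These checks can be sketched in base R. A minimal sketch, assuming paired numeric vectors x and y are already defined:

Example
> m <- lm(y ~ x)             # fit the line
> plot(x, y); abline(m)      # condition 1: does the scatterplot look linear?
> plot(fitted(m), resid(m))  # conditions 2 and 3: constant spread, random pattern?
> abline(h = 0)
> hist(resid(m))             # condition 3: roughly symmetric and unimodal?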



The Least Squares Regression Line

Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set.
Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

Y = B0 + B1·X

where B0 is a constant, B1 is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:

ŷ = b0 + b1·x

where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.



How to Define a Regression Line
The function that describes x and y is:

yi = α + β·xi + εi

To find the regression estimates b0 and b1, one has to solve the following minimization problem:

find min over (b0, b1) of Q(b0, b1), where Q(b0, b1) = ∑ εi² = ∑ (yi − b0 − b1·xi)², summing over i = 1, …, n.

Using calculus we obtain:

b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = r_xy · (sy/sx)

b0 = ȳ − b1·x̄

where r_xy is the sample correlation coefficient between x and y; and sx and sy are the sample standard deviations of x and y. A horizontal bar over a quantity indicates the average value of that quantity.
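A direct R translation of these estimators; a minimal sketch, assuming paired numeric vectors x and y:

Example
> b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
> b0 <- mean(y) - b1 * mean(x)
> cor(x, y) * sd(y) / sd(x)   # same slope via r * sy/sx (the n-1 factors cancel)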


Properties of the Regression Line

When the regression parameters (b0 and b1) are defined as described above, the regression line has the following properties.
The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
The regression line passes through the mean of the X values (x̄) and through the mean of the Y values (ȳ).
The regression constant (b0) is equal to the y intercept of the regression line.
The regression coefficient (b1) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.
The least squares regression line is the only straight line that has all of these properties.
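These properties are easy to confirm numerically. A small sketch, again assuming vectors x and y and the fitted model m <- lm(y ~ x):

Example
> predict(m, newdata = data.frame(x = mean(x)))  # equals mean(y): the line passes through (x̄, ȳ)
> coef(m)                                        # b0 (the y intercept) and b1 (the slope)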



The Coefficient of Determination

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.



The Coefficient of Determination (2)

Coefficient of determination
The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = [ ∑(xi − x̄)(yi − ȳ) / (N·σx·σy) ]²

where N is the number of observations used to fit the model, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.

If you know the linear correlation (r) between two variables, then the coefficient of determination (R²) is easily computed using the following formula: R² = r².
The standard error about the regression line (often denoted by SE) is a measure of the average amount that the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error; and the more accurate predictions are likely to be.
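In R, both quantities are one-liners. A sketch, assuming data vectors x and y and a fitted model lmMod (as in the final section of this lecture):

Example
> cor(x, y)^2               # R-squared via the correlation: R^2 = r^2
> summary(lmMod)$r.squared  # the same value reported by lm()
> sigma(lmMod)              # the residual standard error about the regression line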





Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
What linear regression equation best predicts statistics performance, based on math aptitude scores?
If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
How well does the regression equation fit the data?



Finding the Regression Equation
In the table below, the xi and yi columns present scores on the aptitude test and statistics grades, respectively (the same five pairs are used in the R examples later in this lecture):

xi: 95  85  80  70  60    (mean x̄ = 78)
yi: 85  95  70  65  70    (mean ȳ = 77)

Then

b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = 470/730 = 0.644
b0 = ȳ − b1·x̄ = 77 − 0.644 · 78 = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x.
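The intermediate sums and estimates can be reproduced in R. (Computing b0 from the unrounded slope gives 26.781, matching the lm() output in the last section; the 26.768 above comes from rounding b1 to 0.644 first.)

Example
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> sum((x - mean(x)) * (y - mean(y)))  # 470
> sum((x - mean(x))^2)                # 730
> b1 <- 470/730
> mean(y) - b1 * mean(x)              # 26.78082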



Usage of the Regression Equation

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

ŷ = 26.768 + 0.644x = 26.768 + 0.644 · 80 = 26.768 + 51.52 = 78.288

Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. Such an extrapolation can produce unreasonable estimates.
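With the model lmMod fitted in the final section, predict() gives the same estimate (the small difference from 78.288 is rounding in the hand calculation):

Example
> predict(lmMod, newdata = data.frame(x = 80))
       1
78.28767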



Finding the Coefficient of Determination
To assess how well the regression equation fits the data, the coefficient of determination can be checked.
For our example:

σx = √( ∑(xi − x̄)² / N ) = √(730/5) = √146 = 12.083
σy = √( ∑(yi − ȳ)² / N ) = √(630/5) = √126 = 11.225

R² = [ ∑(xi − x̄)(yi − ȳ) / (N·σx·σy) ]² = [ 470 / (5 · 12.083 · 11.225) ]² = (94/135.632)² = (0.693)² = 0.480

A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.
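A one-line check, using the x and y score vectors from this example; it agrees with the Multiple R-squared of 0.4803 reported by summary(lmMod) in the next section:

Example
> cor(x, y)^2
[1] 0.4803218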




How To Fit Linear Regression Models?

Use the lm() function to fit linear models

Example
> x <- c(95, 85, 80, 70, 60)
> y <- c(85, 95, 70, 65, 70)
> lmMod <- lm (y ~ x)
> lmMod

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
26.7808 0.6438



How To Fit Linear Regression Models?
Use the lm() function to fit linear models

Example: complete information


> lmMod <- lm (y ~ x)
> summary(lmMod)

Call:
lm(formula = y ~ x)

Residuals:
1 2 3 4 5
-2.945 13.493 -8.288 -6.849 4.589

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.7808 30.5182 0.878 0.445
x 0.6438 0.3866 1.665 0.194

Residual standard error: 10.45 on 3 degrees of freedom


Multiple R-squared: 0.4803, Adjusted R-squared: 0.3071
F-statistic: 2.773 on 1 and 3 DF, p-value: 0.1945


How To Fit Linear Regression Models?
Use the lm() function to fit linear models
Example: Plotting the results
> plot(y~x, ylim=c(20,100))
> abline(lmMod$coefficients[1], lmMod$coefficients[2], col="red", lwd=3)
or
> plot(y~x, ylim=c(20,100))
> abline(lmMod,col="red", lwd=3)



How To Fit Linear Regression Models?
Use the lm() function to fit linear models
Example: Plotting the residuals
> lmMod$residuals
1 2 3 4 5
-2.945205 13.493151 -8.287671 -6.849315 4.589041
and then
> plot(lmMod$residuals, col="blue", lwd=2)
> abline(h=0)



Residual analysis

The residual plots show three typical patterns: a random scatter around zero, which indicates a good linear fit; a systematic curve (e.g., U-shaped), which indicates a nonlinear relationship; and a funnel shape, which indicates nonconstant variance. Only the first pattern supports using the fitted line as-is.



Residual analysis

There are many ways to transform variables to achieve linearity for regression analysis. Common methods include taking the logarithm, square root, or reciprocal of y or of x, and taking logarithms of both variables to linearize a power relationship.
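For example, an exponential-looking relationship y ≈ a·e^(bx) can be linearized by regressing log(y) on x. A minimal sketch (logMod is an illustrative name; choose the transformation that straightens the residual plot):

Example
> logMod <- lm(log(y) ~ x)         # fit on the transformed scale
> plot(x, log(y)); abline(logMod)  # check that the transformed data look linear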

