
Regression

Lecture Objectives
1. Understand linear regression
2. Interpret the coefficients
Basic Points
In industry and business one often encounters problems that require the analysis of more than one variable. In such cases it might be necessary to determine whether relationships between variables exist and, if so, to analyse them appropriately. That is, to:
• determine whether a relationship exists;
• describe the relationship with an equation; and
• use the equation to make predictions.
Examples:
• Is there a relationship between income and expenditure?
• Does the amount spent on advertising influence the volume of business?
• Is there a relationship between the inflation rate and the interest rate charged by banks?
Analytical Techniques
There are two techniques used to estimate a relationship that may exist between variables:
1. Regression analysis involves a mathematical equation that explains how the variables are related. The equation is also used to predict future values of one variable based on the data of the other variable.
2. Correlation analysis determines the strength and direction of the relationship between variables.
N.B.: A scatter diagram gives a rough idea of how the variables X and Y may be related, if a relationship does exist.
Techniques Contd.
• Correlation tells you if there is an association between x and y, but it doesn't describe the relationship or allow you to predict one variable from the other.
• To do this we need REGRESSION!
• The aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x.
• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals.
Linear Regression Model

The relationship between the variables is a linear function:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where Yᵢ is the dependent (response) variable, Xᵢ is the independent (explanatory) variable, β₀ is the population Y-intercept, β₁ is the population slope, and εᵢ is the random error.

Types of Regression Models

Regression models are classified by the number of explanatory variables and by the form of the relationship:
• Simple: one explanatory variable (linear or non-linear).
• Multiple: two or more explanatory variables (linear or non-linear).
Simple Linear Regression
• One objective of simple linear regression is to predict a person's score on a dependent variable from knowledge of their score on an independent variable.
• It is also used to examine the degree of linear relationship between an independent variable and a dependent variable.

Population & Sample Regression Models

Population (unknown relationship):

Yᵢ = β₀ + β₁Xᵢ + εᵢ

A random sample is used to estimate the unknown parameters:

Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ

Model Specification Is Based on Theory
1. Theory of the field (e.g., epidemiology)
2. Mathematical theory
3. Previous research
4. 'Common sense'
Examples of Linear Regression
• Predict the “productivity” of factory workers based on their “Test of Assembly Speed” score.
• Predict the “GPA” of college students based on their “SAT” score.
• Examine the linear relationship between “blood cholesterol” and “fat intake”.
Prediction
• A perfect correlation between two variables produces a straight line when plotted in a bivariate scatterplot.
• In such a figure, every increase in the value of X is associated with an increase in Y, without any exceptions.
• If we wanted to predict values of Y based on a certain value of X, we would have no problem doing so with such a figure. For example, a value of 2 for X might be associated with a value of 10 on the Y variable, as read from the graph.
Error of Prediction: “Unexplained Variance”
• Usually, prediction won't be so perfect. Most often, not all the points will fall perfectly on the line; there will be some error in the prediction.
• For each value of X, we know the approximate value of Y but not the exact value.
Unexplained Variance
• We can look at how much each point falls off the line by drawing a short vertical line from the point to the regression line.
• If we wanted to summarize how much error in prediction we had overall, we could sum up the distances (or deviations) represented by all those little lines.
• The middle line is called the regression line.
The Regression Equation
• The regression equation is simply a mathematical equation for a line.
• It is the equation that describes the regression line. In algebra, we represent the equation for a line with something like this:

y = a + bx
Regression Line
• If we want to draw a line that passes perfectly through the middle of the points, we would choose the line with the smallest squared deviations from the points.
• This criterion for the best line is called the “Least Squares” criterion, or Ordinary Least Squares (OLS).
• We use the least squares criterion to pick the regression line. The regression line is sometimes called the “line of best fit” because it is the line that fits best when drawn through the points.
• It is the line that minimizes the distance of the actual scores from the predicted scores.

Linear Equations

Y = mX + b

where m = slope = (change in Y) / (change in X), and b = Y-intercept.
Linear Regression Contd.

To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line).

Model line: ŷ = ax + b, where a = slope and b = intercept
Residual: ε = y − ŷ
Sum of squares of residuals: Σ(y − ŷ)²

We must find the values of a and b that minimise Σ(y − ŷ)², as sketched below.
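A minimal sketch of this fit in Python, using the x, y data from the “Parameter Estimation Solution Table” later in these notes; np.polyfit performs exactly this least-squares minimisation:

```python
import numpy as np

# Data from the Parameter Estimation Solution Table (x = 1..5)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])

# Degree-1 polyfit returns the least-squares slope a and intercept b
a, b = np.polyfit(x, y, 1)

y_hat = a * x + b                 # predicted values on the fitted line
residuals = y - y_hat             # vertical distances from points to line
ss_res = np.sum(residuals ** 2)   # the quantity being minimised

print(f"a = {a:.2f}, b = {b:.2f}, sum of squared residuals = {ss_res:.2f}")
# a = 0.70, b = -0.10
```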
Best-fit Line

ŷ = ax + b

There are many ways of fitting a line to data. One such method is called the Least Squares method.
• This method produces the straight line that best fits the points on the scatter diagram.
• It does so by minimizing the sum of squared deviations between the observed values and the estimated line.
[Four scatterplots, each with the same fitted line, ŷ = 3 + 0.5x:]
• Data set A: moderate linear association; regression OK.
• Data set B: obvious nonlinear relationship; regression inappropriate.
• Data set C: one extreme outlier, requiring further examination.
• Data set D: only two values for x; a redesign is due here…

Only data set A is clearly suitable for linear regression. Data set C is problematic because the outlier is very suspicious (likely a typo or an experimental error).
Finding a

Now we find the value of a that gives the minimum sum of squares.
Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.
Finding b

First we find the value of b that gives the minimum sum of squares.
Trying different values of b is equivalent to shifting the line up and down the scatter plot.
Notation

ŷ is the predicted y value on the regression line:

ŷ = intercept + slope · x, i.e. ŷ = a + bx

The slope determines the direction of the line: it falls when slope < 0, is horizontal when slope = 0, and rises when slope > 0.

Not all calculators/software use this convention. Other notations include:

ŷ = ax + b
ŷ = b₀ + b₁x
ŷ = variable_name × x + constant

Coefficient Equations

• Prediction equation: ŷᵢ = β̂₀ + β̂₁xᵢ

• Sample slope:

β̂₁ = SS_xy / SS_xx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

• Sample Y-intercept:

β̂₀ = ȳ − β̂₁x̄
Figure 9.1: Vertical deviations from the regression line to the points of the scatter diagram.

The values of a and b that minimize the Error Sum of Squares are given by:

SS_xy = Σxy − (Σx)(Σy)/n   (sum of squares for xy)

SS_x = Σx² − (Σx)²/n   (sum of squares for x)

SS_y = Σy² − (Σy)²/n   (sum of squares for y)

where

b = SS_xy / SS_x
a = ȳ − b x̄
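These formulas translate directly into code. A minimal sketch, using the fertilizer/yield data from the class example below:

```python
import numpy as np

def least_squares_line(x, y):
    """Slope b and intercept a from the sums-of-squares formulas."""
    n = len(x)
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # SS_xy
    ss_x = np.sum(x ** 2) - np.sum(x) ** 2 / n          # SS_x
    b = ss_xy / ss_x                   # slope
    a = np.mean(y) - b * np.mean(x)    # intercept
    return a, b

# Fertilizer (lb.) and yield (lb.) from the class example below
x = np.array([4.0, 6.0, 10.0, 12.0])
y = np.array([3.0, 5.5, 6.5, 9.0])
a, b = least_squares_line(x, y)
print(f"y_hat = {a:.2f} + {b:.2f}x")   # y_hat = 0.80 + 0.65x
```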

Computation Table

Xᵢ     Yᵢ     Xᵢ²     Yᵢ²     XᵢYᵢ
X₁     Y₁     X₁²     Y₁²     X₁Y₁
X₂     Y₂     X₂²     Y₂²     X₂Y₂
:      :      :       :       :
Xₙ     Yₙ     Xₙ²     Yₙ²     XₙYₙ
ΣXᵢ    ΣYᵢ    ΣXᵢ²    ΣYᵢ²    ΣXᵢYᵢ

Parameter Estimation Solution Table

Xᵢ    Yᵢ    Xᵢ²    Yᵢ²    XᵢYᵢ
1     1     1      1      1
2     1     4      1      2
3     2     9      4      6
4     2     16     4      8
5     4     25     16     20
15    10    55     26     37

Parameter Estimation Solution Table (for the class example below)

Xᵢ    Yᵢ     Xᵢ²    Yᵢ²      XᵢYᵢ
4     3.0    16     9.00     12
6     5.5    36     30.25    33
10    6.5    100    42.25    65
12    9.0    144    81.00    108
32    24.0   296    162.50   218
Class Example

You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)    Yield (lb.)
4                   3.0
6                   5.5
10                  6.5
12                  9.0

(a) Find the least squares line relating crop yield and fertilizer.
(b) Interpret the coefficients.
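A worked sketch of the solution, using the column totals from the solution table above (Σx = 32, Σy = 24, Σx² = 296, Σxy = 218, n = 4):

(a)
SS_xy = 218 − (32)(24)/4 = 218 − 192 = 26
SS_x = 296 − (32)²/4 = 296 − 256 = 40
b = SS_xy / SS_x = 26/40 = 0.65
a = ȳ − b x̄ = 6 − 0.65(8) = 0.80
Least squares line: ŷ = 0.80 + 0.65x

(b) The slope 0.65 says that each additional pound of fertilizer is associated with a predicted increase of 0.65 lb. in crop yield. The intercept 0.80 is the predicted yield when no fertilizer is used; since x = 0 lies below the observed range, it serves mainly as the mathematical anchor of the line.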
Least squares regression line: minimising sums of squares

• We need to minimise Σ(y − ŷ)².
• Since ŷ = ax + b, we need to minimise the sum of squares:

S = Σ(y − ax − b)²

• If we plot the sum of squares S for all different values of a and b we get a parabola, because it is a squared term.
• So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.

[Figure: S plotted against the values of a and b; the minimum of S is where the gradient = 0.]
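This “gradient = 0” condition can be checked symbolically. A small sketch with SymPy, reusing the x = 1..5 data from the solution table:

```python
import sympy as sp

a, b = sp.symbols("a b")
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

# Sum of squared residuals S(a, b) = sum of (y - a*x - b)^2
S = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))

# The minimum is where both partial derivatives (the gradient) are zero
solution = sp.solve([sp.diff(S, a), sp.diff(S, b)], [a, b])
print(solution)   # {a: 7/10, b: -1/10}, i.e. y_hat = 0.7x - 0.1
```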
Interpretation

• The slope of the regression line describes how much we expect y to change, on average, for every unit change in x. In the class example, the slope 0.65 means each extra pound of fertilizer is expected to add 0.65 lb. of yield.
• The intercept is a necessary mathematical descriptor of the regression line. It does not describe a specific property of the data.
Plotting the least-squares regression line

Use the regression equation to find the value of ŷ for two distinct values of x, and draw the line that goes through those two points.

Hint: the regression line always passes through the point (x̄, ȳ), the means of x and y.

• The points used for drawing the regression line are derived from the equation.
• They are NOT actual points from the data set (except by pure coincidence).
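A minimal matplotlib sketch of this procedure, reusing the fertilizer line ŷ = 0.80 + 0.65x from the worked example; the final check confirms the line passes through (x̄, ȳ):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4.0, 6.0, 10.0, 12.0])   # fertilizer (lb.)
y = np.array([3.0, 5.5, 6.5, 9.0])     # yield (lb.)
a, b = 0.80, 0.65                      # intercept, slope from the worked example

# Two distinct x values and their predicted y values define the line
x_line = np.array([x.min(), x.max()])
y_line = a + b * x_line

plt.scatter(x, y, label="data")                        # always plot the raw data
plt.plot(x_line, y_line, label="y_hat = 0.80 + 0.65x")
plt.xlabel("Fertilizer (lb.)")
plt.ylabel("Yield (lb.)")
plt.legend()
plt.show()

# The regression line always passes through (x_bar, y_bar)
print(np.isclose(a + b * x.mean(), y.mean()))          # True
```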
Least-squares regression is only for linear associations

• Don't compute the regression line until you have confirmed that there is a linear relationship between x and y.
• ALWAYS PLOT THE RAW DATA.
• These data sets all give a linear regression equation of about ŷ = 3 + 0.5x, but don't report that until you have plotted the data.
The coefficient of determination, r²

[Figure: scatterplot showing the deviations ŷᵢ − ȳ (explained) and yᵢ − ȳ (total).]

• r², the coefficient of determination, is the square of the correlation coefficient.
• r² represents the fraction of the variance in y that can be explained by the regression model.

Examples:
• r = 0.87, so r² = 0.76: this model explains 76% of individual variations in BAC.
• r = −0.3, so r² = 0.09, or 9%: the regression model explains less than 10% of the variations in y.
• r = −0.7, so r² = 0.49, or 49%: the regression model explains nearly half of the variations in y.
• r = −0.99, so r² = 0.9801, or ~98%: the regression model explains almost all of the variations in y.
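A quick way to compute r and r² in Python (a sketch; the data here are the x = 1..5 pairs from the solution table, not the BAC data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
r2 = r ** 2                   # coefficient of determination
print(f"r = {r:.2f}, r^2 = {r2:.2f}")   # fraction of variance in y explained
```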
Residuals

The vertical distances from each point to the least-squares regression line are called residuals. The sum of all the residuals is by definition 0.

• Outliers have unusually large residuals (in absolute value).
• Points above the line have a positive residual (the line under-estimates them).
• Points below the line have a negative residual (the line over-estimates them).
residual = observed y − predicted ŷ, i.e. residual = y − ŷ
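A short check of the “residuals sum to zero” property, using the fertilizer data (a sketch; np.polyfit returns the slope first, then the intercept):

```python
import numpy as np

x = np.array([4.0, 6.0, 10.0, 12.0])
y = np.array([3.0, 5.5, 6.5, 9.0])
b, a = np.polyfit(x, y, 1)      # slope b, intercept a

residuals = y - (a + b * x)     # observed y minus predicted y_hat
print(residuals)                # [-0.4, 0.8, -0.8, 0.4]
print(np.isclose(residuals.sum(), 0.0))   # True: residuals sum to 0
```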
Example: Powerboat registrations and manatee deaths

[Scatterplot of manatee deaths against powerboats (×1000), with the fitted line y = 0.1301x − 43.7 and R² = 0.9061.]

Powerboats (×1000)   Manatee deaths
447     13
460     21
481     24
498     16
513     24
512     20
526     15
559     34
585     33
614     33
645     39
675     43
711     50
719     47
681     55
679     38
678     35
696     49
713     42
732     60
755     54
809     66
830     82
880     78
944     81
962     95
978     73
983     69
1010    79
1024    92

The least-squares regression line is: ŷ = 0.1301x − 43.7

If Florida were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths in a year?

ŷ = 0.1301(500) − 43.7 = 65.05 − 43.7 = 21.35

→ Roughly 21 manatee deaths.

Could we use this regression line to predict the number of manatee deaths for a year with 200,000 powerboat registrations?
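Plugging x = 200 into the fitted line shows why not; a quick sketch:

```python
def predicted_deaths(powerboats_thousands):
    """Fitted least-squares line from the manatee data above."""
    return 0.1301 * powerboats_thousands - 43.7

print(predicted_deaths(500))   # 21.35 -> roughly 21 deaths (within the data range)
print(predicted_deaths(200))   # -17.68 -> an impossible negative count:
                               # x = 200 lies far outside the observed range
                               # (about 447 to 1024), so the line should not
                               # be extrapolated there
```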
General Linear Model

Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

y = ax + b + ε

A General Linear Model is just any model that describes the data in terms of a straight line.
Multiple regression

Multiple regression is used to determine the effect of a number of independent variables, x₁, x₂, x₃, etc., on a single dependent variable, y.

The different x variables are combined in a linear way and each has its own regression coefficient:

y = a₁x₁ + a₂x₂ + … + aₙxₙ + b + ε

The a parameters reflect the independent contribution of each independent variable x to the value of the dependent variable y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for. A sketch of such a fit follows.
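A minimal NumPy sketch of a two-predictor fit (the data here are hypothetical, made up purely for illustration):

```python
import numpy as np

# Hypothetical data: two independent variables, one dependent variable
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 4.9, 9.2, 9.8, 13.1])

# Design matrix with a column of ones so the intercept b is estimated too
X = np.column_stack([x1, x2, np.ones_like(x1)])

# Least-squares solution of y = a1*x1 + a2*x2 + b + error
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coeffs
print(f"a1 = {a1:.2f}, a2 = {a2:.2f}, b = {b:.2f}")
```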
Calculating SSR

[Figure: scatterplot of the dependent variable against the independent variable (x), with a horizontal line at the mean ȳ.]

The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of y.
Regression Formulas

The Total Sum of Squares (SST) is equal to SSR + SSE. Mathematically:

SSR = Σ(ŷ − ȳ)²   (measure of explained variation)

SSE = Σ(y − ŷ)²   (measure of unexplained variation)

SST = SSR + SSE = Σ(y − ȳ)²   (measure of total variation in y)
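These identities are easy to verify numerically. A sketch with the fertilizer data:

```python
import numpy as np

x = np.array([4.0, 6.0, 10.0, 12.0])
y = np.array([3.0, 5.5, 6.5, 9.0])
b, a = np.polyfit(x, y, 1)     # slope, intercept
y_hat = a + b * x

ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)       # total variation

print(np.isclose(sst, ssr + sse))       # True: SST = SSR + SSE
print(f"R^2 = {ssr / sst:.3f}")         # coefficient of determination, ~0.914
```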
The Coefficient of Determination

The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R²:

R² = SSR / SST = SSR / (SSR + SSE)

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model is. It is often expressed as a percentage.
Standard Error of Regression

• The Standard Error of a regression is a measure of its variability. It can be used in a similar manner to standard deviation, allowing for prediction intervals.
• ŷ ± 2 standard errors will provide approximately 95% accuracy, and ± 3 standard errors a roughly 99% confidence interval.
• The Standard Error is calculated by taking the square root of the average prediction error:

Standard Error = √( SSE / (n − k) )

where n is the number of observations in the sample and k is the total number of variables in the model.
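Continuing the fertilizer sketch above (SSE = 1.6, n = 4, and k = 2 under the slide's definition, counting slope and intercept):

```python
import numpy as np

sse, n, k = 1.6, 4, 2                     # from the SSR/SSE sketch above
standard_error = np.sqrt(sse / (n - k))   # sqrt of average prediction error
print(f"standard error = {standard_error:.3f}")   # ~0.894
```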
The output of a simple regression is the coefficient β and the constant A. The equation is then:

y = A + βx + ε

where ε is the residual error.

β is the per-unit change in the dependent variable for each unit change in the independent variable. Mathematically:

β = Δy / Δx
Multiple Linear Regression

More than one independent variable can be used to explain variance in the dependent variable, as long as they are not linearly related.

A multiple regression takes the form:

y = A + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where k is the number of variables, or parameters.

You might also like