Lecture Objectives
1. Understand Linear Regression
2. Interpret the Coefficients
Basic Points
Determine whether a relationship exists;
Describe the relationship with an equation; and
Use the equation to make predictions.
In industry and business one often encounters problems that require the
analysis of more than one variable.
In such cases it may be necessary to determine whether relationships
between the variables exist and, if so, to analyse them appropriately.
Examples:
Is there a relationship between income and expenditure?
Does the amount spent on advertising have influence on the volume of
business?
Is there a relationship between the inflation rate and the interest rate
charged by the banks?
Analytical Techniques
There are two techniques used to estimate a relationship
that may exist between variables:
1. Regression analysis involves a mathematical equation
that explains how the variables are related. The equation is
also used to predict future values of one variable based on
the data of the other variable.
2. Correlation analysis determines the strength and
direction of the relationship between variables.
N.B.:
A scatter diagram gives a rough idea of how the variables
X and Y may be related, if a relationship exists.
Techniques Contd’
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where $Y_i$ is the dependent (response) variable and $X_i$ is the independent (explanatory) variable.
[Figure: regression models classified as linear or non-linear.]
Simple Linear Regression
Unknown relationship: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
We can see how far each point falls from the line by drawing a short
vertical line from the point to the line.
To summarize the overall error in prediction, we could sum the distances
(or deviations) represented by all those short lines.
The line through the middle of the points is called the regression line.
The Regression Equation
Linear Equations
Y = mX + b
m = slope = (change in Y) / (change in X)
b = Y-intercept
Linear Regression Contd’
To find the best line we must minimise the sum of the squares of the
residuals (the vertical distances from the data points to our line).
Residual (ε) = y - ŷ
ŷ = ax + b, where a is the slope and b is the intercept.
There are many ways of fitting a line to data. One such method is called the
Least Squares method.
This method produces a straight line that best fits the points on the scatter
diagram.
This is achieved by minimizing the sum of squared deviations between the
observed values and the estimated line.
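As a minimal sketch of the least-squares idea, the Python snippet below computes the sum of squared residuals for a candidate line and compares two candidates. The data values and the function name `sum_squared_residuals` are illustrative assumptions, not part of the lecture.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(xs, ys))


# For this data the least-squares line has slope 0.6 and intercept 2.2.
best = sum_squared_residuals(0.6, 2.2, x, y)

# Any other line, for example a slightly steeper one, gives a larger sum.
steeper = sum_squared_residuals(0.7, 2.2, x, y)

print(f"least-squares line: {best:.2f}")    # 2.40
print(f"steeper line:       {steeper:.2f}")  # 2.95
```

The line with the smaller sum fits the points better in the least-squares sense.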
[Figure: scatter plots of several data sets, each fitted with the same line ŷ = 3 + 0.5x. Panel captions include "Moderate linear association; regression OK." and "Obvious nonlinear relationship; regression inappropriate."]
Only data set A is clearly suitable for linear regression. Data set C is problematic
because the outlier is very suspicious (likely a typo or an experimental error).
Finding b
First we find the value of b that gives the minimum sum of squares.
Finding a
Now we find the value of a that gives the minimum sum of squares.
Trying out different values of a is equivalent to changing the slope of the line, while
b stays constant.
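Coefficient Equations
Prediction equation:
$$\hat{y} = \text{intercept} + \text{slope} \cdot x = a + bx$$
In this form a is the intercept and b is the slope.
where
$$b = \frac{SS_{xy}}{SS_x}, \qquad a = \bar{y} - b\bar{x}$$
and
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
(Sum of squares for xy)
$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$$
(Sum of squares for x)
$$SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$$
(Sum of squares for y)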
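The coefficient equations can be applied step by step, as in the following sketch. It assumes the convention ŷ = a + bx (a is the intercept, b is the slope); the five data points are hypothetical and serve only to make the arithmetic easy to follow.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Column totals, as in a computation table.
sum_x = sum(x)                                  # 15
sum_y = sum(y)                                  # 20
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 66
sum_x2 = sum(xi ** 2 for xi in x)               # 55

# Sums of squares.
ss_xy = sum_xy - sum_x * sum_y / n   # 66 - 15*20/5 = 6
ss_x = sum_x2 - sum_x ** 2 / n       # 55 - 225/5 = 10

# Slope and intercept of the prediction equation y-hat = a + b*x.
b = ss_xy / ss_x                     # slope
a = sum_y / n - b * sum_x / n        # intercept: mean(y) - b * mean(x)

print(f"y-hat = {a:.1f} + {b:.1f}x")  # y-hat = 2.2 + 0.6x
```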
Computation Table
Xi      Yi      Xi²      Yi²      XiYi
X1      Y1      X1²      Y1²      X1Y1
X2      Y2      X2²      Y2²      X2Y2
:       :       :        :        :
Xn      Yn      Xn²      Yn²      XnYn
ΣXi     ΣYi     ΣXi²     ΣYi²     ΣXiYi
Don’t compute the regression line until you have confirmed that
there is a linear relationship between x and y.
ALWAYS PLOT THE RAW DATA
r = 0.87, so r² = 0.76
This model explains 76% of individual variations in BAC.
r = -0.3, so r² = 0.09, or 9%
The regression model explains not even 10% of the variations
in y.
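The link between r and r² can be checked numerically. The sketch below assumes the standard formula r = SS_xy / sqrt(SS_x · SS_y), which the lecture does not state but which uses only the sums of squares from the coefficient equations. The data are hypothetical.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Sums of squares, as defined in the coefficient equations.
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
ss_y = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n

# Correlation coefficient (assumed standard formula) and its square.
r = ss_xy / math.sqrt(ss_x * ss_y)
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")  # r = 0.77, r^2 = 0.60
```

Here the fitted line would explain about 60% of the variation in y.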
Manatee deaths

Powerboats (x1000)    Manatee deaths
512                   20
526                   15
559                   34
585                   33
614                   33
645                   39
675                   43
711                   50
719                   47
681                   55
679                   38
678                   35
696                   49
713                   42
732                   60
755                   54
809                   66
830                   82
880                   78
944                   81
962                   95
978                   73
983                   69
1010                  79
1024                  92

[Figure: scatter plot of manatee deaths against powerboats (x1000), with the least-squares regression line.]

The least-squares regression line is:
ŷ = 0.1301x - 43.7

If Florida were to limit the number of powerboat registrations to 500,000,
what could we expect for the number of manatee deaths in a year?
ŷ = 0.1301(500) - 43.7 = 65.05 - 43.7 = 21.35
Roughly 21 manatee deaths.
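The prediction can be reproduced with a few lines of Python. This sketch takes the slope and intercept quoted in the lecture as given rather than refitting the data; the function name `predict_manatee_deaths` is a hypothetical choice.

```python
# Coefficients of the least-squares line as quoted in the lecture.
SLOPE = 0.1301       # extra manatee deaths per 1000 additional powerboats
INTERCEPT = -43.7


def predict_manatee_deaths(powerboats_thousands):
    """Predicted manatee deaths for a number of powerboats given in thousands."""
    return SLOPE * powerboats_thousands + INTERCEPT


# 500,000 registrations corresponds to x = 500, because x is in thousands.
prediction = predict_manatee_deaths(500)
print(f"{prediction:.2f}")  # 21.35
```

The input must be in thousands of boats, matching the units of x in the fitted equation.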
Mathematically,
$$SSR = \sum (\hat{y} - \bar{y})^2$$
(measure of explained variation)
$$SSE = \sum (y - \hat{y})^2$$
(measure of unexplained variation)
$$SST = SSR + SSE = \sum (y - \bar{y})^2$$
(measure of total variation in y)
The Coefficient of Determination
$$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$$
The value of R² ranges between 0 and 1. The higher its value, the larger the share
of the variation in y that the regression model explains. It is often expressed as a percentage.
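A short numerical check of the decomposition and of R² is sketched below. The data are hypothetical, and the slope and intercept are the least-squares values for that data; the identity SST = SSR + SSE is only expected to hold for a least-squares fit.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares coefficients for this particular data set.
intercept, slope = 2.2, 0.6

y_mean = sum(y) / len(y)
y_hat = [intercept + slope * xi for xi in x]

ssr = sum((yh - y_mean) ** 2 for yh in y_hat)         # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
sst = sum((yi - y_mean) ** 2 for yi in y)              # total variation

r_squared = ssr / sst

print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")  # SSR = 3.6, SSE = 2.4, SST = 6.0
print(f"R^2 = {r_squared:.2f}")                              # R^2 = 0.60
```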
Standard Error of Regression
$$\text{Standard Error} = \sqrt{\frac{SSE}{n - k}}$$
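The following sketch evaluates the standard error for a small hypothetical data set. The lecture does not define k; the code assumes k is the number of estimated coefficients, which is 2 for simple linear regression (intercept and slope). That reading is an assumption and should be checked against the course's own definition.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares coefficients for this particular data set.
intercept, slope = 2.2, 0.6

# Unexplained variation: sum of squared residuals.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

n = len(x)
k = 2  # ASSUMPTION: number of estimated coefficients (intercept and slope)

standard_error = math.sqrt(sse / (n - k))
print(f"{standard_error:.3f}")  # 0.894
```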
$$y = A + \beta x + \varepsilon$$
β is the change in the dependent variable for each unit
change in the independent variable. Mathematically:
$$\beta = \frac{\Delta y}{\Delta x}$$
Multiple Linear Regression
More than one independent variable can be used to explain variance in the
dependent variable, as long as the independent variables are not linearly related to one another.
$$y = A + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
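One way to estimate A and the β coefficients is to solve the least-squares normal equations. The sketch below does this in plain Python for two predictors. The data are hypothetical and constructed so that y = 1 + 2·X1 + 3·X2 exactly, and the helper names `solve_linear_system` and `fit_multiple_regression` are illustrative assumptions, not a method prescribed by the lecture.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coef = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly related; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (aug[i][n] - known) / aug[i][i]
    return coef


def fit_multiple_regression(rows, ys):
    """Return [A, beta_1, ..., beta_k] from the normal equations (X'X) coef = X'y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept A
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical observations of (X1, X2); X2 is not a linear function of X1.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coef = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coef])  # [1.0, 2.0, 3.0]
```

If one predictor is an exact linear function of another, the normal equations have no unique solution and the sketch raises a `ValueError`, which reflects the requirement that the independent variables not be linearly related.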