
Simple Linear Regression and Correlation
Introduction

• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
Using Statistics

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]

This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

 Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
 The scatter of points tends to be distributed around a positively sloped straight line.
 The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
 The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
 The line represents the nature of the relationship on average.
Examples of Other Scatterplots

[Figure: six scatterplots of Y against X illustrating other possible patterns of association]
Simple Linear Regression Model

 The equation that describes how y is related to x and an error term is called the regression model.
 The simple linear regression model is:

    y = a + bx + ε

where:
a and b are called parameters of the model; a is the intercept and b is the slope.
ε is a random variable called the error term.
Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The errors εi are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations.
• That is: ε ~ N(0, σ²)

[Figure: the regression line E[Y] = β0 + β1X, with identical normal distributions of errors, all centered on the regression line]
Errors in Regression

[Figure: the fitted regression line Ŷ = a + bX plotted through the data; Yi is the observed data point, Ŷi is the predicted value of Y for Xi, and the error is ei = Yi − Ŷi]
SIMPLE REGRESSION AND CORRELATION

Estimating Using the Regression Line

First, let's look at the equation of a straight line:

    Y = a + bX

where Y is the dependent variable, X is the independent variable, a is the Y-intercept, and b is the slope of the line.
SIMPLE REGRESSION AND CORRELATION

The Method of Least Squares

To estimate the straight line we have to use the least squares method. This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points.
SIMPLE REGRESSION AND CORRELATION

The estimating line: Ŷ = a + bX

Slope of the best-fitting regression line:

    b = (nΣXY − ΣX ΣY) / (nΣX² − (ΣX)²)

Y-intercept of the best-fitting regression line:

    a = Ȳ − bX̄
SIMPLE REGRESSION - EXAMPLE

Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below.
(File PPT_Regr_example.sav)

Month   Advertising Exp. ($100s)   Sales Rev. ($1000s)
1       1                          1
2       2                          1
3       3                          2
4       4                          2
5       5                          4
SIMPLE REGRESSION - EXAMPLE

X    Y    X²    XY
1    1    1     1
2    1    4     2
3    2    9     6
4    2    16    8
5    4    25    20

ΣX = 15    ΣY = 10    ΣX² = 55    ΣXY = 37

X̄ = 15/5 = 3    Ȳ = 10/5 = 2
SIMPLE REGRESSION - EXAMPLE

n XY   X  Y
b b = 0.7
n X    X 
2 2

a  Y  bX
a  2  0.7  3  0.1

Ŷ  0.1  0.7 X
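The slope and intercept computed above can be cross-checked with a short Python sketch (the slides' own analysis was done in SPSS; this standalone snippet only mirrors the hand computation from the raw sums):

```python
# Least-squares slope and intercept for the five-month advertising example.
x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# b = (nΣXY − ΣX ΣY) / (nΣX² − (ΣX)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = Ȳ − bX̄
a = sum_y / n - b * sum_x / n

print(round(b, 4), round(a, 4))  # 0.7 -0.1
```

The same sums (ΣX = 15, ΣY = 10, ΣX² = 55, ΣXY = 37) appear in the worked table, so the script reproduces b = 0.7 and a = −0.1.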
Standard Error of Estimate

The standard error of estimate is used to measure the reliability of the estimating equation.

It measures the variability or scatter of the observed values around the regression line.
Standard Error of Estimate

    se = √[ Σ(Y − Ŷ)² / (n − 2) ]

Short-cut:

    se = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ]
Standard Error of Estimate

From the table, the Y² values are 1, 1, 4, 4, 16, so ΣY² = 26.

    se = √[ (26 − (−0.1)(10) − 0.7(37)) / (5 − 2) ] = √(1.1/3) = 0.6055
Correlation Analysis

Correlation analysis is used to describe the degree to which one variable is linearly related to another.

There are two measures for describing correlation:

1. The Coefficient of Correlation
2. The Coefficient of Determination
Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by ρ, can take on any value from −1 to 1.

ρ = −1       indicates a perfect negative linear relationship
−1 < ρ < 0   indicates a negative linear relationship
ρ = 0        indicates no linear relationship
0 < ρ < 1    indicates a positive linear relationship
ρ = 1        indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation

[Figure: six scatterplots of Y against X illustrating ρ = −1, ρ = 0, ρ = 1 (top row) and ρ = −.8, ρ = 0, ρ = .8 (bottom row)]
The coefficient of correlation:

    r = (nΣxy − Σx Σy) / √[ (nΣx² − (Σx)²)(nΣy² − (Σy)²) ]

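This raw-sum formula for r can be applied directly to the example data (a Python sketch; the slides compute r later from r², but the direct route gives the same number):

```python
# Pearson correlation coefficient from the raw-sum formula:
# r = (nΣxy − Σx Σy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²))
import math

x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))
print(round(r, 4))  # 0.9037
```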
Sample Coefficient of Determination r²

Alternate formula:

    r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
Sample Coefficient of Determination

    r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
       = [(−0.1)(10) + 0.7(37) − 5(2)²] / [26 − 5(2)²]
       = 4.9 / 6 = 0.8167
Interpretation:
We can conclude that 81.67% of the total variation in sales revenue is explained by the variation in advertising expenditure.
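The alternate formula for r² can be checked numerically (a Python sketch, plugging in the fitted a and b and the sums from the worked example):

```python
# Coefficient of determination via the alternate formula:
# r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
a, b = -0.1, 0.7
n, ybar = 5, 2.0
sum_y, sum_xy, sum_y2 = 10, 37, 26

r2 = (a * sum_y + b * sum_xy - n * ybar**2) / (sum_y2 - n * ybar**2)
print(round(r2, 4))  # 0.8167
```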
The Coefficient of Correlation or
Karl Pearson’s Coefficient of Correlation

The coefficient of correlation is the square root of the coefficient of determination.

The sign of r indicates the direction of the relationship between the two variables X and Y.

The sign of r will be the same as the sign of the coefficient "b" in the regression equation Y = a + bX.
SIMPLE REGRESSION AND CORRELATION

If the slope of the estimating line is positive, r is the positive square root; if the slope of the estimating line is negative, r is the negative square root.

    r = √r² = √0.8167 = 0.9037

The relationship between the two variables is direct.
Hypothesis Tests for the Correlation Coefficient

H0: ρ = 0 (No linear relationship)
H1: ρ ≠ 0 (Some linear relationship)

Test statistic:

    t(n−2) = r / √[ (1 − r²) / (n − 2) ]
Analysis-of-Variance Table and
an F Test of the Regression Model
H0 : The regression model is not significant
H1 : The regression model is significant

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST
Testing for the existence of linear relationship

 We pose the question: Is the independent variable linearly related to the dependent variable?

 To answer the question we test the hypothesis

    H0: b = 0
    H1: b is not equal to zero.

 If b is not equal to zero, the model has some validity.

Test statistic, with n − 2 degrees of freedom:

    t = b / sb
Correlations

                                        Advertising      Sales
                                        expenses ($00)   revenue ($000)
Advertising      Pearson Correlation    1                .904*
expenses ($00)   Sig. (2-tailed)                         .035
                 N                      5                5
Sales revenue    Pearson Correlation    .904*            1
($000)           Sig. (2-tailed)        .035
                 N                      5                5
*. Correlation is significant at the 0.05 level (2-tailed).
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .904a   .817       .756                .606
a. Predictors: (Constant), Advertising expenses ($00)
ANOVAb

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  4.900            1    4.900         13.364   .035a
   Residual    1.100            3    .367
   Total       6.000            4
a. Predictors: (Constant), Advertising expenses ($00)
b. Dependent Variable: Sales revenue ($000)

Alternately, R² = 1 − [SS(Residual) / SS(Total)] = 1 − (1.1/6.0) = 0.817
When adjusted for degrees of freedom,
Adjusted R² = 1 − [SS(Residual)/(n − k − 1)] / [SS(Total)/(n − 1)] = 1 − [1.1/3]/[6/4] = 0.756
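The R² and adjusted R² arithmetic from the ANOVA sums of squares can be verified with a short Python sketch (using the SS values reported in the SPSS output):

```python
# R² and adjusted R² from the ANOVA sums of squares.
ss_residual, ss_total = 1.1, 6.0
n, k = 5, 1   # 5 observations, 1 predictor

r2 = 1 - ss_residual / ss_total
adj_r2 = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))
print(round(r2, 3), round(adj_r2, 3))  # 0.817 0.756
```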
Coefficientsa

                               Unstandardized       Standardized
                               Coefficients         Coefficients
Model                          B       Std. Error   Beta           t       Sig.
1  (Constant)                  -.100   .635                        -.157   .885
   Advertising expenses ($00)  .700    .191         .904           3.656   .035
a. Dependent Variable: Sales revenue ($000)

Ŷ = −0.1 + 0.7X
Test Statistic: F = MSR / MSE

Value of the test statistic: F = 13.364
The p-value is 0.035

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. β is not equal to zero; thus, the independent variable is linearly related to y. This linear regression model is valid.

Test statistic, with n − 2 degrees of freedom:

    t = b / sb

Rejection region: |t| > t0.025,3 = 3.182

Value of the test statistic: t = 0.7 / 0.191 = 3.66

Conclusion:
The calculated test statistic is 3.66, which is outside the acceptance region. Alternately, the actual significance is 0.035. Therefore we reject the null hypothesis. Advertising expense is a significant explanatory variable.
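The slope test can be reproduced end to end (a Python sketch; sb = se/√(ΣX² − nX̄²) is the standard formula for the slope's standard error, consistent with the .191 in the SPSS coefficients table):

```python
# t test for the slope: t = b / sb, where sb = se / sqrt(ΣX² − n·X̄²).
import math

b, se = 0.7, 0.6055
sum_x2, n, xbar = 55, 5, 3.0

s_b = se / math.sqrt(sum_x2 - n * xbar**2)   # standard error of the slope
t = b / s_b
print(round(s_b, 3), round(t, 2))  # 0.191 3.66
```

Since 3.66 exceeds the critical value 3.182, this confirms the rejection of H0: b = 0.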
