
Simple Linear Regression and Correlation
Introduction

• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
Using Statistics

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]

This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

 Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
 The scatter of points tends to be distributed around a positively sloped straight line.
 The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
 The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
 The line represents the nature of the relationship on average.
Examples of Other Scatterplots

[Figure: six scatterplots of Y against X illustrating other possible patterns of association]
Simple Linear Regression Model

 The equation that describes how y is related to x and an error term is called the regression model.
 The simple linear regression model is:

    y = a + bx + ε

where:
a and b are called parameters of the model; a is the intercept and b is the slope.
ε is a random variable called the error term.
Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The errors εi are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations.
• That is: ε ~ N(0, σ²)

[Figure: the regression line E[Y] = β0 + β1X, with identical normal distributions of errors, all centered on the regression line]
Errors in Regression

[Figure: the fitted regression line Ŷ = a + bX plotted through the data; Yi is the observed data point, Ŷi is the predicted value of Y for Xi, and the error is ei = Yi − Ŷi]
SIMPLE REGRESSION AND CORRELATION

Estimating Using the Regression Line

First, let's look at the equation of a straight line:

    Y = a + bX

where Y is the dependent variable, X is the independent variable, a is the Y-intercept, and b is the slope of the line.
SIMPLE REGRESSION AND CORRELATION

The Method of Least Squares

To estimate the straight line we have to use the least squares method. This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points.
SIMPLE REGRESSION AND CORRELATION

The estimating line: Ŷ = a + bX

Slope of the best-fitting regression line:

    b = (nΣXY − ΣX ΣY) / (nΣX² − (ΣX)²)

Y-intercept of the best-fitting regression line:

    a = Ȳ − bX̄
SIMPLE REGRESSION - EXAMPLE

Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below.
(File PPT_Regr_example.sav)

Month   Advertising Exp. ($100s)   Sales Rev. ($1000s)
1       1                          1
2       2                          1
3       3                          2
4       4                          2
5       5                          4
SIMPLE REGRESSION - EXAMPLE

X    Y    X²    XY
1    1    1     1
2    1    4     2
3    2    9     6
4    2    16    8
5    4    25    20

ΣX = 15    ΣY = 10    ΣX² = 55    ΣXY = 37

X̄ = 15/5 = 3    Ȳ = 10/5 = 2
SIMPLE REGRESSION - EXAMPLE

n XY   X  Y
b b = 0.7
n X    X 
2 2

a  Y  bX
a  2  0.7  3  0.1

Ŷ  0.1  0.7 X
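The slope and intercept computed above can be cross-checked with a short Python sketch (the slides' own analysis was done in SPSS; this standalone snippet only mirrors the hand computation from the raw sums):

```python
# Least-squares slope and intercept for the five-month advertising example.
x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# b = (nΣXY − ΣX ΣY) / (nΣX² − (ΣX)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = Ȳ − bX̄
a = sum_y / n - b * sum_x / n

print(round(b, 4), round(a, 4))  # 0.7 -0.1
```

The same sums (ΣX = 15, ΣY = 10, ΣX² = 55, ΣXY = 37) appear in the worked table, so the script reproduces b = 0.7 and a = −0.1.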
Standard Error of Estimate

The standard error of estimate is used to measure the reliability of the estimating equation.

It measures the variability or scatter of the observed values around the regression line.
Standard Error of Estimate

    se = √[ Σ(Y − Ŷ)² / (n − 2) ]

Short-cut:

    se = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ]
Standard Error of Estimate

From the table, the Y² values are 1, 1, 4, 4, 16, so ΣY² = 26.

    se = √[ (26 − (−0.1)(10) − 0.7(37)) / (5 − 2) ] = √(1.1/3) = 0.6055
Correlation Analysis

Correlation analysis is used to describe the degree to which one variable is linearly related to another.

There are two measures for describing correlation:

1. The Coefficient of Correlation
2. The Coefficient of Determination
Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by ρ, can take on any value from −1 to 1.

ρ = −1       indicates a perfect negative linear relationship
−1 < ρ < 0   indicates a negative linear relationship
ρ = 0        indicates no linear relationship
0 < ρ < 1    indicates a positive linear relationship
ρ = 1        indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation

[Figure: six scatterplots of Y against X illustrating ρ = −1, ρ = 0, ρ = 1 (top row) and ρ = −.8, ρ = 0, ρ = .8 (bottom row)]
The coefficient of correlation:

    r = (nΣxy − Σx Σy) / √[ (nΣx² − (Σx)²)(nΣy² − (Σy)²) ]

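This raw-sum formula for r can be applied directly to the example data (a Python sketch; the slides compute r later from r², but the direct route gives the same number):

```python
# Pearson correlation coefficient from the raw-sum formula:
# r = (nΣxy − Σx Σy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²))
import math

x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))
print(round(r, 4))  # 0.9037
```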
Sample Coefficient of Determination r²

Alternate formula:

    r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
Sample Coefficient of Determination

    r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
       = [(−0.1)(10) + 0.7(37) − 5(2)²] / [26 − 5(2)²]
       = 4.9 / 6 = 0.8167
Interpretation:
We can conclude that 81.67% of the total variation in sales revenue is explained by the variation in advertising expenditure.
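The alternate formula for r² can be checked numerically (a Python sketch, plugging in the fitted a and b and the sums from the worked example):

```python
# Coefficient of determination via the alternate formula:
# r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
a, b = -0.1, 0.7
n, ybar = 5, 2.0
sum_y, sum_xy, sum_y2 = 10, 37, 26

r2 = (a * sum_y + b * sum_xy - n * ybar**2) / (sum_y2 - n * ybar**2)
print(round(r2, 4))  # 0.8167
```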
The Coefficient of Correlation or
Karl Pearson’s Coefficient of Correlation

The coefficient of correlation is the square root of the coefficient of determination.

The sign of r indicates the direction of the relationship between the two variables X and Y.

The sign of r will be the same as the sign of the coefficient "b" in the regression equation Y = a + bX.
SIMPLE REGRESSION AND CORRELATION

If the slope of the estimating line is positive, r is the positive square root; if the slope of the estimating line is negative, r is the negative square root.

    r = √r² = √0.8167 = 0.9037

The relationship between the two variables is direct.
Hypothesis Tests for the Correlation Coefficient

H0: ρ = 0 (No linear relationship)
H1: ρ ≠ 0 (Some linear relationship)

Test statistic:

    t(n−2) = r / √[ (1 − r²) / (n − 2) ]
Analysis-of-Variance Table and
an F Test of the Regression Model
H0 : The regression model is not significant
H1 : The regression model is significant

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST
Testing for the existence of linear relationship

 We pose the question: Is the independent variable linearly related to the dependent variable?

 To answer the question we test the hypothesis

    H0: b = 0
    H1: b is not equal to zero.

 If b is not equal to zero, the model has some validity.

Test statistic, with n − 2 degrees of freedom:

    t = b / sb
Correlations

                                        Advertising      Sales
                                        expenses ($00)   revenue ($000)
Advertising      Pearson Correlation    1                .904*
expenses ($00)   Sig. (2-tailed)                         .035
                 N                      5                5
Sales revenue    Pearson Correlation    .904*            1
($000)           Sig. (2-tailed)        .035
                 N                      5                5
*. Correlation is significant at the 0.05 level (2-tailed).
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .904a   .817       .756                .606
a. Predictors: (Constant), Advertising expenses ($00)
ANOVAb

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  4.900            1    4.900         13.364   .035a
   Residual    1.100            3    .367
   Total       6.000            4
a. Predictors: (Constant), Advertising expenses ($00)
b. Dependent Variable: Sales revenue ($000)

Alternately, R² = 1 − [SS(Residual) / SS(Total)] = 1 − (1.1/6.0) = 0.817
When adjusted for degrees of freedom,
Adjusted R² = 1 − [SS(Residual)/(n − k − 1)] / [SS(Total)/(n − 1)] = 1 − [1.1/3]/[6/4] = 0.756
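The R² and adjusted R² arithmetic from the ANOVA sums of squares can be verified with a short Python sketch (using the SS values reported in the SPSS output):

```python
# R² and adjusted R² from the ANOVA sums of squares.
ss_residual, ss_total = 1.1, 6.0
n, k = 5, 1   # 5 observations, 1 predictor

r2 = 1 - ss_residual / ss_total
adj_r2 = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))
print(round(r2, 3), round(adj_r2, 3))  # 0.817 0.756
```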
Coefficientsa

                               Unstandardized       Standardized
                               Coefficients         Coefficients
Model                          B       Std. Error   Beta           t       Sig.
1  (Constant)                  -.100   .635                        -.157   .885
   Advertising expenses ($00)  .700    .191         .904           3.656   .035
a. Dependent Variable: Sales revenue ($000)

Ŷ = −0.1 + 0.7X
Test Statistic: F = MSR / MSE

Value of the test statistic: F = 13.364
The p-value is 0.035

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. β is not equal to zero; thus, the independent variable is linearly related to y. This linear regression model is valid.

Test statistic, with n − 2 degrees of freedom:

    t = b / sb

Rejection region: |t| > t0.025,3 = 3.182

Value of the test statistic: t = 0.7 / 0.191 = 3.66

Conclusion:
The calculated test statistic is 3.66, which is outside the acceptance region. Alternately, the actual significance is 0.035. Therefore we reject the null hypothesis. Advertising expense is a significant explanatory variable.
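The slope test can be reproduced end to end (a Python sketch; sb = se/√(ΣX² − nX̄²) is the standard formula for the slope's standard error, consistent with the .191 in the SPSS coefficients table):

```python
# t test for the slope: t = b / sb, where sb = se / sqrt(ΣX² − n·X̄²).
import math

b, se = 0.7, 0.6055
sum_x2, n, xbar = 55, 5, 3.0

s_b = se / math.sqrt(sum_x2 - n * xbar**2)   # standard error of the slope
t = b / s_b
print(round(s_b, 3), round(t, 2))  # 0.191 3.66
```

Since 3.66 exceeds the critical value 3.182, this confirms the rejection of H0: b = 0.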
