BFC 34303 Chapter 7: Simple Linear Regression and Correlation
Correlation Analysis
Correlation analysis is the study of the relationship between variables.
The basic idea of correlation analysis is to report the strength of the
relationship between two variables, i.e. the dependent variable and the
independent variable.
The usual first step is to plot the data in a scatter diagram or scatter
plot, which is a chart that portrays the relationship between two variables.
The dependent variable (𝑌) is the variable that is predicted or estimated.
It is also called the response variable.
The independent variable (𝑋) is the variable that provides the basis for
estimation. It is also known as the predictor or explanatory variable.
The strength of the relationship between the dependent and independent
variables is expressed using the coefficient of correlation.
Coefficient of Correlation
Introduced by Karl Pearson, the coefficient of correlation (𝑟) is used to
describe the strength of the relationship between two sets of variables.
Also referred to as Pearson’s 𝑟, it can take any value from –1 to +1.
The scatter plots below show perfect positive and negative correlation:
[Figure: two scatter plots showing perfect negative correlation (r = –1.00) and perfect positive correlation (r = +1.00)]
The following summarises the strength and direction of the coefficient of correlation:

[Figure: scale of r from –1 (perfect negative) to +1 (perfect positive), with weak, moderate and strong regions on either side of 0 (no correlation)]

The coefficient of correlation can be computed from the deviations about the means:

r = Σ(X − X̄)(Y − Ȳ) / [(n − 1) s_X s_Y]
where
𝑋 = independent variable
𝑌 = dependent variable
𝑛 = number of observations
X̄, Ȳ = means of the independent and dependent variables
s_X, s_Y = standard deviations of the independent and dependent variables
The computational formula of the coefficient of correlation based on actual
values of 𝑋 and 𝑌 is given below:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
where
𝑋 = independent variable
𝑌 = dependent variable
𝑛 = number of observations
Example 7.1

A company sells engineering software. The company director claims that the more calls his sales team make, the more software licenses they sell. Sales data from 10 sales engineers were collected to see if he is right.

Sales Engineer   No. of Calls   No. of Software Sold
1                20             30
2                40             60
3                20             40
4                30             60
5                10             30
6                10             40
7                20             40
8                20             50
9                20             30
10               30             70

(a) Draw a scatter plot to show the relationship between calls and sales. Comment on the relationship.

(b) Calculate the coefficient of correlation and comment on the value.
(a) Scatter plot of Number of Software Sold (Y) against Number of Calls (X)

[Figure: scatter plot with number of calls (0–45) on the x-axis and number of software sold (0–80) on the y-axis]

There seems to be a positive correlation between the number of software sold and the number of calls. As the number of calls increases, the number of software sold also increases.
(b) Extend the table with the columns needed for the computational formula:

Sales Engineer   X     Y     XY      X²     Y²
1                20    30    600     400    900
2                40    60    2400    1600   3600
3                20    40    800     400    1600
4                30    60    1800    900    3600
5                10    30    300     100    900
6                10    40    400     100    1600
7                20    40    800     400    1600
8                20    50    1000    400    2500
9                20    30    600     400    900
10               30    70    2100    900    4900
Σ                220   450   10800   5600   22100
Substituting the column totals into the computational formula:

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
  = [10(10800) − (220)(450)] / √{[10(5600) − (220)²][10(22100) − (450)²]}
  = 9000 / √[(7600)(18500)]
  = 0.759

The value indicates a fairly strong positive correlation between the number of calls and the number of software sold.
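The calculation above can be reproduced with a short script (a sketch: the data come from the table, but the variable names are my own):

```python
# Sketch: Pearson's r for Example 7.1 via the computational formula.
import math

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y
n = len(calls)

sum_x  = sum(calls)                                # 220
sum_y  = sum(sold)                                 # 450
sum_xy = sum(x * y for x, y in zip(calls, sold))   # 10800
sum_x2 = sum(x * x for x in calls)                 # 5600
sum_y2 = sum(y * y for y in sold)                  # 22100

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 3))  # → 0.759
```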
Coefficient of Determination
The coefficient of determination (R²) is the proportion of the total variation in the dependent variable Y that is explained by the variation in the independent variable X.
It is commonly known as the R-squared value, given that it is the square of the coefficient of correlation r.
In Example 7.1, we found 𝑟 = 0.759, which we then concluded as
indicating a “strong” relationship between the variables.
However, the terms “weak”, “moderate” and “strong” are ambiguous
because they do not provide a precise meaning.
A measure that has a more easily interpreted meaning is the coefficient of
determination.
Referring back to Example 7.1, if r = 0.759 then we get R² = 0.576, found by 0.759².
This is a proportion, or a percentage, so we can say that 57.6% of the variation in the number of software sold is explained by the variation in the number of calls.
Thus, R² indicates the percentage of the variation in the response variable Y that is explained by the variation in the predictor variable X.
A t-test for the coefficient of correlation (two-tailed, H₀: ρ = 0 against H₁: ρ ≠ 0) is conducted with the t value computed using the following equation:

t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom

For Example 7.1:

t = r√(n − 2) / √(1 − r²) = 0.759√(10 − 2) / √(1 − 0.759²) = 3.297

Since the calculated t = 3.297 exceeds the critical value of 2.306 (α = 0.05, 8 degrees of freedom), we reject H₀ and conclude that there is strong evidence to suggest the correlation in the population is not zero.
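The test statistic can be checked with a few lines of code (a sketch using only the standard library; the critical value 2.306 is taken from the slides rather than computed):

```python
# Sketch: t-test for the correlation coefficient of Example 7.1.
import math

r, n = 0.759, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 3))   # → 3.297

# Compare with the two-tailed critical value at alpha = 0.05, df = 8.
print(t > 2.306)     # → True, so H0 is rejected
```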
Simple Linear Regression
Simple linear regression is a statistical method that allows us to
summarise and study relationships between two continuous (quantitative)
variables, which are:
• the independent variable 𝑋, also known as the predictor, regressor or
explanatory variable.
• the dependent variable 𝑌, also known as the response or outcome
variable.
Simple linear regression gets its adjective “simple” because it concerns
the study of only one predictor variable.
In contrast, multiple linear regression gets its adjective “multiple” because
it concerns the study of two or more predictor variables.
Developing A Regression Equation Using Least
Squares Method
Regression analysis deals with finding the best relationship between 𝑌
and 𝑋, quantifying the strength of the relationship and using methods that
allow for the estimation of the response values, given the values of the
predictors.
A linear equation is developed to define the linear relationship between 𝑌
and 𝑋. This is called the regression equation or regression model.
For this purpose, we use the least squares method. This method
minimises the sum of the squares of the vertical distances between the
observed values Y and the predicted values Ŷ.
The regression line produced using the least squares method is
commonly referred to as the best-fit line.
[Figure: scatter plot of number of software sold (0–80) against number of calls (0–45) with three candidate lines A, B and C drawn through the data]

Which line is the best-fit line?

Answer:
• Line A: sum of the squares of the vertical deviations = 877
• Line B: sum of the squares of the vertical deviations = 668
• Line C: sum of the squares of the vertical deviations = 660

Line C was found to have the least sum of squares of the vertical deviations. Therefore, line C is the best-fit line.
The regression equation takes the form

Ŷ = a + bX

where
Ŷ = predicted value of the response variable Y (dependent variable)
X = predictor variable (independent variable)
a = estimated value of Y when X = 0 (the Y-intercept)
b = the slope of the regression line
The slope of the regression line can be determined using:

b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²]

and the Y-intercept using:

a = ΣY/n − b(ΣX/n)

The standard error of the estimate, s_Y·X, measures the dispersion of the observed values about the regression line:

s_Y·X = √[Σ(Y − Ŷ)² / (n − 2)]  or  s_Y·X = √[(ΣY² − aΣY − bΣXY) / (n − 2)]
Example 7.2
Refer to the question in Example 7.1.
(a) Determine the regression equation that relates calls and sales.
(b) Estimate the number of software licenses that will be sold if a
sales engineer makes 60 calls.
(c) Estimate the sales increase, given a 10% increase in average calls.
(d) Calculate the standard error of the estimate.
Ŷ = a + bX
where
Ŷ = predicted number of software sales
X = number of calls
(a) Using the column totals from Example 7.1 (n = 10, ΣX = 220, ΣY = 450, ΣXY = 10800, ΣX² = 5600):

b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = [10(10800) − (220)(450)] / [10(5600) − (220)²] = 9000/7600 = 1.184

a = ΣY/n − b(ΣX/n) = 450/10 − 1.185(220/10) = 18.93

The regression equation is therefore Ŷ = 18.93 + 1.184X.
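The coefficients can be verified from the column totals (a sketch; note the slides round b mid-calculation, so the unrounded intercept differs slightly from 18.93):

```python
# Sketch: least-squares slope and intercept for Example 7.2.
n = 10
sum_x, sum_y, sum_xy, sum_x2 = 220, 450, 10800, 5600

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)
# b ≈ 1.184; unrounded a ≈ 18.95 (the slides use 18.93 after rounding b)
print(round(b, 3), round(a, 2))
```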
(b) Substitute X = 60 into the regression equation.

Ŷ = 18.93 + 1.184X = 18.93 + 1.184(60) = 89.97 ≈ 90

(c) The average number of calls is ΣX/n = 220/10 = 22

Predicted sales at the average is Ŷ = 18.93 + 1.184(22) = 44.98

A 10% increase in average calls gives X = 22 × 1.10 = 24.2, so predicted sales become Ŷ = 18.93 + 1.184(24.2) = 47.58

Sales increase is (47.58 − 44.98)/44.98 × 100% = 5.78%
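Parts (b) and (c) can be reproduced directly (a sketch using the slides' rounded coefficients a = 18.93 and b = 1.184):

```python
# Sketch: predictions for Example 7.2 parts (b) and (c).
a, b = 18.93, 1.184

def predict(x):
    return a + b * x

print(round(predict(60)))      # (b) → 90

base   = predict(22)           # sales at the average number of calls
raised = predict(22 * 1.10)    # sales after a 10% increase in calls
pct = (raised - base) / base * 100
# ≈ 5.79 unrounded; the slides round intermediate values to get 5.78
print(round(pct, 2))
```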
(d)

s_Y·X = √[(ΣY² − aΣY − bΣXY) / (n − 2)]
      = √[(22100 − 18.93(450) − 1.184(10800)) / (10 − 2)]
      = 9.96

The standard error of the estimate of 9.96 shows that the observed
values are dispersed 9.96 units about the regression line, which is a
fairly small dispersion. Hence, the estimation of sales is fairly
accurate.
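The standard error computation above can be sketched as follows (again using the slides' rounded a and b):

```python
# Sketch: standard error of the estimate for Example 7.2 (d),
# using the sums-of-squares form of the formula.
import math

n = 10
sum_y, sum_xy, sum_y2 = 450, 10800, 22100
a, b = 18.93, 1.184

s_yx = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(round(s_yx, 2))  # → 9.96
```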
Assumptions For Linear Regression
The following key assumptions are made when applying linear regression:
Normality
• The residuals of the regression follow a normal
distribution.
Homoscedasticity
• The residuals of the regression are equally distributed,
meaning they have a constant variance.
Linearity
• The predictor variables have a linear relationship with
the response variable.
Notes:
1. If the normality and homoscedasticity assumptions are met, we should not have any
problems in conforming with linearity.
2. For simple linear regression, we do not have to assume that there is no
multicollinearity, because there is only one predictor variable. For multiple linear
regression, this assumption must be met because there are two or more predictor
variables.
Residuals (or error terms), denoted by e, are the differences between the
observed response variable Y and the predicted response variable Ŷ:

e = Y − Ŷ
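For the data of Example 7.1, the residuals can be computed in a few lines (a sketch; list names are mine and Ŷ = 18.93 + 1.184X is the fitted equation from Example 7.2):

```python
# Sketch: residuals e = Y − Ŷ for the software-sales data.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # Y

fitted    = [18.93 + 1.184 * x for x in calls]
residuals = [round(y, 2) - round(f, 2) for y, f in zip(sold, fitted)]
print(residuals)  # e.g. first residual = 30 − 42.61 = −12.61
```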
Checking Homoscedasticity and Linearity Using Residual Plot

[Figure: example residual plots, each centred about zero]
Checking Normality Using Normal Probability Plot
The assumption of normality can be tested using a normal probability
plot (normal quantile plot or normal Q-Q plot) of the residuals. It is a plot of
sample quantiles against theoretical quantiles.
A plot that shows a linear trend indicates that the residuals are normally
distributed.
[Figure: a normal probability plot that shows a linear trend indicates normality; a plot that does not show a linear trend indicates non-normality]
Note: A normal probability plot of the residuals is best produced using Excel or SPSS. Refer to online tutorials.
Example 7.3
Referring to Examples 7.1 and 7.2, construct a residual plot with
(a) the fitted values on the 𝑥-axis
(b) the predictor variable on the 𝑥-axis
Based on either one of these plots, do the assumptions of linearity and
homoscedasticity hold?
Calculate the residuals, e = Y − Ŷ, given Ŷ = 18.93 + 1.184X.

Sales Engineer   No. of Calls (X)   No. of Software Sold (Y)   Fitted Value (Ŷ)   Residual (e)
1                20                 30                         42.61              −12.61
2                40                 60                         66.29              −6.29
3                20                 40                         42.61              −2.61
4                30                 60                         54.45              5.55
5                10                 30                         30.77              −0.77
6                10                 40                         30.77              9.23
7                20                 40                         42.61              −2.61
8                20                 50                         42.61              7.39
9                20                 30                         42.61              −12.61
10               30                 70                         54.45              15.55
(a) Plot of residuals e against fitted values Ŷ

[Figure: residual plot with fitted values (0–70) on the x-axis and residuals (−15 to 15) on the y-axis]
(b) Plot of residuals e against predictor variable X

[Figure: residual plot with the predictor variable (0–50) on the x-axis and residuals (−15 to 15) on the y-axis]
Both residual plots show that the residuals are:
• centred about the value of zero in any vertical direction
• homoscedastic: the spread of the residuals is the same in any vertical direction
• without trend: randomly spread with no obvious pattern

Hence, the assumptions of linearity and homoscedasticity hold.