
BFC 34303

CIVIL ENGINEERING STATISTICS


Chapter 7
Simple Linear Regression
and Correlation
Faculty of Civil and Environmental Engineering
Universiti Tun Hussein Onn Malaysia

Correlation Analysis
Correlation analysis is the study of the relationship between variables.
The basic idea of correlation analysis is to report the strength of the
relationship between two variables, i.e. the dependent variable and the
independent variable.
The usual first step is to plot the data in a scatter diagram or scatter
plot, which is a chart that portrays the relationship between two variables.
The dependent variable (𝑌) is the variable that is predicted or estimated.
It is also called the response variable.
The independent variable (𝑋) is the variable that provides the basis for
estimation. It is also known as the predictor or explanatory variable.
The strength of the relationship between the dependent and independent
variables is expressed using the coefficient of correlation.
Coefficient of Correlation
Introduced by Karl Pearson, the coefficient of correlation (𝑟) is used to
describe the strength of the relationship between two sets of variables.
Also referred to as Pearson’s 𝑟, it can take any value from –1 to +1.
The scatter plots below show perfect positive and negative correlation:

[Two scatter plots omitted: points falling exactly on a line with negative slope (r = –1.00, perfect negative correlation) and points falling exactly on a line with positive slope (r = +1.00, perfect positive correlation).]

If there is absolutely no relationship between the two sets of variables, r is zero.
If r is close to zero, the relationship is considered weak, while an r value close to +1 or –1 indicates a strong relationship.
The scatter plots below show zero, weak and strong correlation between the variables Y and X.
[Three scatter plots omitted: zero correlation, weak correlation and strong correlation between Y and X.]

The following summarises the strength and direction of the coefficient of correlation:

| Value of r | Interpretation |
|---|---|
| r = –1.00 | Perfect negative correlation |
| –1.00 < r < –0.50 | Strong negative correlation |
| r ≈ –0.50 | Moderate negative correlation |
| –0.50 < r < 0 | Weak negative correlation |
| r = 0 | No correlation |
| 0 < r < +0.50 | Weak positive correlation |
| r ≈ +0.50 | Moderate positive correlation |
| +0.50 < r < +1.00 | Strong positive correlation |
| r = +1.00 | Perfect positive correlation |

The conceptual form of the coefficient of correlation is as follows:

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n-1)\, s_X s_Y} $$

where
X = independent variable
Y = dependent variable
n = number of observations
$\bar{X}$, $\bar{Y}$ = means of the independent and dependent variables
$s_X$, $s_Y$ = standard deviations of the independent and dependent variables

The computational formula of the coefficient of correlation, based on the actual values of X and Y, is given below:

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[\, n\sum X^2 - \left(\sum X\right)^2 \right]\left[\, n\sum Y^2 - \left(\sum Y\right)^2 \right]}} $$

where
X = independent variable
Y = dependent variable
n = number of observations
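
As an illustration, the computational formula translates directly into code. The sketch below is ours, not part of the course material; the function name and data handling are placeholders.

```python
# A direct translation of the computational formula for Pearson's r.
from math import sqrt

def pearson_r(x, y):
    """Coefficient of correlation computed from the raw sums of X and Y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```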

Example 7.1
A company sells engineering software. The company director claims that the more calls his sales team makes, the more software licenses get sold. Sales data from 10 sales engineers were collected to see if he is right.

(a) Draw a scatter plot to show the relationship between calls and sales. Comment on the relationship.
(b) Calculate the coefficient of correlation and comment on the value.

| Sales Engineer | No. of Calls | No. of Software Sold |
|---|---|---|
| 1 | 20 | 30 |
| 2 | 40 | 60 |
| 3 | 20 | 40 |
| 4 | 30 | 60 |
| 5 | 10 | 30 |
| 6 | 10 | 40 |
| 7 | 20 | 40 |
| 8 | 20 | 50 |
| 9 | 20 | 30 |
| 10 | 30 | 70 |
(a) Scatter plot of Number of Software Sold (Y) against Number of Calls (X):

[Scatter plot omitted: Number of Software Sold (0 to 80) on the vertical axis against Number of Calls (0 to 45) on the horizontal axis.]

There seems to be a positive correlation between the number of software sold and the number of calls. As the number of calls increases, the number of software sold also increases.

(b) The coefficient of correlation, r, is determined as follows:

| Sales Engineer | No. of Calls (X) | No. of Software Sold (Y) | XY | X² | Y² |
|---|---|---|---|---|---|
| 1 | 20 | 30 | 600 | 400 | 900 |
| 2 | 40 | 60 | 2400 | 1600 | 3600 |
| 3 | 20 | 40 | 800 | 400 | 1600 |
| 4 | 30 | 60 | 1800 | 900 | 3600 |
| 5 | 10 | 30 | 300 | 100 | 900 |
| 6 | 10 | 40 | 400 | 100 | 1600 |
| 7 | 20 | 40 | 800 | 400 | 1600 |
| 8 | 20 | 50 | 1000 | 400 | 2500 |
| 9 | 20 | 30 | 600 | 400 | 900 |
| 10 | 30 | 70 | 2100 | 900 | 4900 |
| Σ | 220 | 450 | 10800 | 5600 | 22100 |


$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[\, n\sum X^2 - \left(\sum X\right)^2 \right]\left[\, n\sum Y^2 - \left(\sum Y\right)^2 \right]}} = \frac{10(10800) - (220)(450)}{\sqrt{\left[ 10(5600) - (220)^2 \right]\left[ 10(22100) - (450)^2 \right]}} = 0.759 $$

The value of r = 0.759 is fairly close to +1.00, so we can conclude that there is a strong positive relationship between calls and sales. An increase in calls will most likely result in an increase in sales. So, the director's claim was true.
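
For readers who want to verify the arithmetic, a library routine such as scipy.stats.pearsonr (assuming SciPy is available) reproduces this value:

```python
# Verifying Example 7.1 with SciPy; pearsonr returns (r, p-value).
from scipy.stats import pearsonr

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

r, p = pearsonr(calls, sold)
print(round(r, 3))  # 0.759
```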

Coefficient of Determination
The coefficient of determination (R²) is the proportion of the total variation in the dependent variable Y that is explained by the variation in the independent variable X.
It is commonly known as the R-squared value, given that it is the square of the coefficient of correlation r.
In Example 7.1, we found r = 0.759, which we concluded indicates a "strong" relationship between the variables. However, the terms "weak", "moderate" and "strong" are ambiguous because they do not have a precise meaning. A measure with a more easily interpreted meaning is the coefficient of determination.

Referring back to Example 7.1, if r = 0.759 then we get R² = 0.576, found by 0.759². This is a proportion, or a percentage, so we can say that 57.6% of the variation in the number of software sold is explained by the variation in the number of calls.
Thus, R² indicates the percentage of the variation in the response variable Y that is explained by the variation in the predictor variable X.
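
As a one-line illustrative sketch (continuing from the SciPy check above; the variable names are our own):

```python
# R-squared is simply the square of the correlation coefficient.
r = 0.759
r_squared = r ** 2  # 0.576, i.e. 57.6% of the variation in Y is explained by X
```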


Testing The Significance Of The Coefficient Of Correlation
Recall that in Example 7.1, only 10 sales engineers were sampled. Could it be that the r = 0.759 we obtained was due to chance, and the correlation in the population is actually different, or worse, equal to zero?
Resolving this dilemma requires a test to answer the obvious question: could there be zero correlation in the population from which the sample was selected?
Let ρ be the correlation in the population. The null and alternative hypotheses are:

H₀: ρ = 0 (the correlation in the population is zero)
H₁: ρ ≠ 0 (the correlation in the population is different from zero)

A t-test for the coefficient of correlation (two-tailed) is conducted with the t value computed using the following equation:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \quad \text{with } n-2 \text{ degrees of freedom} $$

At the 0.05 significance level with n − 2 = 8 degrees of freedom, the critical t is determined to be 2.306. Since this is a two-tailed test, the decision rule is that the null hypothesis is rejected if the calculated t > 2.306 or the calculated t < −2.306.

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.759\sqrt{10-2}}{\sqrt{1-0.759^2}} = 3.297 $$

Since the calculated t > 2.306, we reject H₀ and conclude that there is strong evidence to suggest that the correlation in the population is not zero.
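
The whole test is a few lines of code. A minimal sketch, assuming SciPy for the critical value (scipy.stats.t.ppf is the inverse CDF of the t-distribution):

```python
# Significance test for r: two-tailed t-test with n - 2 degrees of freedom.
from math import sqrt
from scipy.stats import t as t_dist

r, n, alpha = 0.759, 10, 0.05
t_calc = r * sqrt(n - 2) / sqrt(1 - r ** 2)    # 3.297
t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2)   # 2.306
reject_h0 = abs(t_calc) > t_crit               # True: reject H0
```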


Student's t-distribution table

The table gives the values of $t_{\alpha, df}$ where $P(T_{df} > t_{\alpha, df}) = \alpha$, and $t_{\alpha/2, df}$ where $P(T_{df} > t_{\alpha/2, df}) = \alpha/2$. The column headings show the level of significance α for a two-tailed test; the corresponding levels for a one-tailed test (α/2) are 0.10, 0.05, 0.025, 0.01, 0.005, 0.001 and 0.0005.

| df | α = 0.20 | 0.10 | 0.05 | 0.02 | 0.01 | 0.002 | 0.001 |
|---|---|---|---|---|---|---|---|
| 1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 | 318.310 | 636.620 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 | 22.326 | 31.598 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 10.213 | 12.924 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 7.173 | 8.610 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 5.893 | 6.869 |
| 6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 | 5.208 | 5.959 |
| 7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 | 4.785 | 5.408 |
| 8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 4.501 | 5.041 |
| 9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 | 4.297 | 4.781 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 | 4.144 | 4.587 |
| 11 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 | 4.025 | 4.437 |
| 12 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 | 3.930 | 4.318 |
| 13 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 | 3.852 | 4.221 |
| 14 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 | 3.787 | 4.140 |
| 15 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 | 3.733 | 4.073 |
Simple Linear Regression
Simple linear regression is a statistical method that allows us to
summarise and study relationships between two continuous (quantitative)
variables, which are:
• the independent variable 𝑋, also known as the predictor, regressor or
explanatory variable.
• the dependent variable 𝑌, also known as the response or outcome
variable.
Simple linear regression gets its adjective “simple” because it concerns
the study of only one predictor variable.
In contrast, multiple linear regression gets its adjective “multiple” because
it concerns the study of two or more predictor variables.


The relationship between two variables may be a deterministic relationship, meaning the response and predictor variables have an "exact" relationship or dependence; for example, the relationship between the circumference and radius of a circle.
A statistical relationship, on the other hand, is not an exact relationship; for example, the relationship between time spent studying and exam performance. It is a relationship in which a "trend" exists between the predictor and the response, but there is also some "scatter". This is typical of relationships developed using simple linear regression.

Developing A Regression Equation Using The Least Squares Method
Regression analysis deals with finding the best relationship between Y and X, quantifying the strength of the relationship, and using methods that allow for the estimation of the response values given the values of the predictors.
A linear equation is developed to define the linear relationship between Y and X. This is called the regression equation or regression model.
For this purpose, we use the least squares method. This method minimises the sum of the squares of the vertical distances between the observed values Y and the predicted values Y′.
The regression line produced using the least squares method is commonly referred to as the best-fit line.

[Scatter plot omitted: three candidate lines, A, B and C, drawn through the Number of Software Sold vs Number of Calls data.]

Which line is the best-fit line?
Answer: the line that has the least sum of the squares of the vertical deviations, or errors, about it.

[Plot omitted: line A with the vertical deviation of each data point marked.] For line A, the sum of the squares of the vertical deviations is 877.

[Plot omitted: line B with the vertical deviation of each data point marked.] For line B, the sum of the squares of the vertical deviations is 668.

[Plot omitted: line C with the vertical deviation of each data point marked.] For line C, the sum of the squares of the vertical deviations is 660. This was found to be the least sum of squares; therefore, line C is the best-fit line.


The general form of the simple linear regression equation is as follows:

$$ Y' = a + bX $$

where
Y′ = predicted value of the response variable Y (dependent variable)
X = predictor variable (independent variable)
a = estimated value of Y when X = 0 (the Y-intercept)
b = the slope of the regression line

Note: a and b are called the regression coefficients.

The slope of the regression line can be determined using:

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} $$

The Y-intercept can be determined using:

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$

where
X = independent variable
Y = dependent variable
n = number of observations
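
These two formulas also translate directly into code. A minimal sketch (the function name is our own):

```python
# Least squares coefficients from raw sums: slope b, then intercept a.
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * (sum_x / n)                               # Y-intercept
    return a, b
```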

Standard Error Of The Estimate
How do we know if the prediction of the variables is accurate? This can be determined by calculating the standard error of the estimate ($s_{y.x}$), which is a measure of the dispersion (or scatter) of the observed values around the regression line.
The standard error of the estimate is based on the squared deviations from the regression line. If the value is small, the regression line is representative of the data.
The standard error of the estimate is given by the formula:

$$ s_{y.x} = \sqrt{\frac{\sum (Y - Y')^2}{n-2}} \quad \text{or} \quad s_{y.x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} $$
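
In code, the deviation form of the formula is only a few lines. A minimal sketch, assuming y_pred holds the fitted values a + bX:

```python
# Standard error of the estimate from observed and fitted values.
from math import sqrt

def std_error_of_estimate(y, y_pred):
    n = len(y)
    sse = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))  # sum of (Y - Y')^2
    return sqrt(sse / (n - 2))
```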

Example 7.2
Refer to the question in Example 7.1.
(a) Determine the regression equation that relates calls and sales.
(b) Estimate the number of software licenses that can be sold if a sales engineer makes 60 calls.
(c) Estimate the sales increase, given a 10% increase in average calls.
(d) Calculate the standard error of the estimate.

The regression equation takes the form

$$ Y' = a + bX $$

where
Y′ = predicted number of software sales
X = number of calls

(a) From Example 7.1: ΣX = 220, ΣY = 450, ΣXY = 10800, ΣX² = 5600, ΣY² = 22100.

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} = \frac{10(10800) - (220)(450)}{10(5600) - (220)^2} = 1.184 $$

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} = \frac{450}{10} - 1.184\left(\frac{220}{10}\right) = 18.95 $$

Therefore, the regression equation is

$$ Y' = 18.95 + 1.184X $$

(b) Substitute X = 60 into the regression equation:
Y′ = 18.95 + 1.184X = 18.95 + 1.184(60) = 89.99 ≈ 90

The estimated number of software licenses sold is 90.

(c) The average number of calls is ΣX/n = 220/10 = 22.

Predicted sales: Y′ = 18.95 + 1.184(22) = 45.00

A 10% increase in calls is equivalent to 110% of 22 = 24.2.

Predicted sales: Y′ = 18.95 + 1.184(24.2) = 47.60

Sales increase: (47.60 − 45.00)/45.00 × 100% = 5.78%

(d)

$$ s_{y.x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}} = \sqrt{\frac{22100 - 18.95(450) - 1.184(10800)}{10-2}} = 9.91 $$

The standard error of the estimate of 9.91 shows that the observed values are dispersed about 9.91 units around the regression line, which is a fairly small dispersion. Hence, the estimation of sales is fairly accurate.
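
The whole of Example 7.2 can be cross-checked with a library routine such as scipy.stats.linregress (an assumption about the reader's toolchain; any least squares routine would do):

```python
# Cross-checking Example 7.2: intercept, slope and a prediction at X = 60.
from scipy.stats import linregress

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

fit = linregress(calls, sold)
print(fit.intercept, fit.slope)        # ~18.95 and ~1.184
print(fit.intercept + fit.slope * 60)  # ~90.0 licenses for 60 calls
```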

Assumptions For Linear Regression
Four key assumptions are made when applying linear regression:

• Normality: the residuals of the regression follow a normal distribution.
• Homoscedasticity: the residuals of the regression are equally distributed, meaning they have a constant variance.
• Linearity: the predictor variables have a linear relationship with the response variable.
• No multicollinearity: the predictor variables are not correlated with each other; in other words, they are independent of each other.


Notes:
1. If the normality and homoscedasticity assumptions are met, we should not have any problems conforming with linearity.
2. For simple linear regression, we do not have to assume that there is no multicollinearity, because there is only one predictor variable. For multiple linear regression, this assumption must be met because there are two or more predictor variables.

Residuals (or error terms), denoted by e, are the differences between the observed response variable Y and the predicted response variable Y′:

$$ e = Y - Y' $$

Residual plots are often used to check whether the homoscedasticity and linearity assumptions are true. A residual plot is a plot of the residuals e against the predicted variable Y′ (also known as the fitted values), or against the predictor variable X.
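
A residual plot is straightforward to produce in Python. A minimal sketch, assuming matplotlib is installed:

```python
# Residual plot: residuals e against the fitted values Y'.
import matplotlib.pyplot as plt

def residual_plot(y, y_pred):
    residuals = [yi - ypi for yi, ypi in zip(y, y_pred)]
    plt.scatter(y_pred, residuals)
    plt.axhline(0, linestyle="--")  # reference line at e = 0
    plt.xlabel("Fitted values, Y'")
    plt.ylabel("Residuals, e")
    plt.show()
```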
Checking Homoscedasticity and Linearity Using Residual Plot

To do this, we construct a residual plot and conduct a visual inspection. The residuals should be:
• unbiased: have an average value of zero in any vertical direction.
• homoscedastic: the spread of the residuals should be the same in any vertical direction.
• without trend: randomly spread with no obvious pattern.

[Figure omitted: residuals e plotted against predicted values Y′, scattered evenly about zero with no pattern, indicating that homoscedasticity and linearity are satisfied.]

Residual plots showing violation of assumptions:

[Figures omitted: three residual plots of e against predicted values Y′, each showing a violation of the homoscedasticity and/or linearity assumptions.]

Checking Normality Using Normal Probability Plot
The assumption of normality can be tested using a normal probability plot (normal quantile plot or normal Q-Q plot) of the residuals. It is a plot of sample quantiles against theoretical quantiles.
A plot that shows a linear trend indicates that the residuals are normally distributed.

[Figures omitted: a normal probability plot showing a linear trend (normality satisfied) and one that does not show a linear trend (normality violated).]

Note: A normal probability plot of the residuals is best produced using Excel or SPSS. Refer to online tutorials.
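
As an alternative to Excel or SPSS, scipy.stats.probplot produces the same plot in Python. A minimal sketch, assuming SciPy and matplotlib are installed:

```python
# Normal Q-Q plot of the residuals; a linear trend suggests normality.
import matplotlib.pyplot as plt
from scipy import stats

def qq_plot(residuals):
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()
```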

Example 7.3
Referring to Examples 7.1 and 7.2, construct a residual plot with
(a) the fitted values on the 𝑥-axis
(b) the predictor variable on the 𝑥-axis
Based on either one of these plots, are the assumptions of linearity and
homoscedasticity true?

Calculate the residuals, e = Y − Y′, given Y′ = 18.95 + 1.184X.

| Sales Engineer | No. of Calls (X) | No. of Software Sold (Y) | Fitted Value (Y′) | Residual (e) |
|---|---|---|---|---|
| 1 | 20 | 30 | 42.63 | –12.63 |
| 2 | 40 | 60 | 66.31 | –6.31 |
| 3 | 20 | 40 | 42.63 | –2.63 |
| 4 | 30 | 60 | 54.47 | 5.53 |
| 5 | 10 | 30 | 30.79 | –0.79 |
| 6 | 10 | 40 | 30.79 | 9.21 |
| 7 | 20 | 40 | 42.63 | –2.63 |
| 8 | 20 | 50 | 42.63 | 7.37 |
| 9 | 20 | 30 | 42.63 | –12.63 |
| 10 | 30 | 70 | 54.47 | 15.53 |


(a) Plot of residuals e against fitted values Y′:

[Scatter plot omitted: residuals (roughly –15 to +20) plotted against fitted values (0 to 70).]
(b) Plot of residuals e against the predictor variable X:

[Scatter plot omitted: residuals (roughly –15 to +20) plotted against the predictor variable (0 to 50).]


Checking the assumptions of linearity and homoscedasticity:

[Residual plot from part (b) repeated; omitted.]

We can conclude that the assumptions of linearity and homoscedasticity are true because the residuals seem to be:
• unbiased: they have an average value of zero in any vertical direction.
• homoscedastic: the spread of the residuals is the same in any vertical direction.
• without trend: randomly spread with no obvious pattern.
