ENGINEERING
STATISTICS
Semester 2
2021/2022
CHAPTER 5 :
SIMPLE LINEAR
REGRESSION
IMK, FSGM
At the end of this chapter, you must be able to answer the following questions:
5.1 & 5.2:
• What is correlation? (Textbook: LO1-LO4)
• How to find the correlation? (Textbook: LO1-LO4)
• How to infer the significance of the correlation? (Textbook: LO5)
• What is a simple linear regression? (Textbook: LO7)
5.3:
• What are the concepts to build/estimate a simple linear regression model? (Textbook: LO6)
• What is a simple linear regression model? What is the difference between correlation & regression? (Textbook: LO7-LO8)
5.4:
• Variation is everywhere: What are the components of variation around the regression line? (Textbook: LO9)
• How does a simple linear regression model work & how is it evaluated? (Textbook: LO9-LO12)
• How to check the assumptions that need to be met for a simple linear regression model to be valid? (Textbook: LO13)
5.5:
• How to perform statistical inference for a simple linear regression model? (Textbook: LO14)
• How is a simple linear regression model used to estimate and predict likely values? (Textbook: LO15)
The Learning Dots
Correlation Analysis
Regression Analysis
Textbook Pg 186
HOW TWO VARIABLES ARE RELATED?
[Figure: two scatter plots of y-variable against x-variable.]
Image sources: support.minitab.com; https://www.polarsparc.com/xhtml/LinearRegression-3.html
What & Why
• Used to predict the value of the outcome (dependent) variable based on the value of the predictor (independent) variable.
Textbook Pg 187
How to find the relationship / association using a graph?
WHEN to use?
• To explore potential relationship/association between the
two variables (pairs of variables).
• To answer this question: As the values of one variable
change, do we see corresponding changes in the other
variable?
HOW to draw?
• x-axis : predictor/explanatory/independent variable.
• y-axis : outcome/response/dependent variable.
How to find the relationship / association using a graph?
• Positive relationship: as the x-variable increases, the y-variable increases.
• No relationship
• Negative relationship: as the x-variable increases, the y-variable decreases.
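HOW to calculate 𝒓?
Formula to calculate r (definition form):
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Formula to calculate r (computational form):
$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$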
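The two forms of the formula for r can be checked against each other numerically. The sketch below is only an illustration: the data values and the variable names (`r_definition`, `r_computational`) are made up and are not taken from the textbook.

```python
import math

# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Definition form: products and squares of deviations from the sample means.
x_bar = sum(x) / n
y_bar = sum(y) / n
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
denominator = math.sqrt(
    sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y)
)
r_definition = numerator / denominator

# Computational form: Sxy / sqrt(Sxx * Syy), built from raw sums only.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
s_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
r_computational = s_xy / math.sqrt(s_xx * s_yy)

print(round(r_definition, 4), round(r_computational, 4))
```

Both forms should agree up to floating-point rounding; the computational form is usually the more convenient one for hand or spreadsheet calculation.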
How to describe numerically the relationship between two
variables?
WHAT is the property of 𝒓?
−𝟏 ≤ 𝒓 ≤ 𝟏
[Figure: scale of correlation strength with labels very weak, weak, moderate, very strong.]
Relationship between Correlation Coefficient & Scatter Plot
OTHER EXAMPLES
How correlation coefficient relates with a SCATTER plot?
HOW to interpret 𝒓?
• The nearer 𝒓 to 1 or -1, the …………. the linear
relationship.
Curvilinear relationship
Exercise 1: What do you think the correlation coefficient for Plot A & B would be?
[Figure: Plot A and Plot B, scatter plots of y-variable against x-variable.]
Solution:
From the scatter plot and the correlation coefficient, can we generalize the
existence of a linear relationship to the population of interest?
Sample Correlation Coefficient : From Sample to Population
The value of r is computed from a sample. When r is not equal to zero, there are
two possibilities:
OR
Hypothesis testing for population correlation coefficient, 𝝆
2. Obtain the critical values: $\pm t_{\alpha/2,\ df=n-2}$, since the test is a …………. test ($H_1: \rho \neq 0$).
• The critical value is the value that separates the rejection and non-rejection regions.
• It is determined by the significance level & the type of test.
• Significance level is the …………. (probability you …………. when H0 is ………….).
3. Compute the test statistic.
• The test statistic is the value that we use to make a decision about the …………. hypothesis.
• The …………. the value, the …………. the chance to support H1.
• The t-test is used to determine the significance of the linear correlation:
$$t_{test} = r\sqrt{\frac{n-2}{1-r^2}}$$
5. State the conclusion in context of the problem and claim.
• If you reject H0, it means that you have …………. evidence from the sample data to support H1 at the specified significance level.
Decision
Remarks:
Always write the conclusion:
• in context of the x-variable & y-variable.
• the significance level.
Exercise 2
i. Can you conclude that the correlation between the two variables is statistically significant at the 5% level?
ii. How would you relate i. to the calculated sample correlation coefficient from Exercise 1?
Solution:
i.
1. State hypotheses:
𝑯𝟎 : There is no linear relationship between ….. and …… (𝝆 = 𝟎)
𝑯𝟏 : There is a linear relationship between ….. and …… (𝝆 ≠ 𝟎)
3. t-test:
$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.9574\sqrt{\frac{25-2}{1-(0.9574)^2}} = 15.918$$
[Figure: scatter plot of y-variable against x-variable.]
5. Conclusion:
There is sufficient evidence to conclude that there is a significant linear relationship
between … and ….. at 5% significance level.
Exercise 3
i. Can you conclude that the correlation between the two variables is statistically significant at the 5% level?
ii. How would you relate i. to the calculated sample correlation coefficient from Exercise 1?
[Figure: scatter plot of y-variable against x-variable.]
Recall: What does it mean when we conclude that the linear correlation is
"statistically significant"?
Solution:
Example 5.5 (textbook):
Perform the hypothesis test for the significance of linear correlation between Physics
scores and Mathematics scores in Example 5.1 at α=0.05.
Solution:
1. 𝑯𝟎: There is no linear relationship between Physics scores and Mathematics scores (𝝆 = 𝟎)
𝑯𝟏: There is a linear relationship between Physics scores and Mathematics scores (𝝆 ≠ 𝟎)
3. t-test:
$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.991\sqrt{\frac{7-2}{1-(0.991)^2}} = 16.55$$
5. Conclusion:
There is sufficient evidence to conclude that there is a significant linear relationship
between Physics scores and Mathematics scores at 5% significance level.
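The test statistic in Example 5.5 can be reproduced with a few lines of code. This is a minimal sketch, not the textbook's procedure: the function name `correlation_t_statistic` is hypothetical, and the critical value 2.571 is an assumption read from a standard t table for a two-tailed test at α = 0.05 with df = 5 (the value is not printed on the slide).

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """Return t = r * sqrt((n - 2) / (1 - r^2)), which has n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Values quoted in Example 5.5.
r, n = 0.991, 7
t_test = correlation_t_statistic(r, n)

# Assumption: two-tailed critical value t(0.025, df = 5) from a standard t table.
t_critical = 2.571

# Reject H0 when the test statistic falls in either rejection region.
reject_h0 = abs(t_test) > t_critical

print(f"t = {t_test:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the test statistic lies far beyond the critical value, which agrees with the conclusion stated in step 5.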
List three things you have learned so far ….
QUESTIONS FOR SELF-REFLECTION
• Why scatter plot? (LO2 Textbook)
[Diagram: The Learning Dots, linking Correlation Analysis and Regression Analysis to Simple Linear Regression.]
Connecting the dots is the way we make information useful ~ Carolyn Sloan
https://www.linkedin.com/pulse/connecting-dots-closer-look-value-learning-how-learn-carolyn-sloan
5.3
REGRESSION
ANALYSIS
Textbook Pg 195
WHAT IS A LEAST-SQUARE REGRESSION LINE OR
BEST-FIT LINE?
• When r is positive, the line slopes upward to the right or, as the value of x …………., y ………….
• When r is negative, the line slopes downward to the right or, as the value of x …………., y ………….
• The closer the line is to all the points on the scatter plot, the better the relationship.
• The reason you need a line of best fit is that the values of y will be predicted from the
values of x ; hence, the closer the points are to the line, the better the fit and the
prediction will be.
Deterministic Model
• Output can be predicted with certainty. No elements of randomness/uncertainty.
• Example: in mathematics, a deterministic model is a mathematical model in which the output is determined only by the specified values of the input data and the initial conditions.
Simple Linear Regression Model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where:
• $\beta_0$ is the population intercept of the line with the y-axis
• $\beta_1$ is the population slope of the line
• $\varepsilon_i$ is the random error/unexplained variation in y
                       Population                           Sample
y-intercept            $\beta_0$                            $\hat{\beta}_0$
Regression equation    $\mu_{y|x} = \beta_0 + \beta_1 x$    $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$\mu_{y|x}$ is the true mean of y for a given x. We assume all points fall on the line.
How it works: The sum of squares (SS) of the …………. of data points (observed values) and
the regression line …………. .
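Intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
In $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$:
• x is the independent variable / explanatory variable.
• $\hat{y}$ is the estimated value of y for a given value of x.
The line always passes through the mean of both x-variable and y-variable, $(\bar{x}, \bar{y})$.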
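The estimation step can be sketched in code. The intercept follows the formula on the slide; the slope uses the standard least-squares result Sxy / Sxx, which is an assumption here because the slope formula does not survive in these notes. The data and the helper name `predict` are hypothetical.

```python
# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Assumed standard least-squares slope: Sxy / Sxx.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx

# Intercept as given on the slide: y-bar minus slope times x-bar.
intercept = y_bar - slope * x_bar

def predict(x_value: float) -> float:
    """Estimated value of y for a given value of x."""
    return intercept + slope * x_value

print(f"y_hat = {intercept:.3f} + {slope:.3f} x")
print("passes through the means:", abs(predict(x_bar) - y_bar) < 1e-9)
```

The last line checks the property stated above: evaluating the fitted line at the mean of x returns the mean of y.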
Properties of Least-Square Regression Line & Residuals
1. $\sum \hat{\varepsilon}_i = 0$
2. $\sum (\hat{\varepsilon}_i)^2$ is as small as possible.
3.
4. Unbiased estimate
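Properties 1 and 2 can be checked numerically. The sketch below uses made-up data and a hypothetical helper, `sum_squared_residuals`; it compares the fitted line with only two nearby alternative lines, so it illustrates the minimum property rather than proving it.

```python
# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Least-squares estimates (slope assumed to be Sxy / Sxx).
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

def sum_squared_residuals(b0: float, b1: float) -> float:
    """Sum of squared vertical distances between the data and the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Property 1: the residuals sum to zero (up to floating-point rounding).
residual_sum = sum(residuals)

# Property 2: two nearby alternative lines, for comparison with the fitted line.
sse_best = sum_squared_residuals(intercept, slope)
sse_steeper = sum_squared_residuals(intercept, slope + 0.1)
sse_shifted = sum_squared_residuals(intercept + 0.1, slope)

print(residual_sum, sse_best, sse_steeper, sse_shifted)
```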
Textbook Pg 197
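Properties of Residuals
For i = 1, 2, …, n:
1. $E(\hat{\varepsilon}_i) = 0$
2. $\sigma^2_{\hat{\varepsilon}_1} = \sigma^2_{\hat{\varepsilon}_2} = \dots = \sigma^2_{\hat{\varepsilon}_n} = \sigma^2$
3. $\hat{\varepsilon}_i \sim N(0, \sigma^2)$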
Exercise 4
i. By using Excel, find the best-fit line and the regression equation, and interpret the slope of the regression line.
ii. How do you relate Exercise 1 to the results obtained from i.?
[Figure: two scatter plots of y-variable against x-variable.]
• Why regression? (LO6 Textbook)
• What is the role of DV and IV?
• Perform the experiment! Interchange the pair (x, y) to (y, x). Re-calculate the correlation coefficient & find the best-fit line. What do you notice?
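One way to run the experiment outside Excel is sketched below. The data and the helper name `fit_and_correlate` are hypothetical; the slope is computed with the standard least-squares result, which is an assumption not spelled out on the slide.

```python
import math

def fit_and_correlate(u, v):
    """Return (r, slope, intercept) for the least-squares line of v on u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    r = s_uv / math.sqrt(s_uu * s_vv)
    slope = s_uv / s_uu
    intercept = v_bar - slope * u_bar
    return r, slope, intercept

# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Original roles: x is the IV, y is the DV.
r_xy, slope_y_on_x, intercept_y_on_x = fit_and_correlate(x, y)

# Interchanged roles: y is treated as the IV, x as the DV.
r_yx, slope_x_on_y, intercept_x_on_y = fit_and_correlate(y, x)

print("r:", r_xy, r_yx)
print("slopes:", slope_y_on_x, slope_x_on_y)
```

Comparing the two printed lines gives a starting point for the discussion: correlation treats the two variables symmetrically, while the fitted line depends on which variable is taken as the DV.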
5.4
ASSESSING LEAST-SQUARE REGRESSION LINE
Textbook Pg 200
VARIATIONS AROUND LEAST-SQUARE
REGRESSION LINE
Total Deviation:
Vertical distance from a data point to $\bar{y}$.
Regression analysis involves measuring the amount of variation not accounted for by the
regression equation; this variation is known as the unexplained variation/deviation.
Textbook Pg 201
VARIATIONS AROUND LEAST-SQUARE REGRESSION LINE
Total Variation
Variation of each observed y around $\bar{y}$: $\sum (y - \bar{y})^2$
[Figure: two scatter plots of y-variable against x-variable.]
Which plot would give a higher explained variation component? Why?
• When the ………….. variation is small, the value of r is close to ………….. . Why?
• If all points fall on the regression line, the unexplained variation will be …… and the
sample correlation coefficient value will be …………. ?
Goodness of Fit : Coefficient of Determination
Textbook Pg 202
Properties of Coefficient of Determination
• If $r^2 = 0$, then the regression line cannot explain any of the variation, which
means that the …………. variable cannot be predicted from the ………….
variable.
Is it possible to get $r^2 = 1$?
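The split of the total variation into explained and unexplained parts can be illustrated numerically. In this sketch the data are made up, and the names `sst`, `ssr` and `sse` are assumed labels for the total, explained and unexplained variation; the formulas used for the last two are the standard ones and are not printed in these notes.

```python
import math

# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# Total variation: each observed y around y-bar.
sst = sum((yi - y_bar) ** 2 for yi in y)
# Explained variation: each fitted value around y-bar.
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
# Unexplained variation: each observed y around its fitted value.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Coefficient of determination as the explained share of the total variation.
r_squared = ssr / sst
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"explained = {100 * r_squared:.2f}%, unexplained = {100 * (1 - r_squared):.2f}%")
```

For a least-squares line, the explained share computed this way should match the square of the sample correlation coefficient, which is why the same symbol is used for both.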
Exercise 5
By using Excel and from Exercise 4,
i. What is the percentage of explained variation and unexplained variation?
ii. Calculate & interpret the coefficient of determination for both plots.
[Figure: two scatter plots of y-variable against x-variable.]
Solution:
Residuals
Analysis
Textbook Pg 204
Residuals Analysis:
How to Check Assumptions of a Regression Model ?
• Mean of the residuals is always zero.
• x-axis can be x-variable/IV or fitted value/predicted value.
Solution:
[Figure: two scatter plots of y-variable against x-variable.]
Solution:
Which part do you need help with?
QUESTIONS FOR SELF-REFLECTION
• Variation is everywhere! What is the difference between explained variation and unexplained variation?
Textbook Pg 200
Hypothesis testing for population slope, 𝛽1
Decision
• There is sufficient evidence to conclude that there is a linear relationship between y-variable and x-variable.
• There is insufficient evidence to conclude that there is a linear relationship between y-variable and x-variable.
Remarks:
Always write the conclusion:
• in context of the x-variable & y-variable.
• the significance level.
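A test of the population slope can be sketched as follows. The formulas are the standard ones for simple linear regression and are assumptions here, since they do not survive in these notes: the standard error of the slope is sqrt(MSE / Sxx) with MSE = SSE / (n − 2). The data are made up, and the critical value 3.182 is an assumption read from a standard t table (two-tailed, α = 0.05, df = 3).

```python
import math

# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Unexplained variation and its mean square on n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Assumed standard formulas: SE(slope) = sqrt(MSE / Sxx), t = slope / SE(slope).
se_slope = math.sqrt(mse / s_xx)
t_slope = slope / se_slope

# For comparison: the t statistic for the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
t_corr = r * math.sqrt((n - 2) / (1 - r ** 2))

# Assumption: two-tailed critical value t(0.025, df = 3) from a standard t table.
t_critical = 3.182
reject_h0 = abs(t_slope) > t_critical

print(f"t (slope) = {t_slope:.3f}, t (correlation) = {t_corr:.3f}, reject H0: {reject_h0}")
```

In simple linear regression the two t statistics coincide, so testing the slope and testing the correlation lead to the same decision.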
Exercise 7: By using the Excel Data Analysis ToolPak, perform the hypothesis testing for
significance of the linear relationship.
Excel Function: Textbook Pg 216
• The regression line can be used to make predictions for the …………. variable.
• The magnitude of change in one variable when the other variable changes by exactly
1 unit is called a marginal change, represented by the value of …………..
• Do not use the equation for predicting y when the value of x is not in the range of
sample data used to develop the equation. Why?
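The warning about the range of x can be built into a prediction helper. This is a minimal sketch with made-up data; the function name `make_predictor` and the choice to raise an error for out-of-range x are illustrative assumptions, not something prescribed by the textbook.

```python
def make_predictor(x, y):
    """Fit a least-squares line and return a function that refuses to extrapolate."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    x_min, x_max = min(x), max(x)

    def predict(x_new: float) -> float:
        # Only predict inside the range of x used to develop the equation.
        if not (x_min <= x_new <= x_max):
            raise ValueError(
                f"x = {x_new} is outside the sample range [{x_min}, {x_max}]"
            )
        return intercept + slope * x_new

    return predict

# Hypothetical paired data, for illustration only (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

predict = make_predictor(x, y)
print(predict(3.0))   # inside the sample range, so a prediction is returned
```

Calling `predict(10.0)` with this data raises `ValueError`, because the sample gives no information about how y behaves for x beyond 5.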
QUESTIONS FOR
SELF-REFLECTION
F - test
t-test
ANOVA Table
REFRESH YOUR
MIND
REFLECT & EXPLAIN
Least Square Method Regression Normal Probability Plot
Correlation Regression