
FEM 2063 - Data Analytics

Chapter 2
Simple Linear Regression

At the end of this chapter, students should be able to understand Simple Linear Regression.
Overview
2.1 Background
2.2 Introduction
2.3 Regression
2.4 Least Squares Method
2.5 Simple Linear Regression (SLR)
2.6 Software Output
2.7 ANOVA
2.8 Model Evaluation
2.9 Applications/Examples
2.1 Background - Regression

 A relation between variables where changes in some variables may "explain" changes in other variables.
 A regression model estimates the nature of the relationship between independent and dependent variables.
 Dependent variable: employment income.
 Independent variables: hours of work, education, occupation, sex, age, region, years of experience, etc.
2.1 Background – Regression Model

 Price of a product and quantity produced:

 Quantity affected by price:
 Dependent variable is quantity of product.
 Independent variable is price.

 Price affected by quantity offered for sale:
 Dependent variable is price.
 Independent variable is quantity sold.
2.1 Background – Types of Regression

Regression models are classified by the number of independent variables and the form of the relationship:

 Simple (1 variable): Linear or Non-linear
 Multiple (2+ variables): Linear or Non-linear
2.1 Background – Types of Regression

Bivariate or simple regression model:
(Education) x → y (Income)

Multivariate or multiple regression model:
(Education) x1, (Sex) x2, (Experience) x3, (Age) x4 → y (Income)
2.2 Introduction – Simple Regression

Quantitative analysis uses current information to predict future behavior. Current information is usually in the form of a set of data. When the data form a set of pairs of numbers, we may interpret them as observed values of an independent (predictor) variable X and a dependent (response) variable Y.
2.2 Introduction – Simple Regression

The goal is to find a functional relation between the response variable y and the predictor variable x:

y = f(x)

[Scatter plot: man-hours (0–180) on the vertical axis against x (10–90) on the horizontal axis.]
2.2 Introduction - Regression Function

 Regard Y as a random variable.
 For each x, take f(x) to be the expected value (i.e. mean value) of Y.
 Given that E(Y) denotes the expected value of Y, the equation below is called the regression function:

E(Y) = f(x)
2.2 Introduction - Regression Application

Three major applications


 Description
 Control

 Prediction

2.3 Regression

 Selection of independent variable(s):
Choose the most important predictor variable(s).

 Scope of model:
We may need to restrict the coverage of the model to some interval or region of values of the independent variable(s), depending on the needs/requirements.
2.3 Regression - Population & Sample

2.3 Regression - Regression Model

General regression model:

Y = β₀ + β₁X + ε

where
β₀ and β₁ are parameters,
X is a known constant,
and the deviations ε are independent N(0, σ²).
2.3 Regression - Regression Coefficients

 The values of the regression parameters β₀ and β₁ are not known. We estimate them from data.

 β₁ indicates the change in the mean response per unit increase in X.
2.3 Regression - Regression Line

 If the scatter plot of our sample data suggests a linear relationship between the two variables, i.e.

ŷ = β̂₀ + β̂₁x

the relationship can be summarized by a straight-line plot.

 The least squares method gives us the "best" estimated line for our set of sample data.
2.3 What is LR model used for?
Linear regression models are used to show or predict the relationship between two variables or factors. The factor being predicted is called the dependent variable.

Example of Linear Regression

Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

© 2019 Petroliam Nasional Berhad (PETRONAS) | 16


2.3 Regression - Regression Line

• Investigating the dependence of one variable (the dependent variable) on one or more other variables (independent variables) using a straight line.
2.3 Regression - Regression Line

Example of Linear Regression: blood pressure reading vs stress test score.


2.4 Least Squares Method
The least squares method is a statistical procedure to
find the best fit for a set of data points by minimizing the
sum of the offsets or residuals of points from the plotted
curve. Least squares regression is used to predict the
behavior of dependent variables.

2.4 Least Squares Method

2.4 Least Squares Method

 'Best fit' means the difference between actual Y values and predicted Y values is a minimum.
 But positive differences offset negative ones, so we square the errors!

 The least squares method minimizes the Sum of Squared Errors (SSE).
2.4 Least Squares Method

2.4 Why Linear regression is used?

Regression analysis mathematically describes the


relationship between independent variables and the
dependent variable. It also allows you to predict the mean
value of the dependent variable when you specify values for
the independent variables.
2.4 Linear regression in forecasting



2.4 Assumptions in SLR

 Independent observations: observations are independent of each other.

 Linear relationship: the relationship between X and the mean of Y is linear.

 Normal distribution of error terms: the residuals are normally distributed.

 No auto-correlation: the residuals are independent of each other; there is no serial correlation in the residuals.
2.5 SLR - Computation

 Write the estimated regression line based on sample data as

ŷ = β̂₀ + β̂₁x

 The method of least squares chooses the values β̂₀ and β̂₁ to minimize the sum of squared errors (SSE):

SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
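The SSE minimization above has a closed-form solution in terms of the data sums. A minimal pure-Python sketch (the helper name `fit_slr` is ours, not from the slides):

```python
def fit_slr(x, y):
    """Least-squares estimates (b0, b1) minimizing SSE = sum (y_i - b0 - b1*x_i)^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b0 = (sy - b1 * sx) / n                         # intercept
    return b0, b1

# Points lying exactly on y = 1 + 2x are recovered exactly:
b0, b1 = fit_slr([1, 2, 3, 4], [3, 5, 7, 9])
```

On exact straight-line data the estimates reproduce the line, which is a quick sanity check of the formulas.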
2.5 SLR - Computation

Example
Example 1:
The manager of a car plant wishes to investigate how the
plant’s electricity usage depends upon the plant production. The
data is given below

Production ($million) (x): 4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity Usage (y): 2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53

Estimate the linear regression equation


Example – manual calculations

x:  4.51  3.58  4.31  5.06  5.64  4.99  5.29  5.83  4.70  5.61  4.90  4.20   Σx = 58.62
y:  2.48  2.26  2.47  2.77  2.99  3.05  3.18  3.46  3.03  3.26  2.67  2.53   Σy = 34.15
xy: 11.18 8.09  10.65 14.02 16.86 15.22 16.82 20.17 14.24 18.29 13.08 10.63  Σxy = 169.25
x²: 20.34 12.82 18.58 25.60 31.81 24.90 27.98 33.99 22.09 31.47 24.01 17.64  Σx² = 291.23
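The manual calculation above can be checked in code; a sketch using the raw Example 1 data (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
sx, sy = sum(x), sum(y)                         # 58.62, 34.15
sxy = sum(a * b for a, b in zip(x, y))          # ~169.25
sxx = sum(a * a for a in x)                     # ~291.23
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope ~0.4988
b0 = (sy - b1 * sx) / n                         # intercept ~0.4091
```

Working from the raw data (rather than the rounded column sums) reproduces the slope and intercept reported by the software output later in the chapter.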
Estimated Regression Line: ŷ = 0.4091 + 0.4988x
2.5 SLR - Estimation of Mean Response
 The fitted regression line can be used to estimate the mean value of y for a given value of x.
 Example: the weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.
2.5 SLR – Estimation of Mean Response

 From the previous table:

n = 10,  Σx = 564,  Σx² = 32604

 The least squares estimates of the regression coefficients are:

β̂₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = 10.8

β̂₀ = ȳ − β̂₁x̄ = 1436.5 − 10.8(56.4) = 828
2.5 SLR – Estimation of Mean Response

 The estimated regression function is:

ŷ = 828 + 10.8x

 This means that if the weekly advertising expenditure is increased by $1, we would expect weekly sales to increase by $10.80.
2.5 SLR – Estimation of Mean Response

 Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.

 For $50 of expenditure, the estimated sales are:

Sales = 828 + 10.8(50) = 1368

 This is called the point estimate (forecast) of the mean response (sales).
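The point estimate above is simply the fitted line evaluated at x = 50; a tiny sketch (the `predict` helper is our own name):

```python
b0, b1 = 828.0, 10.8   # estimates from the advertising example

def predict(x):
    """Point estimate of the mean response at expenditure x."""
    return b0 + b1 * x

sales_at_50 = predict(50)   # 828 + 10.8 * 50 = 1368
```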
2.6 Software Output - Example

2.7 ANOVA

 ANOVA (Analysis of Variance) is the term for statistical


analyses of the different sources of variation.
 Partitioning of sums of squares and degrees of freedom
associated with the response variable.
 In the regression setting, the observed variation in the
responses (yi) comes from two sources.
2.7 ANOVA – SST, SSE & SSR

 The measure of total variation, denoted by SST, is the sum of the squared deviations:

SST = ∑(yᵢ − ȳ)²

 If SST = 0, all observations are the same (no variability).
 The greater SST is, the greater the variation among the y values.
 In the regression model, the measure of variation of the y observations around the fitted line is:

yᵢ − ŷᵢ
2.7 ANOVA – SST, SSE & SSR

 Total Sum of Squares (SST):
 Measures how much variance is in the dependent variable.
 Made up of the SSE and SSR:

SST = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²

SST = SSE + SSR

with degrees of freedom: n − 1 = (n − 2) + 1
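The decomposition SST = SSE + SSR can be verified numerically; a sketch using the car-plant data from Example 1 (assuming the least-squares fit described earlier):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
sst = sum((b - ybar) ** 2 for b in y)             # total variation
sse = sum((b - f) ** 2 for b, f in zip(y, yhat))  # unexplained (around the line)
ssr = sum((f - ybar) ** 2 for f in yhat)          # explained by the regression
# sst == sse + ssr up to floating-point error,
# with degrees of freedom n - 1 = (n - 2) + 1.
```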
2.7 ANOVA - Mean Squares (MS)

 A sum of squares divided by its degrees of freedom is called a mean square (MS).

 Mean Square Regression (MSR):

MSR = SSR / 1

 Mean Square Error (MSE):

MSE = SSE / (n − 2)
2.8 Model Evaluation

SLR model evaluation using software output:

(i) Standard error of estimate (s)
(ii) Coefficient of determination (R²)
(iii) Hypothesis tests:
a) The t-test of the slope
b) The F-test of the slope
2.8 Model Evaluation - (i) Standard error of estimate (s)

 Compute the standard error of estimate from

σ̂² = SSE / (n − 2)

where

SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

 The smaller the SSE, the more successful the linear regression model is in explaining y.
2.8 Model Evaluation – (ii) Coefficient of Determination

 Coefficient of determination:

R² = SSR / SST = 1 − SSE / SST

 The proportion of variability in the observed dependent variable that is explained by the linear regression model.
 The coefficient of determination, denoted R², measures the strength of the linear relationship.
 The greater R², the more successful the linear model.
 An R² value close to 1 indicates a good fit; a value close to 0 indicates a poor fit.
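Both the standard error of estimate and R² can be computed from the residuals; a sketch on the Example 1 car-plant data (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sst = sum((b - ybar) ** 2 for b in y)
s = (sse / (n - 2)) ** 0.5   # standard error of estimate, ~0.173
r2 = 1 - sse / sst           # coefficient of determination, ~0.802
```

The two numbers match the Example 1 results quoted later in the chapter (s = 0.173, R² = 0.802).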
2.8 R-squared



2.8 Model Evaluation – (iii) The hypothesis test

Hypothesis testing: a decision-making procedure about the null hypothesis.

 The Null Hypothesis (H0): the hypothesis that cannot be viewed as false unless sufficient evidence to the contrary is obtained.
 The Alternative Hypothesis (H1): the hypothesis against which the null hypothesis is tested, and which is viewed as true when H0 is declared false.


2.8 Model Evaluation – (iii) The hypothesis test

Hypothesis test
• A process that uses sample statistics to test a claim about the value of a population parameter.
• Example: an automobile manufacturer advertises that its new hybrid car has a mean mileage of 50 miles per gallon. To test this claim, a sample would be taken. If the sample mean differs enough from the advertised mean, you can decide the advertisement is wrong.

2.8 Model Evaluation – (iii) The hypothesis test

• One-sided (tailed) lower-tail test
• One-sided (tailed) upper-tail test
• Two-sided (tailed) test

Note: μ0 is the value given/assumed for the parameter μ.


2.8 Model Evaluation – (iii) The hypothesis testing


2.8 Model Evaluation – (iii) The hypothesis test

 Equivalence of F-test and t-test: for a given α level, the F-test of β₁ = 0 versus β₁ ≠ 0 is algebraically equivalent to the two-sided t-test.

 Thus, at a given α level, we can use either the t-test or the F-test for testing β₁ = 0 versus β₁ ≠ 0.

 The t-test is more flexible since it can also be used for one-sided tests.
2.8 Model Evaluation – (iii) The hypothesis test (a. t-test)

 The t-test checks whether there is an adequate linear relationship between x and y.
 Test the hypotheses:

H0: β₁ = 0 (no relationship between x and y)
H1: β₁ ≠ 0 (there is a relationship between x and y)

 Test statistic (t-distribution with n − 2 degrees of freedom):

T = (β̂₁ − β₁) / se(β̂₁), where se(β̂₁) = s / √Sxx

 Critical region: |T| > tα/2, n−2
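Applied to the Example 1 car-plant data, the t statistic can be computed directly; a sketch (2.228 is the tabled value t(0.025, 10) used later in the chapter):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = (sse / (n - 2)) ** 0.5       # standard error of estimate
t_stat = b1 / (s / sxx ** 0.5)   # slope / se(slope), ~6.37
reject_h0 = abs(t_stat) > 2.228  # True: the slope is significant
```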
2.8 Model Evaluation – (iii) The hypothesis test (b. F-test)

 In order to construct a statistical decision rule, we need to know the distribution of our test statistic F:

F = MSR / MSE

 When H0 is true, the test statistic F follows the F-distribution with 1 and n − 2 degrees of freedom: F(α; 1, n−2).
2.8 Model Evaluation – (iii) The hypothesis test (b. F-test)

 This time we will use the F-test. The null and alternative hypotheses are:

H0: β₁ = 0,  H1: β₁ ≠ 0

 Construction of the decision rule: at the α = 5% level, reject H0 if F > F(α; 1, n−2).

 Large values of F support H1, and values of F near 1 support H0.
Excel steps
and outputs
2.7 Example 1

 The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant production. The data is given below.

Production (x) ($M): 4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity Usage (y) (kWh): 2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53

i. Estimate the linear regression equation.
ii. Find the standard error of estimate of this regression.
iii. Determine the coefficient of determination of this regression.
iv. Test for significance of regression at the 5% significance level.
Excel results: Regression Line

[Production line fit plot: electricity (y) against production (x), with fitted line f(x) = 0.4988x + 0.4091 and R² = 0.802.]
2.7 Example 1

Estimated Regression Line:

Electricity usage = 0.4091 + 0.4988 × Production

Standard Error of Estimate s = 0.173
Coefficient of Determination R² = 0.802
2.7 Example 1

α = 0.05; tα/2, n−2 = t0.025,10 = 2.228. Critical region: |T| > tα/2, n−2.

The computed test statistic is T = 6.37. Since 6.37 > 2.228, reject H0; thus electricity usage does depend on the level of production.
2.7 Example 1
 Using the F-test. The null and alternative hypotheses are:

H0: β₁ = 0,  H1: β₁ ≠ 0

 α = 0.05. Since n = 12, we require F(0.05; 1, 10). From the table, F(0.05; 1, 10) = 4.96.

 Decision rule: reject H0, since

F = 40.53 > 4.96

 There is a linear association between electricity usage and the level of production.
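The F statistic for Example 1 can be reproduced from the data; a sketch (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
msr = sum((f - ybar) ** 2 for f in yhat) / 1                # MSR = SSR / 1
mse = sum((b - f) ** 2 for b, f in zip(y, yhat)) / (n - 2)  # MSE = SSE / (n - 2)
f_stat = msr / mse                                          # ~40.53
reject_h0 = f_stat > 4.96                                   # F(0.05; 1, 10) = 4.96
```

Note that f_stat is also (up to rounding) the square of the t statistic from the previous test, illustrating the t/F equivalence stated earlier.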
2.7 Example 1 - Interpretation
• Production coefficient (β̂₁ = 0.498): each unit ($million) increase in production adds 0.498 to electricity usage.
• β̂₁ > 0 (positive relationship): electricity usage increases with an increase in production.
• Intercept coefficient (β̂₀ = 0.409): the electricity usage when production equals zero.
• R Square = 0.802: the model explains 80% of the total variability in electricity usage around its mean (good fit).
• P-value < 0.05: the regression is significant; the change in production impacts electricity usage.
Excel Results – Example 2
Regression Statistics:
Multiple R = 0.680322
R Square = 0.462837
Adjusted R Square = 0.461716
Standard Error = 0.40947
Observations = 481

ANOVA:
Regression: df = 1, SS = 69.19926, MS = 69.19926, F = 412.7226, Significance F = 1.22E-66
Residual: df = 479, SS = 80.31167, MS = 0.167665
Total: df = 480, SS = 149.5109

Coefficients:
Intercept: 0.309739, Standard Error = 0.019769, t Stat = 15.66798, P-value = 5.73E-45
Permeability (md): 0.00171, Standard Error = 8.42E-05, t Stat = 20.31558, P-value = 1.22E-66
Excel Results – Example 2

[Permeability (md) line fit plot: RQI against permeability (0–2500 md), with fitted line f(x) = 0.00171x + 0.3097 and R² = 0.4628.]
2.5.3a Interpretation of the results - Example 2
• Permeability (md) coefficient (β̂₁ = 0.0017): each unit increase in permeability adds 0.0017 to the RQI value when all other variables are fixed.
• β̂₁ > 0 (positive relationship): RQI increases with an increase in permeability.
• Intercept coefficient (β̂₀ = 0.309): the value of RQI when permeability equals zero.
• R Square = 0.462837: the model explains 46% of the total variability in the RQI values around its mean.
• P-value < 0.05: the regression is significant.
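Using the fitted coefficients from the Example 2 output, a prediction at an illustrative permeability of 1000 md (the helper name and the input value are ours, for illustration only):

```python
b0, b1 = 0.309739, 0.00171   # intercept and slope from the Excel output

def predict_rqi(permeability_md):
    """Fitted RQI at a given permeability (md)."""
    return b0 + b1 * permeability_md

rqi_at_1000 = predict_rqi(1000.0)   # ~2.02
```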
