
FEM 2063 - Data Analytics

Chapter 2
Simple Linear Regression

At the end of this chapter, students should be able to understand Simple Linear Regression.
Overview
2.1 Background
2.2 Introduction
2.3 Regression
2.4 Least Squares Method
2.5 Simple Linear Regression (SLR)
2.6 Software Output
2.7 ANOVA
2.8 Model Evaluation
2.9 Applications/Examples
2.1 Background - Regression

 A relation between variables where changes in some variables may "explain" changes in other variables.
 A regression model estimates the nature of the relationship between independent and dependent variables.
 Dependent variable: employment income.
 Independent variables: hours of work, education, occupation, sex, age, region, years of experience, etc.
2.1 Background – Regression Model

 Price of a product and quantity produced:

 Quantity affected by price:
 Dependent variable is quantity of product.
 Independent variable is price.

 Price affected by quantity offered for sale:
 Dependent variable is price.
 Independent variable is quantity sold.
2.1 Background – Types of Regression

Regression models are classified by the number of independent variables and the form of the relationship:

 Simple (1 variable): Linear or Non-linear
 Multiple (2+ variables): Linear or Non-linear
2.1 Background – Types of Regression

Bivariate or simple regression model:
(Education) x → y (Income)

Multivariate or multiple regression model:
(Education) x1, (Sex) x2, (Experience) x3, (Age) x4 → y (Income)
2.2 Introduction – Simple Regression

Quantitative analysis uses current information to predict future behavior. Current information is usually in the form of a set of data. When the data form a set of pairs of numbers, we may interpret them as observed values of an independent (predictor) variable X and a dependent (response) variable Y.
2.2 Introduction – Simple Regression

The goal is to find a functional relation between the response variable y and the predictor variable x:

y = f(x)

[Scatter plot: man-hours (0–180) on the vertical axis against x (10–90) on the horizontal axis.]
2.2 Introduction - Regression Function

 Regard Y as a random variable.
 For each x, take f(x) to be the expected value (i.e. mean value) of Y.
 Given that E(Y) denotes the expected value of Y, the equation below is called the regression function:

E(Y) = f(x)
2.2 Introduction - Regression Application

Three major applications


 Description
 Control

 Prediction

2.3 Regression

 Selection of independent variable(s):
Choose the most important predictor variable(s).

 Scope of model:
We may need to restrict the coverage of the model to some interval or region of values of the independent variable(s), depending on the needs/requirements.
2.3 Regression - Population & Sample

2.3 Regression - Regression Model

General regression model:

Y = β₀ + β₁X + ε

where
β₀ and β₁ are parameters,
X is a known constant,
and the deviations ε are independent N(0, σ²).
2.3 Regression - Regression Coefficients

 The values of the regression parameters β₀ and β₁ are not known. We estimate them from data.

 β₁ indicates the change in the mean response per unit increase in X.
2.3 Regression - Regression Line

 If the scatter plot of our sample data suggests a linear relationship between the two variables, i.e.

ŷ = β̂₀ + β̂₁x

the relationship can be summarized by a straight-line plot.

 The least squares method gives us the "best" estimated line for our set of sample data.
2.3 What is LR model used for?
Linear regression models are used to show or predict the relationship between two variables or factors. The factor being predicted is called the dependent variable.

Example of Linear Regression

Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

© 2019 Petroliam Nasional Berhad (PETRONAS) | 16


2.3 Regression - Regression Line

• Investigating the dependence of one variable (the dependent variable) on one or more other variables (independent variables) using a straight line.
2.3 Regression - Regression Line

Example of Linear Regression: blood pressure reading vs stress test score.


2.4 Least Squares Method
The least squares method is a statistical procedure to
find the best fit for a set of data points by minimizing the
sum of the offsets or residuals of points from the plotted
curve. Least squares regression is used to predict the
behavior of dependent variables.

2.4 Least Squares Method

2.4 Least Squares Method

 'Best fit' means the difference between actual Y values and predicted Y values is a minimum.
 But positive differences offset negative ones, so we square the errors!

 The least squares method minimizes the Sum of Squared Errors (SSE).
2.4 Least Squares Method

2.4 Why Linear regression is used?

Regression analysis mathematically describes the


relationship between independent variables and the
dependent variable. It also allows you to predict the mean
value of the dependent variable when you specify values for
the independent variables.
2.4 Linear regression in forecasting



2.4 Assumptions in SLR

 Independent observations: observations are independent of each other.

 Linear relationship: the relationship between X and the mean of Y is linear.

 Normal distribution of error terms: the residuals are normally distributed.

 No auto-correlation: the residuals are independent of each other; there is no serial correlation in the residuals.
2.5 SLR - Computation

 Write the estimated regression line based on sample data as

ŷ = β̂₀ + β̂₁x

 The method of least squares chooses the values β̂₀ and β̂₁ to minimize the sum of squared errors (SSE):

SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
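The SSE minimization above has a closed-form solution in terms of the data sums. A minimal pure-Python sketch (the helper name `fit_slr` is ours, not from the slides):

```python
def fit_slr(x, y):
    """Least-squares estimates (b0, b1) minimizing SSE = sum (y_i - b0 - b1*x_i)^2."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b0 = (sy - b1 * sx) / n                         # intercept
    return b0, b1

# Points lying exactly on y = 1 + 2x are recovered exactly:
b0, b1 = fit_slr([1, 2, 3, 4], [3, 5, 7, 9])
```

On exact straight-line data the estimates reproduce the line, which is a quick sanity check of the formulas.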
2.5 SLR - Computation

Example
Example 1:
The manager of a car plant wishes to investigate how the
plant’s electricity usage depends upon the plant production. The
data is given below

Production ($million) (x): 4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity Usage (y): 2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53

Estimate the linear regression equation


Example – manual calculations

x:  4.51  3.58  4.31  5.06  5.64  4.99  5.29  5.83  4.70  5.61  4.90  4.20   Σx = 58.62
y:  2.48  2.26  2.47  2.77  2.99  3.05  3.18  3.46  3.03  3.26  2.67  2.53   Σy = 34.15
xy: 11.18 8.09  10.65 14.02 16.86 15.22 16.82 20.17 14.24 18.29 13.08 10.63  Σxy = 169.25
x²: 20.34 12.82 18.58 25.60 31.81 24.90 27.98 33.99 22.09 31.47 24.01 17.64  Σx² = 291.23
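The manual calculation above can be checked in code; a sketch using the raw Example 1 data (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
sx, sy = sum(x), sum(y)                         # 58.62, 34.15
sxy = sum(a * b for a, b in zip(x, y))          # ~169.25
sxx = sum(a * a for a in x)                     # ~291.23
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope ~0.4988
b0 = (sy - b1 * sx) / n                         # intercept ~0.4091
```

Working from the raw data (rather than the rounded column sums) reproduces the slope and intercept reported by the software output later in the chapter.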
Estimated Regression Line: ŷ = 0.4091 + 0.4988x
2.5 SLR - Estimation of Mean Response
 The fitted regression line can be used to estimate the mean value of y for a given value of x.
 Example: the weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.
2.5 SLR – Estimation of Mean Response

 From the previous table:

n = 10,  Σx = 564,  Σx² = 32604

 The least squares estimates of the regression coefficients are:

β̂₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = 10.8

β̂₀ = ȳ − β̂₁x̄ = 1436.5 − 10.8(56.4) = 828
2.5 SLR – Estimation of Mean Response

 The estimated regression function is:

ŷ = 828 + 10.8x

 This means that if the weekly advertising expenditure is increased by $1, we would expect weekly sales to increase by $10.80.
2.5 SLR – Estimation of Mean Response

 Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.

 For $50 of expenditure, the estimated sales are:

Sales = 828 + 10.8(50) = 1368

 This is called the point estimate (forecast) of the mean response (sales).
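The point estimate above is simply the fitted line evaluated at x = 50; a tiny sketch (the `predict` helper is our own name):

```python
b0, b1 = 828.0, 10.8   # estimates from the advertising example

def predict(x):
    """Point estimate of the mean response at expenditure x."""
    return b0 + b1 * x

sales_at_50 = predict(50)   # 828 + 10.8 * 50 = 1368
```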
2.6 Software Output - Example

2.7 ANOVA

 ANOVA (Analysis of Variance) is the term for statistical


analyses of the different sources of variation.
 Partitioning of sums of squares and degrees of freedom
associated with the response variable.
 In the regression setting, the observed variation in the
responses (yi) comes from two sources.
2.7 ANOVA – SST, SSE & SSR

 The measure of total variation, denoted by SST, is the sum of the squared deviations:

SST = ∑(yᵢ − ȳ)²

 If SST = 0, all observations are the same (no variability).
 The greater SST is, the greater the variation among the y values.
 In the regression model, the measure of variation of the y observations around the fitted line is:

yᵢ − ŷᵢ
2.7 ANOVA – SST, SSE & SSR

 Total Sum of Squares (SST):
 Measures how much variance is in the dependent variable.
 Made up of the SSE and SSR:

SST = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²

SST = SSE + SSR

with degrees of freedom: n − 1 = (n − 2) + 1
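The decomposition SST = SSE + SSR can be verified numerically; a sketch using the car-plant data from Example 1 (assuming the least-squares fit described earlier):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
sst = sum((b - ybar) ** 2 for b in y)             # total variation
sse = sum((b - f) ** 2 for b, f in zip(y, yhat))  # unexplained (around the line)
ssr = sum((f - ybar) ** 2 for f in yhat)          # explained by the regression
# sst == sse + ssr up to floating-point error,
# with degrees of freedom n - 1 = (n - 2) + 1.
```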
2.7 ANOVA - Mean Squares (MS)

 A sum of squares divided by its degrees of freedom is called a mean square (MS).

 Mean Square Regression (MSR):

MSR = SSR / 1

 Mean Square Error (MSE):

MSE = SSE / (n − 2)
2.8 Model Evaluation

SLR model evaluation using software output:

(i) Standard error of estimate (s)
(ii) Coefficient of determination (R²)
(iii) Hypothesis tests:
a) The t-test of the slope
b) The F-test of the slope
2.8 Model Evaluation - (i) Standard error of estimate (s)

 Compute the standard error of estimate from

σ̂² = SSE / (n − 2)

where

SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

 The smaller the SSE, the more successful the linear regression model is in explaining y.
2.8 Model Evaluation – (ii) Coefficient of Determination

 Coefficient of determination:

R² = SSR / SST = 1 − SSE / SST

 The proportion of variability in the observed dependent variable that is explained by the linear regression model.
 The coefficient of determination, denoted R², measures the strength of the linear relationship.
 The greater R², the more successful the linear model.
 An R² value close to 1 indicates a good fit; a value close to 0 indicates a poor fit.
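Both the standard error of estimate and R² can be computed from the residuals; a sketch on the Example 1 car-plant data (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sst = sum((b - ybar) ** 2 for b in y)
s = (sse / (n - 2)) ** 0.5   # standard error of estimate, ~0.173
r2 = 1 - sse / sst           # coefficient of determination, ~0.802
```

The two numbers match the Example 1 results quoted later in the chapter (s = 0.173, R² = 0.802).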
2.8 R-squared



2.8 Model Evaluation – (iii) The hypothesis test

Hypothesis testing: a decision-making procedure about the null hypothesis.

 The Null Hypothesis (H0): the hypothesis that cannot be viewed as false unless sufficient evidence to the contrary is obtained.
 The Alternative Hypothesis (H1): the hypothesis against which the null hypothesis is tested, and which is viewed as true when H0 is declared false.


2.8 Model Evaluation – (iii) The hypothesis test

Hypothesis test
• A process that uses sample statistics to test a claim about the value of a population parameter.
• Example: an automobile manufacturer advertises that its new hybrid car has a mean mileage of 50 miles per gallon. To test this claim, a sample would be taken. If the sample mean differs enough from the advertised mean, you can decide the advertisement is wrong.

2.8 Model Evaluation – (iii) The hypothesis test

• One-sided (tailed) lower-tail test
• One-sided (tailed) upper-tail test
• Two-sided (tailed) test

Note: μ0 is the value given/assumed for the parameter μ.


2.8 Model Evaluation – (iii) The hypothesis testing


2.8 Model Evaluation – (iii) The hypothesis test

 Equivalence of F-test and t-test: for a given α level, the F-test of β₁ = 0 versus β₁ ≠ 0 is algebraically equivalent to the two-sided t-test.

 Thus, at a given α level, we can use either the t-test or the F-test for testing β₁ = 0 versus β₁ ≠ 0.

 The t-test is more flexible since it can also be used for one-sided tests.
2.8 Model Evaluation – (iii) The hypothesis test (a. t-test)

 The t-test checks whether there is an adequate linear relationship between x and y.
 Test the hypotheses:

H0: β₁ = 0 (no relationship between x and y)
H1: β₁ ≠ 0 (there is a relationship between x and y)

 Test statistic (t-distribution with n − 2 degrees of freedom):

T = (β̂₁ − β₁) / se(β̂₁), where se(β̂₁) = s / √Sxx

 Critical region: |T| > tα/2, n−2
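Applied to the Example 1 car-plant data, the t statistic can be computed directly; a sketch (2.228 is the tabled value t(0.025, 10) used later in the chapter):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = (sse / (n - 2)) ** 0.5       # standard error of estimate
t_stat = b1 / (s / sxx ** 0.5)   # slope / se(slope), ~6.37
reject_h0 = abs(t_stat) > 2.228  # True: the slope is significant
```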
2.8 Model Evaluation – (iii) The hypothesis test (b. F-test)

 In order to construct a statistical decision rule, we need to know the distribution of our test statistic F:

F = MSR / MSE

 When H0 is true, the test statistic F follows the F-distribution with 1 and n − 2 degrees of freedom: F(α; 1, n−2).
2.8 Model Evaluation – (iii) The hypothesis test (b. F-test)

 This time we will use the F-test. The null and alternative hypotheses are:

H0: β₁ = 0,  H1: β₁ ≠ 0

 Construction of the decision rule: at the α = 5% level, reject H0 if F > F(α; 1, n−2).

 Large values of F support H1, and values of F near 1 support H0.
Excel steps
and outputs
2.7 Example 1

 The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant production. The data is given below.

Production (x) ($M): 4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity Usage (y) (kWh): 2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53

i. Estimate the linear regression equation.
ii. Find the standard error of estimate of this regression.
iii. Determine the coefficient of determination of this regression.
iv. Test for significance of regression at the 5% significance level.
Excel results: Regression Line

[Production line fit plot: electricity (y) against production (x), with fitted line f(x) = 0.4988x + 0.4091 and R² = 0.802.]
2.7 Example 1

Estimated Regression Line:

Electricity usage = 0.4091 + 0.4988 × Production

Standard Error of Estimate s = 0.173
Coefficient of Determination R² = 0.802
2.7 Example 1

α = 0.05; tα/2, n−2 = t0.025,10 = 2.228. Critical region: |T| > tα/2, n−2.

The computed test statistic is T = 6.37. Since 6.37 > 2.228, reject H0; thus electricity usage does depend on the level of production.
2.7 Example 1
 Using the F-test. The null and alternative hypotheses are:

H0: β₁ = 0,  H1: β₁ ≠ 0

 α = 0.05. Since n = 12, we require F(0.05; 1, 10). From the table, F(0.05; 1, 10) = 4.96.

 Decision rule: reject H0, since

F = 40.53 > 4.96

 There is a linear association between electricity usage and the level of production.
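The F statistic for Example 1 can be reproduced from the data; a sketch (variable names are ours):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]
msr = sum((f - ybar) ** 2 for f in yhat) / 1                # MSR = SSR / 1
mse = sum((b - f) ** 2 for b, f in zip(y, yhat)) / (n - 2)  # MSE = SSE / (n - 2)
f_stat = msr / mse                                          # ~40.53
reject_h0 = f_stat > 4.96                                   # F(0.05; 1, 10) = 4.96
```

Note that f_stat is also (up to rounding) the square of the t statistic from the previous test, illustrating the t/F equivalence stated earlier.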
2.7 Example 1 - Interpretation
• Production coefficient (β̂₁ = 0.498): each unit ($million) increase in production adds 0.498 to electricity usage.
• β̂₁ > 0 (positive relationship): electricity usage increases with an increase in production.
• Intercept coefficient (β̂₀ = 0.409): the electricity usage when production equals zero.
• R Square = 0.802: the model explains 80% of the total variability in electricity usage around its mean (good fit).
• P-value < 0.05: the regression is significant; the change in production impacts electricity usage.
Excel Results – Example 2
Regression Statistics:
Multiple R = 0.680322
R Square = 0.462837
Adjusted R Square = 0.461716
Standard Error = 0.40947
Observations = 481

ANOVA:
Regression: df = 1, SS = 69.19926, MS = 69.19926, F = 412.7226, Significance F = 1.22E-66
Residual: df = 479, SS = 80.31167, MS = 0.167665
Total: df = 480, SS = 149.5109

Coefficients:
Intercept: 0.309739, Standard Error = 0.019769, t Stat = 15.66798, P-value = 5.73E-45
Permeability (md): 0.00171, Standard Error = 8.42E-05, t Stat = 20.31558, P-value = 1.22E-66
Excel Results – Example 2

[Permeability (md) line fit plot: RQI against permeability (0–2500 md), with fitted line f(x) = 0.00171x + 0.3097 and R² = 0.4628.]
2.5.3a Interpretation of the results - Example 2
• Permeability (md) coefficient (β̂₁ = 0.0017): each unit increase in permeability adds 0.0017 to the RQI value when all other variables are fixed.
• β̂₁ > 0 (positive relationship): RQI increases with an increase in permeability.
• Intercept coefficient (β̂₀ = 0.309): the value of RQI when permeability equals zero.
• R Square = 0.462837: the model explains 46% of the total variability in the RQI values around its mean.
• P-value < 0.05: the regression is significant.
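Using the fitted coefficients from the Example 2 output, a prediction at an illustrative permeability of 1000 md (the helper name and the input value are ours, for illustration only):

```python
b0, b1 = 0.309739, 0.00171   # intercept and slope from the Excel output

def predict_rqi(permeability_md):
    """Fitted RQI at a given permeability (md)."""
    return b0 + b1 * permeability_md

rqi_at_1000 = predict_rqi(1000.0)   # ~2.02
```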
