
FEM 2063 - Data Analytics

Chapter 3
At the end of this chapter
students should be able to
understand

Multiple Linear Regression

Overview

• 3.1 Background
• 3.2 Multiple Linear Regression (MLR)
• 3.3 Software Output
• 3.4 ANOVA
• 3.5 Model Evaluation
• 3.6 Application/Examples

3.1 Background
Simple regression considers the relation between a single independent variable and a dependent variable.

Multiple regression simultaneously considers the influence of multiple independent variables on a dependent variable Y.

3.1 Background
• A simple regression model fits a regression line in 2-dimensional space.
• A multiple regression model with two independent variables fits a regression plane in 3-dimensional space.

3.2 Multiple Linear Regression

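In simple regression, the coefficients are estimated by minimizing the sum of squared errors, $SSE = \sum_i (y_i - \hat{y}_i)^2$, to derive this model:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

In multiple regression, the estimates of the slope coefficients are again derived by minimizing SSE, which gives this multiple regression model:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_k x_k$
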
3.2 Multiple Linear Regression
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression extends simple linear (OLS) regression, which uses just one explanatory variable.

3.2 Multiple Linear Regression
• The value of the dependent variable $y_i$ is modeled as

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$

• The dependent variable is related to k independent variables.
• As in SLR, the parameters of MLR ($\beta_0, \beta_1, \dots, \beta_k$) are also estimated using the method of least squares.
• However, it would be tedious to find these values by hand, so we use the computer to handle the computations.

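As a rough illustration of letting the computer do the least-squares work, the sketch below fits an MLR model with NumPy. The data are hypothetical and noise-free (generated from y = 1 + 2*x1 - 3*x2), chosen only so that the recovered coefficients are easy to check; the variable names are assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2,
# so the least-squares estimates should recover (1, 2, -3).
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones (intercept) followed by the k = 2 predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: choose beta_hat to minimise SSE = sum((y - X @ beta_hat)**2).
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sse = float(residuals @ residuals)
print("beta_hat:", np.round(beta_hat, 6))
print("SSE:", round(sse, 10))
```

With real, noisy data the SSE would be positive and the estimates would only approximate the underlying coefficients.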
3.2 Multiple Linear Regression (MLR)
Model

© 2019 Petroliam Nasional Berhad (PETRONAS) | 8

3.2 What is multiple
regression analysis used for?
Multiple regression analysis allows researchers to assess
the strength of the relationship between an outcome (the
dependent variable) and several predictor variables as well
as the importance of each of the predictors to the
relationship, often with the effect of other predictors
statistically eliminated.

3.2 Assumptions of Multiple
Linear Regression

3.2 Multiple Linear Regression
(MLR) – Types of Regression

3.3 Software Output - sample

The software (Excel) output

Part 3. Reg Statistics

Part 2. ANOVA

Part 1. Regression
analysis

3.4 Testing the significance of
Regression – t and F tests
Use of F-Test
Use of t-Test



3.4 Testing the significance
of Regression - SSE

Standard error of the regression (S): represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average, in the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line.


3.5 Model Evaluation - (i)
Standard error of estimate (s)
• Compute the standard error of estimate from

$\hat{\sigma}^2 = \frac{SSE}{n - k - 1}, \qquad s = \sqrt{\hat{\sigma}^2}$

• $\hat{\sigma}^2$ is an unbiased estimator of the population error variance $\sigma^2$.
• The smaller the SSE, the more successful the Multiple Linear Regression model is in explaining y.
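The formula can be checked numerically. The sketch below uses the residual sum of squares, n and k from the Wire Bond Pull Strength example later in this chapter; the variable names are illustrative only.

```python
import math

# Values copied from the ANOVA table of the Wire Bond Pull Strength example.
sse = 115.1734828   # residual sum of squares (SSE)
n = 25              # number of observations
k = 2               # number of independent variables

mse = sse / (n - k - 1)   # sigma_hat^2, the estimate of the error variance
s = math.sqrt(mse)        # standard error of estimate

print(f"sigma_hat^2 = {mse:.6f}")
print(f"s = {s:.6f}")
```

The result should agree with the "Standard Error" line of the Excel output for that example (about 2.288).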

3.5 Model Evaluation – (ii)
Coefficient of Determination
• Coefficient of determination:

$R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

• $R^2$ is the proportion of variability in the observed dependent variable that is explained by the MLR model.
• It measures the strength of the linear relationship between the dependent variable and the independent variables.
• The greater the $R^2$, the more successful the MLR model.
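To see that the three expressions for the coefficient of determination agree, the sketch below evaluates each of them with the sums of squares from the Wire Bond Pull Strength ANOVA table later in this chapter. The variable names are illustrative.

```python
# Sums of squares copied from the wire bond ANOVA table.
ssr = 5990.771221   # regression sum of squares
sse = 115.1734828   # residual (error) sum of squares
sst = 6105.944704   # total sum of squares (SSR + SSE)

r2_from_difference = (sst - sse) / sst
r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst

print(f"(SST - SSE)/SST = {r2_from_difference:.6f}")
print(f"SSR/SST         = {r2_from_ssr:.6f}")
print(f"1 - SSE/SST     = {r2_from_sse:.6f}")
```

All three values should match the "R Square" entry of the Excel output (about 0.9811) up to rounding in the table.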

3.5 Model Evaluation – (iii) The
hypothesis test of the slope (t-test)

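• The t-test addresses whether an adequate relationship between $x_i$ and y exists.
• Test the hypotheses

H0: $\beta_i = 0$ (no relationship between $x_i$ and y)
H1: $\beta_i \neq 0$ (there is a relationship between $x_i$ and y)

• Test statistic (t-distribution):

$T = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2 / s_{xx}}} = \frac{\hat{\beta}_i - \beta_i}{se(\hat{\beta}_i)}$

where $\beta_i = 0$ under H0; the first form is the single-predictor case.

• Critical region: $|T| > t_{\alpha/2,\, n-k-1}$.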
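A small sketch of this test, using the slope estimates and standard errors from the Wire Bond Pull Strength output later in this chapter. The labels "Length" and "Height" for X Variable 1 and X Variable 2 are taken from the fitted equation in that example, and the critical value 2.074 is the table value quoted there rather than something computed here.

```python
# (estimate, standard error) pairs copied from the wire bond regression output.
# Assumed mapping: X Variable 1 -> Length, X Variable 2 -> Height.
slopes = {
    "Length": (2.744269643, 0.093523844),
    "Height": (0.012527811, 0.002798419),
}
t_critical = 2.074  # t_{0.025, 22}, the table value quoted in the example

t_stats = {}
reject_h0 = {}
for name, (estimate, std_error) in slopes.items():
    # Under H0 the hypothesised slope is 0.
    t_stats[name] = (estimate - 0.0) / std_error
    reject_h0[name] = abs(t_stats[name]) > t_critical
    print(f"{name}: T = {t_stats[name]:.3f}, reject H0: {reject_h0[name]}")
```

The computed values should reproduce the "t Stat" column of the Excel output.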
3.5 Model Evaluation – (iv) The
hypothesis test (p-value)
The p-value conveys information about the weight of
evidence against H0. The smaller the p–value, the greater the
evidence against H0.
When the p-value is small enough (below the chosen significance level α), we reject the null hypothesis H0.
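The decision rule can be written in a few lines. The sketch below applies it at α = 0.05 to the slope p-values reported in the hydraulic-calibration (RQI) Excel output later in this chapter; the dictionary layout is only an illustration.

```python
alpha = 0.05  # chosen significance level

# Slope p-values copied from the RQI regression output (full model).
p_values = {
    "Depth": 8.25e-05,
    "Gain Density": 0.486024,
    "Porosity": 0.000183,
    "Permeability": 6.43e-66,
}

# Reject H0 (coefficient = 0) when the p-value is below alpha.
reject_h0 = {name: p < alpha for name, p in p_values.items()}

for name, decision in reject_h0.items():
    print(f"{name}: p = {p_values[name]:.3g}, reject H0: {decision}")
```

Only Gain Density fails to reach significance here, which is the observation used in the interpretation of that example.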



3.5 Model Evaluation – (iii) The
hypothesis test of the slope (t-test)

The t-test is used for inference on an individual regression coefficient.

3.5 Model Evaluation – (iii) Testing the
significance of regression (F-test)

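Hypotheses:

H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
H1: $\beta_j \neq 0$ for at least one j

Test statistic:

$F_0 = \frac{MSR}{MSE}$, where $MSR = \frac{SSR}{k}$ and $MSE = \frac{SSE}{n - k - 1}$

Rejection criterion:

$F_0 = \frac{MSR}{MSE} > f_{\alpha,\, k,\, n-k-1}$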
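The F statistic can be reproduced from the sums of squares alone. The sketch below uses the values from the Wire Bond Pull Strength ANOVA table later in this chapter; the critical value 3.44 is the table value quoted in that example, not computed here.

```python
# Values copied from the wire bond ANOVA table.
ssr = 5990.771221   # regression sum of squares
sse = 115.1734828   # residual sum of squares
n = 25              # observations
k = 2               # independent variables

msr = ssr / k                # mean square due to regression
mse = sse / (n - k - 1)      # mean square error
f0 = msr / mse               # test statistic

f_critical = 3.44            # f_{0.05, 2, 22}, quoted from the statistical table
reject_h0 = f0 > f_critical

print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F0 = {f0:.2f}")
print("Reject H0:", reject_h0)
```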
3.5 Model Evaluation – (iii) Testing
the significance of regression (F-test)

• The F-test is used for inference on the multiple linear regression model as a whole.

3.6 Application/Examples: Wire Bond Pull Strength Data

Wire Bond Pull Strength Data

I. Estimate the multiple linear regression (MLR) equation.
II. Find the standard error of estimate of this MLR.
III. Determine the coefficient of determination of this MLR.
IV. Test for significance of the slopes at the 5% significance level.
V. Test for significance of the MLR at the 5% significance level.

Wire Bond Pull Strength Data
Regression Statistics
Multiple R           0.990523843
R Square             0.981137483
Adjusted R Square    0.979422709
Standard Error       2.288046833
Observations         25

ANOVA
             df   SS            MS            F          Significance F
Regression    2   5990.771221   2995.385611   572.1672   1.07546E-19
Residual     22   115.1734828   5.235158308
Total        24   6105.944704

               Coefficients   Standard Error   t Stat        P-value    Lower 95%     Upper 95%     Lower 95.0%   Upper 95.0%
Intercept      2.263791434    1.060066238      2.135518851   0.044099   0.065348613   4.462234256   0.06534861    4.462234
X Variable 1   2.744269643    0.093523844      29.34299438   3.91E-19   2.550313061   2.938226226   2.55031306    2.938226
X Variable 2   0.012527811    0.002798419      4.476746229   0.000188   0.006724246   0.018331377   0.00672425    0.018331

(X Variable 1 = wire length; X Variable 2 = die height)

Wire Bond Pull Strength Data

The estimated multiple linear regression equation is

Strength = 2.26 + 2.74 (Length) + 0.0125 (Height)

• Standard error of estimate (s) = 2.288
• Coefficient of determination (R²) = 98.1%
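Once the equation is estimated, it can be used for prediction. The sketch below wraps the fitted equation in a function; the function name and the input values (length 8, height 275) are hypothetical and chosen only to show the arithmetic.

```python
def predict_strength(length, height):
    """Predicted pull strength from the fitted wire bond equation (rounded coefficients)."""
    return 2.26 + 2.74 * length + 0.0125 * height

# Hypothetical inputs, for illustration only.
predicted = predict_strength(8.0, 275.0)
print(f"Predicted strength: {predicted:.4f}")
```

Because the coefficients are rounded, predictions differ slightly from those obtained with the full-precision Excel coefficients.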

Wire Bond Pull Strength Data

H0: $\beta_i = 0$ (no relationship between $x_i$ and y)
H1: $\beta_i \neq 0$ (there is a relationship between $x_i$ and y)

Wire Bond Pull Strength Data

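Test statistic (from the coefficients table of the regression output):

t = 29.34 for wire length and t = 4.48 for die height

Critical value: $t_{\alpha/2,\, n-k-1} = t_{0.025,\, 22} = 2.074$ (from the statistical table)

Conclusion
Since both |t| values exceed 2.074, we reject H0 for each slope and conclude that pull strength is linearly related to wire length and to die height.
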
Wire Bond Pull Strength Data

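Hypotheses:

H0: $\beta_1 = \beta_2 = 0$
H1: $\beta_j \neq 0$ for at least one j

Test statistic: $F_0 = MSR / MSE = 2995.385611 / 5.235158308 = 572.17$
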
Wire Bond Pull Strength Data

Rejection criterion: $F_0 = \frac{MSR}{MSE} > f_{\alpha,\, k,\, n-k-1}$

Let α = 0.05. Since k = 2 and n - k - 1 = 22, we need $f_{0.05,\, 2,\, 22}$.

From the table, $f_{0.05,\, 2,\, 22} = 3.44$.

Conclusion
Since 572.17 > 3.44, we reject H0 and conclude that pull strength is linearly related to wire length, die height, or both.

Example 2 – p-value
Correlation

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).
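As a minimal sketch of how such a measure is computed, the function below implements the Pearson correlation coefficient in plain Python. The function name and the small data sets are hypothetical; they are chosen so that the two variables are perfectly linearly related, once with a positive and once with a negative slope.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y_up rises and y_down falls at a constant rate with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]

print("r (increasing):", pearson_r(x, y_up))
print("r (decreasing):", pearson_r(x, y_down))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.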
Correlation
Excel steps and
outputs
2.5b Application of MLR to Hydraulic-
calibration data

Example: Given data on Depth, Gain Density, Porosity, Permeability and Reservoir Quality Index (RQI), investigate the dependence of RQI (Y) on these four factors: Depth (X1), Gain Density (X2), Porosity (X3) and Permeability (X4). The number of observations is n = 481.

H0: βj = 0 (there is no relationship between x and y)
H1: βj ≠ 0 (the linear model is adequate, i.e. at least one of the βj is not equal to 0)



Excel Results (interpret the results)

Regression Statistics
Multiple R           0.700986
R Square             0.491381
Adjusted R Square    0.487107
Standard Error       0.399695
Observations         481

ANOVA
             df    SS         MS         F          Significance F
Regression     4   73.46689   18.36672   114.9671   1.56E-68
Residual     476   76.04403   0.159756
Total        480   149.5109

                       Coefficients   Standard Error   t Stat     P-value
Intercept              2.175239       1.273593         1.707955   0.088297
Depth (ft)             -0.00012       2.95E-05         -3.97143   8.25E-05
Gain Density (gm/cc)   -0.30763       0.441242         -0.69719   0.486024
Porosity %             -0.00843       0.002236         -3.77155   0.000183
Permeability (md)      0.001762       8.73E-05         20.18564   6.43E-66

Interpretation of the results

o The regression coefficient for Gain Density is not significant (P-value = 0.486 > 0.05).

RQI = 2.175 - 0.00012 (Depth) - 0.30763 (Gain Density) - 0.00843 (Porosity) + 0.00176 (Permeability)

o Given the other predictors, Gain Density has no statistically significant effect on RQI at the 5% level.
o Remove Gain Density and re-run the regression.



Excel Results of MLR after removing Gain Density

Regression Statistics
Multiple R           0.700615
R Square             0.490862
Adjusted R Square    0.48766
Standard Error       0.39948
Observations         481

ANOVA
             df    SS         MS         F          Significance F
Regression     3   73.38924   24.46308   153.2926   1.47E-69
Residual     477   76.12169   0.159584
Total        480   149.5109

                    Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept           1.301679       0.228134         5.705759   2.04E-08   0.853407
Depth (ft)          -0.00011       2.82E-05         -3.94235   9.28E-05   -0.00017
Porosity %          -0.00882       0.002167         -4.06987   5.51E-05   -0.01308
Permeability (md)   0.001764       8.72E-05         20.22245   3.97E-66   0.001593
Interpretation of the results

The R Square value barely changes (0.4914 to 0.4909) and Adjusted R Square rises slightly (0.4871 to 0.4877), which indicates that removing Gain Density does not weaken the model.
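The comparison can be checked with the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - k - 1). That formula is not stated on the slides, so treat it as an assumption about how Excel arrives at its "Adjusted R Square" line; the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 481
full_model = adjusted_r2(0.491381, n, k=4)     # with Gain Density
reduced_model = adjusted_r2(0.490862, n, k=3)  # without Gain Density

print(f"Adjusted R2, full model:    {full_model:.6f}")
print(f"Adjusted R2, reduced model: {reduced_model:.6f}")
print("Reduced model is at least as good:", reduced_model >= full_model)
```

Unlike R Square, the adjusted value penalises extra predictors, so it can increase when a predictor that contributes little is dropped.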

Estimated Regression Equation

RQI = 1.302 - 0.00011 (Depth) - 0.0088 (Porosity) + 0.00176 (Permeability)



Excel Results – please interpret
How linear and nonlinear regression depend on the data
A linear regression equation simply sums the terms, each a parameter multiplied by a predictor (plus a constant). While the model must be linear in the parameters, you can raise an independent variable to a power to fit a curve. For instance, you can include a squared or cubed term. Nonlinear regression models are anything that doesn't follow this one form. R-squared shows how well the chosen linear form fits the data.
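To illustrate "linear in the parameters", the sketch below fits a curve by adding a squared term as an extra column and still using ordinary least squares. The data are hypothetical and noise-free (generated from y = 2 + 0.5x + 1.5x²), so the fit should recover those three coefficients.

```python
import numpy as np

# Hypothetical noise-free data on a curve: y = 2 + 0.5*x + 1.5*x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 2.0 + 0.5 * x + 1.5 * x**2

# The squared term is just another column of the design matrix,
# so the model stays linear in the parameters (b0, b1, b2).
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Estimated (b0, b1, b2):", np.round(beta_hat, 6))
```

A model such as y = a·exp(b·x), by contrast, cannot be written as a sum of parameter-times-term pieces and would need a nonlinear fitting method.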



Linear vs Non-Linear regressions
Types of regression

• Multiple Linear Regression:

• Polynomial Linear Regression:

• Linear Regression:

• Nonlinear Regression:

Linear or nonlinear in parameters



Exercise
A set of experimental runs was made to determine a way of predicting cooking time y at various levels of oven width x1 and temperature x2. The data were recorded as follows:

i. Estimate the multiple linear regression (MLR) equation.
ii. Find the standard error of estimate of this MLR.
iii. Determine the coefficient of determination of this MLR.
iv. Test for significance of the slopes at the 1% significance level.
v. Test for significance of the MLR at the 1% significance level.