
Multiple Linear Regression

We add another coefficient b0 (called the intercept or constant) to the regression formula, so it becomes:

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + … + bn Xn

where b0 is the intercept (constant) and b1 through bn are the coefficients (slopes) of the predictors.

BA360: Data Mining, Montassar Ben Messaoud, PhD


Ordinary Least Squares (OLS)

The best-fitting model is the one that achieves:

Min Sum (Y − Ŷ)²

How do we evaluate the prediction power of regression models?

R-squared is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variables.

R-squared = Explained variation / Total variation

R-squared is always between 0% and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all of the variability of the response data around its mean.

Total Sum of Squares: SS Total = Sum (Y − Ȳ)²


Adjusted R-Squared

SS Total = Sum (Y − Ȳ)²
SS Residuals = Sum (Y − Ŷ)²

R² = 1 − SSres / SStot

R² adjusted = 1 − [(1 − R²)(N − 1)] / (N − P − 1)

where:
R² = sample R-squared
P = number of predictors
N = total sample size
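As a sketch, these formulas can be computed directly in pure Python; the observed values and predictions below are assumed toy numbers:

```python
# Toy example (assumed numbers): observed Y and model predictions Y-hat.
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.8, 5.2, 7.1, 8.9]

y_bar = sum(y) / len(y)

ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # SS Total = Sum (Y - Ybar)^2
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # SS Residuals = Sum (Y - Yhat)^2

r2 = 1 - ss_res / ss_tot

n, p = len(y), 1  # sample size N and number of predictors P
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Note that R² adjusted is always at most R², and the gap widens as more predictors P are added for a fixed sample size N.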

Verifying significance of predictors

Your sales volume = Y. You need to examine the effect of these predictors on this volume:

▪ Price = X1
▪ Car model = X2
▪ Production year = X3
▪ Mileage = X4

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4


The p-value is a statistical value used to determine the significance of a predictor (X).

If the p-value is less than or equal to 0.05, the predictor is significant and we can keep it in our regression model.

If the p-value is greater than 0.05, the predictor is insignificant and we should remove it from the regression model.
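This 0.05 rule can be sketched as a simple filter; the p-values below are made-up numbers for illustration, not real regression output:

```python
# Hypothetical p-values for each predictor (assumed, for illustration only).
p_values = {"Price": 0.003, "Car model": 0.21, "Production year": 0.04, "Mileage": 0.62}

ALPHA = 0.05  # significance threshold from the rule above

significant   = [x for x, p in p_values.items() if p <= ALPHA]
insignificant = [x for x, p in p_values.items() if p > ALPHA]

print("keep:", significant)     # predictors retained in the model
print("drop:", insignificant)   # predictors removed from the model
```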


Simple Linear Regression

Model with one predictor variable:

Y = b0 + b1 X + residual

where b0 + b1 X is the explained part of Y, and the residual is the unexplained part of Y (the error).

The fitted line is a straight line, since the model is of 1st order:

Ŷ = b0 + b1 X
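For one predictor, the OLS coefficients that minimize Sum (Y − Ŷ)² have a closed form: b1 = Sxy / Sxx and b0 = Ȳ − b1·X̄. A minimal pure-Python sketch on assumed toy data:

```python
# Toy data (assumed): y = 2x exactly, so OLS should recover b0 = 0, b1 = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Sxy = Sum (X - Xbar)(Y - Ybar),  Sxx = Sum (X - Xbar)^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

y_hat = [b0 + b1 * xi for xi in x]  # fitted line: Y-hat = b0 + b1 X
```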

Multiple Linear Regression

Ŷ = b0 + b1 X1 + b2 X2 + … + bn Xn

Dummy Variables


How do we deal with categorical variables in regression models?

1. R&D: Numeric
2. Administration: Numeric
3. Advertising: Numeric
4. City: Categorical (Nominal)
5. Profit: Numeric

Step 1: Create a separate dummy (0/1) column for each category of the categorical variable.

Step 2: No longer use the original categorical variable in the regression model.

Step 3 (Very Important): One of the dummy variables should be skipped in the regression model.

Multiple Regression Model

Y = Profit
X1 = R&D
X2 = Administration
X3 = Advertising
X4 = Paris
X5 = Rome
X6 = London
X7 = Frankfurt

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3

With the city dummies added (one of them, Frankfurt, skipped):

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4 + b5 X5 + b6 X6

There is a process that should be followed to keep:
1- ONLY significant variables
2- The variables that IMPROVE the regression model
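Steps 1 to 3 can be sketched as a small encoder: each city gets a 0/1 column, and one baseline category (Frankfurt here, an arbitrary choice for illustration) is skipped:

```python
def dummy_encode(value, categories, baseline):
    """One-hot encode `value`, skipping the baseline category (Step 3)."""
    return {c: int(value == c) for c in categories if c != baseline}

cities = ["Paris", "Rome", "London", "Frankfurt"]

# A row whose City is "Rome" becomes three dummy columns (X4, X5, X6);
# the baseline "Frankfurt" is represented by all three dummies being 0.
row_rome = dummy_encode("Rome", cities, baseline="Frankfurt")
row_frankfurt = dummy_encode("Frankfurt", cities, baseline="Frankfurt")
```

Skipping one category avoids perfectly collinear dummy columns (they would always sum to 1, duplicating the intercept).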

The stepwise regression approach is divided into two main methods:
1- Forward selection
2- Backward elimination

Forward Selection
1- Start with no variables in the model.
2- Test the addition of each variable: find out if it is significant by checking its p-value, and whether it improves the model by checking R-squared.
3- If yes, keep the variable in the model and try the next variable; if not, remove it and try the next variable.
4- Keep repeating this process until no remaining variable improves the model.
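The forward-selection loop above can be sketched as follows. The adjusted R-squared scores here are made-up numbers standing in for actually refitting the regression at each step:

```python
# Hypothetical adjusted R-squared for every candidate model (assumed numbers,
# standing in for real regression refits at each step).
scores = {
    frozenset():                   0.00,
    frozenset({"X1"}):             0.62,
    frozenset({"X2"}):             0.41,
    frozenset({"X3"}):             0.55,
    frozenset({"X1", "X2"}):       0.63,
    frozenset({"X1", "X3"}):       0.71,
    frozenset({"X2", "X3"}):       0.58,
    frozenset({"X1", "X2", "X3"}): 0.70,
}

candidates = {"X1", "X2", "X3"}
model = frozenset()  # step 1: start with no variables

while True:
    # Step 2: test the addition of each remaining variable.
    trials = {model | {v}: scores[model | {v}] for v in candidates - model}
    best = max(trials, key=trials.get)
    if trials[best] > scores[model]:  # step 3: keep it only if the model improves
        model = best
    else:                             # step 4: stop when nothing improves the model
        break
```

With these assumed scores the loop adds X1, then X3, and stops because adding X2 would lower the adjusted R-squared.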


Example

At each step, find out if a variable is significant or not by checking its p-value (p ≤ 0.05), R-squared (0 < R² < 1), and adjusted R-squared.

Ŷ = b0 + b1 X1 → X1 is significant: keep it.
Ŷ = b0 + b1 X1 + b2 X2 → X2 is insignificant: remove it.
Example (continued)

Ŷ = b0 + b1 X1 + b3 X3 → X3 is significant: keep it.

Final model: Ŷ = b0 + b1 X1 + b3 X3

Backward Elimination
1- Start with ALL candidate variables in the model.
2- Remove the insignificant variable that has the highest p-value from the model and run the regression again.
3- Keep repeating this process until all insignificant variables are removed from the model, keeping only the ones that improve the model.
4- Use the model for production.
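The backward-elimination loop can be sketched in the same spirit. The p-values after each refit are made-up numbers standing in for real regression runs:

```python
# Hypothetical p-values after each refit (assumed, standing in for real runs).
p_values_by_model = {
    frozenset({"X1", "X2", "X3", "X4"}): {"X1": 0.001, "X2": 0.30, "X3": 0.04, "X4": 0.72},
    frozenset({"X1", "X2", "X3"}):       {"X1": 0.001, "X2": 0.25, "X3": 0.03},
    frozenset({"X1", "X3"}):             {"X1": 0.0005, "X3": 0.02},
}

ALPHA = 0.05
model = frozenset({"X1", "X2", "X3", "X4"})  # step 1: start with ALL candidates

while True:
    p_values = p_values_by_model[model]      # "run the regression again"
    worst = max(p_values, key=p_values.get)  # variable with the highest p-value
    if p_values[worst] > ALPHA:              # step 2: remove it if insignificant
        model = model - {worst}
    else:                                    # step 3: stop when all are significant
        break
```

With these assumed p-values the loop drops X4 (p = 0.72), then X2 (p = 0.25), and stops once every remaining predictor is significant.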


Example

Find out if a variable is significant or not by checking its p-value (p ≤ 0.05), and eliminate the insignificant variable that has the greatest p-value, while monitoring R-squared (0.0 < R-Sq < 1) and adjusted R-squared.

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 → X2 is insignificant with the greatest p-value: eliminate it.
Ŷ = b0 + b1 X1 + b3 X3 → all remaining predictors are significant.

Final model: Ŷ = b0 + b1 X1 + b3 X3

Assumptions of Multiple Linear Regression

1/ Linearity

The relationship between the independent variables and the dependent one needs to be linear.
Example of non-linearity

2/ Multivariate Normality




• ALL variables have to be multivariate normal.

• This assumption can be checked with a histogram or a Q-Q plot.

• When the data is not normally distributed, a non-linear transformation (e.g., a log transformation) might fix this issue.
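As a small illustration of that fix, a log transformation can pull a right-skewed variable toward symmetry. The data below are assumed toy values, and skewness is measured as the standardized third central moment:

```python
import math

def skewness(data):
    """Sample skewness: standardized third central moment."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Right-skewed toy data (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64, 128]
transformed = [math.log2(x) for x in raw]  # log-transform: [0.0, 1.0, ..., 7.0]

print(skewness(raw), skewness(transformed))  # strongly positive vs. about zero
```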

3/ No or little multicollinearity

Multicollinearity occurs when the independent variables are too highly correlated with each other.

Ŷ = b1 X1 + b2 X2 + b3 X3

Multicollinearity may be tested using the correlation matrix (here over Price, Car Model, Year, and Mileage). When considering the correlation among all independent variables, the coefficients need to be smaller than 1:

Greater than 0.5 is high.
Less than 0.3 might be acceptable.
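A pairwise Pearson correlation (one cell of such a correlation matrix) can be sketched in pure Python; the car columns below are assumed toy numbers:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Toy columns (assumed): price moves exactly with year, inversely with mileage.
year    = [2015, 2016, 2017, 2018]
price   = [10.0, 12.0, 14.0, 16.0]  # in thousands
mileage = [90, 70, 80, 50]          # in thousands of km

r_year = pearson(price, year)        # perfectly correlated: 1.0
r_mileage = pearson(price, mileage)  # negative: higher mileage, lower price
```

A correlation like r_year = 1.0 between two *independent* variables would signal severe multicollinearity; the thresholds above apply to the absolute value of the coefficient.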


4/ No autocorrelation between residuals

Autocorrelation occurs when the residuals are not independent from each other.

5/ Homoscedasticity

The residuals have equal variance across the regression line.
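One common check for residual autocorrelation is the Durbin-Watson statistic (values near 2 suggest no autocorrelation). A sketch on assumed toy residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no autocorrelation,
    toward 0 positive autocorrelation, toward 4 negative."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Toy residual sequences (assumed).
trending    = [1.0, 1.2, 1.1, 0.9, -0.8, -1.0, -1.1, -0.9]  # neighbors move together
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbors flip sign

dw_trend = durbin_watson(trending)   # well below 2: positive autocorrelation
dw_alt = durbin_watson(alternating)  # near 4: negative autocorrelation
```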


The following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic).

Polynomial Regression
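Polynomial regression is still linear regression: the predictors are powers of X (here X and X², a 2nd-order model), and the coefficients come from the OLS normal equations (XᵀX)b = Xᵀy. A minimal sketch on assumed toy data; the tiny solver skips pivoting, which is fine for this small well-behaved system:

```python
def solve3(m, v):
    """Solve a 3x3 linear system m b = v by Gaussian elimination (no pivoting)."""
    m = [row[:] for row in m]
    v = v[:]
    for i in range(3):                       # forward elimination
        for j in range(i + 1, 3):
            f = m[j][i] / m[i][i]
            m[j] = [a - f * b for a, b in zip(m[j], m[i])]
            v[j] -= f * v[i]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (v[i] - sum(m[i][k] * b[k] for k in range(i + 1, 3))) / m[i][i]
    return b

# Toy data generated from y = 1 + 2x + 3x^2 (assumed), so the fit should
# recover b0 = 1, b1 = 2, b2 = 3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

# Normal equations (X^T X) b = X^T y with design-matrix columns [1, x, x^2].
sx = [sum(x ** k for x in xs) for k in range(5)]       # sums of powers of x
xtx = [[sx[i + j] for j in range(3)] for i in range(3)]
xty = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]

b0, b1, b2 = solve3(xtx, xty)  # fitted curve: Y-hat = b0 + b1 X + b2 X^2
```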




