
Multiple Linear Regression

We add another coefficient b0 (called the intercept or constant) to the regression formula, so it becomes:

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + … + bn Xn

where b0 is the intercept (constant) and b1 through bn are the coefficients (slopes) of the predictors.

BA360: Data Mining, Montassar Ben Messaoud, PhD


Ordinary Least Squares (OLS)

The best-fitting model is the one that achieves:

Min Sum (Y − Ŷ)²

How do we evaluate the prediction power of regression models?

R-squared is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variables.

R-squared = Explained variation / Total variation

R-squared is always between 0% and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all of the variability of the response data around its mean.

Total Sum of Squares: SS Total = Sum (Y − Ȳ)²


Adjusted R-Squared

SS Total = Sum (Y − Ȳ)²
SS Residuals = Sum (Y − Ŷ)²

R² = 1 − SSres / SStot

R² adjusted = 1 − [(1 − R²)(N − 1)] / (N − P − 1)

where:
R² = sample R-squared
P = number of predictors
N = total sample size
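As a sketch, these formulas can be computed directly in pure Python; the observed values and predictions below are assumed toy numbers:

```python
# Toy example (assumed numbers): observed Y and model predictions Y-hat.
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.8, 5.2, 7.1, 8.9]

y_bar = sum(y) / len(y)

ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # SS Total = Sum (Y - Ybar)^2
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # SS Residuals = Sum (Y - Yhat)^2

r2 = 1 - ss_res / ss_tot

n, p = len(y), 1  # sample size N and number of predictors P
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Note that R² adjusted is always at most R², and the gap widens as more predictors P are added for a fixed sample size N.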

Verifying significance of predictors

Your sales volume = Y. You need to examine the effect of these predictors on this volume:

▪ Price = X1
▪ Car model = X2
▪ Production year = X3
▪ Mileage = X4

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4


The p-value is a statistical value used to determine the significance of a predictor (X).

If the p-value is less than or equal to 0.05, the predictor is significant and we can keep it in our regression model.

If the p-value is greater than 0.05, the predictor is insignificant and we should remove it from the regression model.
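This 0.05 rule can be sketched as a simple filter; the p-values below are made-up numbers for illustration, not real regression output:

```python
# Hypothetical p-values for each predictor (assumed, for illustration only).
p_values = {"Price": 0.003, "Car model": 0.21, "Production year": 0.04, "Mileage": 0.62}

ALPHA = 0.05  # significance threshold from the rule above

significant   = [x for x, p in p_values.items() if p <= ALPHA]
insignificant = [x for x, p in p_values.items() if p > ALPHA]

print("keep:", significant)     # predictors retained in the model
print("drop:", insignificant)   # predictors removed from the model
```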


Simple Linear Regression

Model with one predictor variable:

Y = b0 + b1 X + residual

where b0 + b1 X is the explained part of Y, and the residual is the unexplained part of Y (the error).

The fitted line is a straight line, since the model is of 1st order:

Ŷ = b0 + b1 X
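For one predictor, the OLS coefficients that minimize Sum (Y − Ŷ)² have a closed form: b1 = Sxy / Sxx and b0 = Ȳ − b1·X̄. A minimal pure-Python sketch on assumed toy data:

```python
# Toy data (assumed): y = 2x exactly, so OLS should recover b0 = 0, b1 = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Sxy = Sum (X - Xbar)(Y - Ybar),  Sxx = Sum (X - Xbar)^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

y_hat = [b0 + b1 * xi for xi in x]  # fitted line: Y-hat = b0 + b1 X
```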

Multiple Linear Regression

Ŷ = b0 + b1 X1 + b2 X2 + … + bn Xn

Dummy Variables


How do we deal with categorical variables in regression models?

1. R&D: Numeric
2. Administration: Numeric
3. Advertising: Numeric
4. City: Categorical (Nominal)
5. Profit: Numeric

Step 1: Create a separate dummy (0/1) column for each category of the categorical variable.

Step 2: No longer use the original categorical variable in the regression model.

Step 3 (Very Important): One of the dummy variables should be skipped in the regression model.

Multiple Regression Model

Y = Profit
X1 = R&D
X2 = Administration
X3 = Advertising
X4 = Paris
X5 = Rome
X6 = London
X7 = Frankfurt

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3

With the city dummies added (one of them, Frankfurt, skipped):

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4 + b5 X5 + b6 X6

There is a process that should be followed to keep:
1- ONLY significant variables
2- The variables that IMPROVE the regression model
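Steps 1 to 3 can be sketched as a small encoder: each city gets a 0/1 column, and one baseline category (Frankfurt here, an arbitrary choice for illustration) is skipped:

```python
def dummy_encode(value, categories, baseline):
    """One-hot encode `value`, skipping the baseline category (Step 3)."""
    return {c: int(value == c) for c in categories if c != baseline}

cities = ["Paris", "Rome", "London", "Frankfurt"]

# A row whose City is "Rome" becomes three dummy columns (X4, X5, X6);
# the baseline "Frankfurt" is represented by all three dummies being 0.
row_rome = dummy_encode("Rome", cities, baseline="Frankfurt")
row_frankfurt = dummy_encode("Frankfurt", cities, baseline="Frankfurt")
```

Skipping one category avoids perfectly collinear dummy columns (they would always sum to 1, duplicating the intercept).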

The stepwise regression approach is divided into two main methods:
1- Forward selection
2- Backward elimination

Forward Selection
1- Start with no variables in the model.
2- Test the addition of each variable: find out if it is significant by checking its p-value, and whether it improves the model by checking R-squared.
3- If yes, keep the variable in the model and try the next variable; if not, remove it and try the next variable.
4- Keep repeating this process until no remaining variable improves the model.
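The forward-selection loop above can be sketched as follows. The adjusted R-squared scores here are made-up numbers standing in for actually refitting the regression at each step:

```python
# Hypothetical adjusted R-squared for every candidate model (assumed numbers,
# standing in for real regression refits at each step).
scores = {
    frozenset():                   0.00,
    frozenset({"X1"}):             0.62,
    frozenset({"X2"}):             0.41,
    frozenset({"X3"}):             0.55,
    frozenset({"X1", "X2"}):       0.63,
    frozenset({"X1", "X3"}):       0.71,
    frozenset({"X2", "X3"}):       0.58,
    frozenset({"X1", "X2", "X3"}): 0.70,
}

candidates = {"X1", "X2", "X3"}
model = frozenset()  # step 1: start with no variables

while True:
    # Step 2: test the addition of each remaining variable.
    trials = {model | {v}: scores[model | {v}] for v in candidates - model}
    best = max(trials, key=trials.get)
    if trials[best] > scores[model]:  # step 3: keep it only if the model improves
        model = best
    else:                             # step 4: stop when nothing improves the model
        break
```

With these assumed scores the loop adds X1, then X3, and stops because adding X2 would lower the adjusted R-squared.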


Example

At each step, find out if a variable is significant or not by checking its p-value (p ≤ 0.05), R-squared (0 < R² < 1), and adjusted R-squared.

Ŷ = b0 + b1 X1 → X1 is significant: keep it.
Ŷ = b0 + b1 X1 + b2 X2 → X2 is insignificant: remove it.
Example (continued)

Ŷ = b0 + b1 X1 + b3 X3 → X3 is significant: keep it.

Final model: Ŷ = b0 + b1 X1 + b3 X3

Backward Elimination
1- Start with ALL candidate variables in the model.
2- Remove the insignificant variable that has the highest p-value from the model and run the regression again.
3- Keep repeating this process until all insignificant variables are removed from the model, keeping only the ones that improve the model.
4- Use the model for production.
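The backward-elimination loop can be sketched in the same spirit. The p-values after each refit are made-up numbers standing in for real regression runs:

```python
# Hypothetical p-values after each refit (assumed, standing in for real runs).
p_values_by_model = {
    frozenset({"X1", "X2", "X3", "X4"}): {"X1": 0.001, "X2": 0.30, "X3": 0.04, "X4": 0.72},
    frozenset({"X1", "X2", "X3"}):       {"X1": 0.001, "X2": 0.25, "X3": 0.03},
    frozenset({"X1", "X3"}):             {"X1": 0.0005, "X3": 0.02},
}

ALPHA = 0.05
model = frozenset({"X1", "X2", "X3", "X4"})  # step 1: start with ALL candidates

while True:
    p_values = p_values_by_model[model]      # "run the regression again"
    worst = max(p_values, key=p_values.get)  # variable with the highest p-value
    if p_values[worst] > ALPHA:              # step 2: remove it if insignificant
        model = model - {worst}
    else:                                    # step 3: stop when all are significant
        break
```

With these assumed p-values the loop drops X4 (p = 0.72), then X2 (p = 0.25), and stops once every remaining predictor is significant.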


Example

Find out if a variable is significant or not by checking its p-value (p ≤ 0.05), and eliminate the insignificant variable that has the greatest p-value, while monitoring R-squared (0.0 < R-Sq < 1) and adjusted R-squared.

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3 → X2 is insignificant with the greatest p-value: eliminate it.
Ŷ = b0 + b1 X1 + b3 X3 → all remaining predictors are significant.

Final model: Ŷ = b0 + b1 X1 + b3 X3

Assumptions of Multiple Linear Regression

1/ Linearity

The relationship between the independent variables and the dependent one needs to be linear.
Example of non-linearity

2/ Multivariate Normality




• ALL variables have to be multivariate normal.

• This assumption can be checked with a histogram or a Q-Q plot.

• When the data is not normally distributed, a non-linear transformation (e.g., a log transformation) might fix this issue.
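As a small illustration of that fix, a log transformation can pull a right-skewed variable toward symmetry. The data below are assumed toy values, and skewness is measured as the standardized third central moment:

```python
import math

def skewness(data):
    """Sample skewness: standardized third central moment."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Right-skewed toy data (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64, 128]
transformed = [math.log2(x) for x in raw]  # log-transform: [0.0, 1.0, ..., 7.0]

print(skewness(raw), skewness(transformed))  # strongly positive vs. about zero
```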

3/ No or little multicollinearity

Multicollinearity occurs when the independent variables are too highly correlated with each other.

Ŷ = b1 X1 + b2 X2 + b3 X3

Multicollinearity may be tested using the correlation matrix (here over Price, Car Model, Year, and Mileage). When considering the correlation among all independent variables, the coefficients need to be smaller than 1:

Greater than 0.5 is high.
Less than 0.3 might be acceptable.
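A pairwise Pearson correlation (one cell of such a correlation matrix) can be sketched in pure Python; the car columns below are assumed toy numbers:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Toy columns (assumed): price moves exactly with year, inversely with mileage.
year    = [2015, 2016, 2017, 2018]
price   = [10.0, 12.0, 14.0, 16.0]  # in thousands
mileage = [90, 70, 80, 50]          # in thousands of km

r_year = pearson(price, year)        # perfectly correlated: 1.0
r_mileage = pearson(price, mileage)  # negative: higher mileage, lower price
```

A correlation like r_year = 1.0 between two *independent* variables would signal severe multicollinearity; the thresholds above apply to the absolute value of the coefficient.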


4/ No autocorrelation between residuals

Autocorrelation occurs when the residuals are not independent from each other.

5/ Homoscedasticity

The residuals have equal variance across the regression line.
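One common check for residual autocorrelation is the Durbin-Watson statistic (values near 2 suggest no autocorrelation). A sketch on assumed toy residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no autocorrelation,
    toward 0 positive autocorrelation, toward 4 negative."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Toy residual sequences (assumed).
trending    = [1.0, 1.2, 1.1, 0.9, -0.8, -1.0, -1.1, -0.9]  # neighbors move together
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbors flip sign

dw_trend = durbin_watson(trending)   # well below 2: positive autocorrelation
dw_alt = durbin_watson(alternating)  # near 4: negative autocorrelation
```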


The following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic).

Polynomial Regression
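Polynomial regression is still linear regression: the predictors are powers of X (here X and X², a 2nd-order model), and the coefficients come from the OLS normal equations (XᵀX)b = Xᵀy. A minimal sketch on assumed toy data; the tiny solver skips pivoting, which is fine for this small well-behaved system:

```python
def solve3(m, v):
    """Solve a 3x3 linear system m b = v by Gaussian elimination (no pivoting)."""
    m = [row[:] for row in m]
    v = v[:]
    for i in range(3):                       # forward elimination
        for j in range(i + 1, 3):
            f = m[j][i] / m[i][i]
            m[j] = [a - f * b for a, b in zip(m[j], m[i])]
            v[j] -= f * v[i]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (v[i] - sum(m[i][k] * b[k] for k in range(i + 1, 3))) / m[i][i]
    return b

# Toy data generated from y = 1 + 2x + 3x^2 (assumed), so the fit should
# recover b0 = 1, b1 = 2, b2 = 3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

# Normal equations (X^T X) b = X^T y with design-matrix columns [1, x, x^2].
sx = [sum(x ** k for x in xs) for k in range(5)]       # sums of powers of x
xtx = [[sx[i + j] for j in range(3)] for i in range(3)]
xty = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]

b0, b1, b2 = solve3(xtx, xty)  # fitted curve: Y-hat = b0 + b1 X + b2 X^2
```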




