Introduction to Regression
• Linear Regression is a Supervised Learning algorithm
• It determines both the nature and strength of the relationship between two variables
• Multiple Linear Regression: Predict Y by considering more than one independent variable
– Estimated y = b0 + b1*x1 + b2*x2
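As a minimal sketch of the equation above, the code below fits y = b0 + b1*x1 + b2*x2 by ordinary least squares on made-up data (all values are invented for illustration; numpy is assumed to be available):

```python
import numpy as np

# Made-up data: y is constructed exactly as y = 1.5 + 2.0*x1 + 0.5*x2,
# so the fit should recover these coefficients.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([3.0, 1.0, 4.0, 2.0, 6.0, 5.0])
y = 1.5 + 2.0 * x1 + 0.5 * x2

# Design matrix with a column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: minimises the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)  # recovers roughly 1.5, 2.0, 0.5
```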
Simple Linear Regression

Marketing Budget (X)    Insurance term amount (Y)
(in lakhs)              (in Crs)
23                      5
64                      15
67                      16
45                      10
19                      4.5
56                      12
65                      15.5
35                      5.6
45                      4.9
24                      5.01
46                      10
76                      15.2
45                      11
38                      7
34                      6
32                      5.5
56                      12

Line Equation: Insurance term = Slope × Marketing budget + Intercept
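The slope and intercept of the line equation can be estimated from the table by least squares; a minimal sketch using the table's 17 data points (numpy assumed available):

```python
import numpy as np

# The (Marketing Budget, Insurance term amount) pairs from the table above.
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24, 46, 76, 45, 38, 34, 32, 56],
             dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01, 10, 15.2, 11, 7, 6, 5.5, 12])

# Closed-form least-squares estimates for the slope and intercept.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(round(slope, 3), round(intercept, 3))  # roughly 0.239 and -1.379
```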
Question
• Which of the following methods do we use to find the best fit line for data in Linear Regression?

Key questions:
• How well does the best-fit line represent the scatter plot?
• How well does the best-fit line predict new data?
Simple Linear Regression – Strength of the Best Fit Line
• R-squared is a good metric for measuring the strength of the best fit line
• R-squared is a statistical measure of how close the data are to the fitted regression line
Example
• Dependent variable: Insurance amount
• Independent variable: Marketing Budget
• RSS = 37.12
• TSS = 292.42
• R-squared = 0.87
• An R-squared of 0.87 indicates that 87% of the variation in ‘insurance amount’ is explained by the independent variable, ‘Marketing Budget’
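The RSS, TSS and R-squared figures above can be reproduced from the table's data, assuming the fitted line is the least-squares line (numpy assumed available):

```python
import numpy as np

# The (Marketing Budget, Insurance term amount) pairs from the earlier table.
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24, 46, 76, 45, 38, 34, 32, 56],
             dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01, 10, 15.2, 11, 7, 6, 5.5, 12])

# Least-squares fit, then the two sums of squares.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_hat = slope * x + intercept

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - rss / tss
print(round(rss, 2), round(tss, 2), round(r2, 2))  # roughly 37.12, 292.42, 0.87
```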
SLR – Equation Interpretation

Interpretation of coefficients (best fit line):
• Insurance_amount = 0.239*Marketing_budget - 1.379
• β0 (intercept) = -1.379
• β1 (slope) = 0.239
• Adjusted R-squared can help guard against overfitting (i.e. including regressors that are not really useful)
• Particularly for small N, and where results are to be generalized, take more note of adjusted R-squared
• Adjusted R-squared is a better metric than R-squared to assess how well the model fits the data
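The adjustment can be computed directly from R-squared: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A sketch using the RSS/TSS values from the earlier example (n = 17 points, p = 1 predictor):

```python
# Adjusted R-squared penalises extra regressors:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
n, p = 17, 1
rss, tss = 37.12, 292.42

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 2), round(adj_r2, 2))  # adjusted value is slightly lower
```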
MLR Framework
Business Objective → Data Preparation → Exploratory Data Analysis → Variable Selection for the model → Final Model

Multicollinearity
• Variance Inflation Factor

Stepwise Variable Selection
• Start with no variables; add variables one by one, checking the p-value at each step
• Say there are three variables in the dataset, X1, X2 and X3, and there is strong correlation between these variables. What problem does this cause?
• What can be a problem if, for example, the brand of a laptop is coded as follows: HP=1, Dell=2, Lenovo=3, Asus=4, Acer=5?
Dummy Variables
• The correct way to encode categorical variables is by using dummy variables
• If a categorical variable has n possible values, create n-1 dummy variables
• For example, if Laptop brand can take the following 5 values: HP, Dell, Lenovo, Asus, Acer; then create 4 dummy variables: Brand_HP, Brand_Dell, Brand_Lenovo, Brand_Asus. Each of these variables can take the value 0 or 1
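A minimal sketch of the n-1 encoding in plain Python (the dropped level, Acer here, becomes the reference category; which level to drop is a free choice):

```python
# n-1 dummy (one-hot) encoding for a categorical variable.
brands = ["HP", "Dell", "Lenovo", "Asus", "Acer"]
dummy_levels = brands[:-1]  # n - 1 = 4 dummy columns; "Acer" is the reference

def encode(brand):
    # Returns the dummy vector (Brand_HP, Brand_Dell, Brand_Lenovo, Brand_Asus).
    return [1 if brand == level else 0 for level in dummy_levels]

print(encode("Dell"))  # [0, 1, 0, 0]
print(encode("Acer"))  # [0, 0, 0, 0] -> all zeros means the reference category
```

In practice a library routine such as pandas' `get_dummies(..., drop_first=True)` does the same thing in one call.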
Why is Multicollinearity a Problem?
• Multicollinearity has no impact on the predictive power of the model as a whole, but it affects the coefficient estimates of the individual predictors
• Parameter estimates may change erratically in response to small changes in the model or the data, making the estimates highly unstable
• This instability increases the variance of the estimates: a small change in the data can produce large changes in the estimated coefficients
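The instability can be demonstrated on made-up data with two nearly collinear predictors: nudging a single response value barely moves the fitted predictions, yet the individual coefficients swing by a much larger amount (all numbers below are invented for illustration):

```python
import numpy as np

# Two nearly collinear predictors: x2 is almost exactly 2*x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + np.array([0.01, -0.02, 0.015, -0.01, 0.02, -0.015])
y = x1 + x2 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])

X = np.column_stack([np.ones_like(x1), x1, x2])
coef_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturb one response value slightly and refit.
y_b = y.copy()
y_b[2] += 0.05
coef_b, *_ = np.linalg.lstsq(X, y_b, rcond=None)

print(coef_a)
print(coef_b)
# The individual coefficients move far more than the 0.05 data change,
# while the fitted values X @ coef move by at most 0.05.
```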
Multicollinearity – Variance Inflation Factor
• The variable with the higher VIF, and/or the variable with lower significance from a business point of view, can be removed
• The VIF of an independent variable is calculated by regressing that variable on the other independent variables
• The VIF threshold depends on the business problem; values greater than 2, 5 or 10 are commonly used as thresholds
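The auxiliary-regression definition above translates directly into code: VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A sketch on made-up data (x2 almost exactly 2*x1, x3 unrelated; numpy assumed available):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the remaining columns (plus an
    intercept) and return 1 / (1 - R^2) of that auxiliary regression."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Made-up data: x1 and x2 are nearly collinear, x3 is unrelated.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = 2 * x1 + np.array([0.1, -0.2, 0.15, -0.1, 0.2, -0.15, 0.1, -0.1])
x3 = np.array([5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0])
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 get very high VIFs
```

Libraries such as statsmodels provide an equivalent `variance_inflation_factor` helper.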
P-value Significance
• Remove variables that have high p-values, in order of their insignificance
• If the final model still contains a large number of variables, the criterion for keeping variables can be tightened, e.g. p < 0.02 or 0.03
Assumptions
Top-1:
• The regression model is linear in the parameters
o For example, the equation y = b0 + b1*x1 + b2*x2 is linear in the parameters b0, b1 and b2