Machine Learning +
Linear Regression
AN INTRODUCTION
Discussion

 Intro to Machine Learning (ML)
   Models: Regression, Classification, Clustering
   Broader Classification of ML Models: Supervised learning, Unsupervised learning
 Linear Regression
   Regression Line
   Best Fit Line
   Strength of a simple Linear Regression
   Assumptions
   Reading & Understanding Data
   Hypothesis Testing
   Building a linear model


Introduction to Machine Learning

ML Model Classification

Based on the task performed & the nature of the output

 Regression: the output variable to be predicted is a continuous variable, e.g., the score of a student in a subject.
 Classification: the output variable to be predicted is a categorical variable, e.g., classifying incoming emails as spam or ham.
 Clustering: no predefined notion of a label is allocated to the groups/clusters formed, e.g., customer segmentation.

Regression

 Previous Year: 12th Marks → BET marks (both known)
 Current Year: 12th Marks → BET marks = ??? (to be predicted)

Classification

Emails     SPAM or HAM

Email-1    SPAM
Email-2    SPAM
Email-3    SPAM
Email-4    HAM
Email-5    HAM
Email-6    SPAM
Email-7    SPAM
Email-8    ???

Clustering

Example: news articles are grouped into clusters such as Sports, Business, and Political.
ML Model Broader Classification

 Supervised Learning
   Regression: continuous labels (e.g., previous year's score of a student in BET)
   Classification: categorical labels (e.g., Spam or Ham)
 Unsupervised Learning
   Clustering: no labels; uses the data itself to create clusters

ML Model Classification contd…

Broader Classification
 Supervised learning: past data with labels is used for building the model (Regression, Classification).
 Unsupervised learning: no predefined labels are assigned to past data (Clustering).


Simple Linear Regression

Linear Regression

 Simple LR: model with 1 independent variable.
 Multiple LR: model with more than 1 independent variable.

Linear Regression

Build a simple LR model.

Objective: finding answers to questions like
1. Is there any relationship between Sales & Marketing?
2. What will be the sales for a future Marketing expenditure?
etc. (A fitting sketch follows below.)
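As an illustration, here is a minimal sketch of fitting such a model in Python. The file advertising.csv and the column names Marketing and Sales are assumptions for illustration, not from the lecture.

```python
# A minimal sketch of fitting Sales against Marketing spend with scikit-learn.
# The CSV file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # assumed file with 'Marketing' & 'Sales'
X = df[["Marketing"]]                        # predictor (independent variable)
y = df["Sales"]                              # response (dependent variable)

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):    ", model.coef_[0])

# Predict sales for a future marketing expenditure, e.g., 250 units
print("Predicted sales:", model.predict(pd.DataFrame({"Marketing": [250]}))[0])
```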

Regression Line

Standard equation of the regression line: Y = β0 + β1X

 X: the independent variable, also called the predictor variable.
 Y: the dependent variable, also called the output variable.

Best Fit Line

 How to find the best-fit line?
 Answer: calculate the residuals
   Use Ordinary Least Squares (OLS)
   Find the Residual Sum of Squares (RSS)
   RSS is the cost function (it needs to be minimized); a worked sketch follows below.
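Below is a small sketch of ordinary least squares for one predictor, using the closed-form formulas; the x and y arrays are illustrative toy data.

```python
# Closed-form OLS for simple linear regression, plus the RSS cost.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)          # vertical distances from the line
rss = np.sum(residuals ** 2)                 # Residual Sum of Squares (the cost)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, RSS={rss:.3f}")
```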

Cost Function

Minimizing the RSS cost function, two routes:
 Differentiation (closed-form solution)
 Gradient Descent: start with initial parameters (β0 & β1) and update them iteratively (a sketch follows below)
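Here is a minimal sketch of gradient descent for the two parameters of simple LR; the learning rate and iteration count are arbitrary choices for illustration, and the toy data matches the OLS sketch above.

```python
# Gradient descent on the (mean) squared-error cost for simple LR.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta0, beta1 = 0.0, 0.0                      # start with initial parameters
lr, n = 0.01, x.size

for _ in range(5000):
    resid = y - (beta0 + beta1 * x)
    # Partial derivatives of the mean squared error (RSS/n) w.r.t. beta0, beta1
    grad0 = -2.0 * resid.sum() / n
    grad1 = -2.0 * (resid * x).sum() / n
    beta0 -= lr * grad0
    beta1 -= lr * grad1

print(f"beta0={beta0:.3f}, beta1={beta1:.3f}")   # converges to the OLS solution
```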

Excel Demo of Simple LR

Strength of Simple Linear Regression

 After determining the best-fit line, there are a few critical questions that need answers:
1. How well does the best-fit line represent the scatter plot?
2. How well does the best-fit line predict new data?

 Both are determined using R² (called the Coefficient of Determination).

RSS & TSS

 TSS gives us the deviation of all the points from the mean line: TSS = Σ(yᵢ − ȳ)².
 RSS gives the deviation of the points from the fitted line: RSS = Σ(yᵢ − ŷᵢ)².

Strength of Simple Linear Regression
R² (Coefficient of Determination)

 R² is a number which explains what portion of the variation in the given data is explained by the developed model.
 It always takes a value between 0 & 1.

R² = 1 − RSS/TSS

where,
RSS = Residual Sum of Squares
TSS = Total Sum of Squares (deviation of the data from the mean)

A computation sketch follows below.
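A quick sketch of computing R² = 1 − RSS/TSS for a fitted simple LR; it reuses the toy data above, with coefficients rounded from the OLS sketch.

```python
# R^2 from RSS and TSS for the toy data fitted earlier.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
beta0, beta1 = 0.27, 1.93                    # rounded OLS coefficients from above

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)               # deviation from the fitted line
tss = np.sum((y - y.mean()) ** 2)            # deviation from the mean line
print("R^2 =", 1 - rss / tss)                # close to 1 => line explains the data well
```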

Physical Significance of R²

R² = 1 − RSS/TSS

Simple Linear Regression

 Model formulation: Y = β0 + β1x + ε
 Best-fit line: minimize the Residual Sum of Squares (RSS)
 Assessing goodness of fit: R² = 1 − RSS/TSS
   Meaning: "how much variation in Y can be explained by the model"
 Make predictions: ŷ = β̂0 + β̂1x

 Question: Is assuming that the output variable depends linearly on the input variable enough to build our LR model?

Simple Linear Regression

Answer: No
 Because you are drawing inferences about the population using a sample.
 The population is much larger than the sample.
 This introduces errors into the model.
 Therefore, it is important to define broader assumptions for the model.

Simple Linear Regression- Assumptions

 Assumption-1: There is a linear relationship between X and Y.

Simple Linear Regression- Assumptions

 Assumption-2: Error terms are normally distributed with mean zero.

 If you just wish to fit a line and not make any further interpretations, then this assumption is not required.

 Question: Why the normal distribution? Why not some other distribution?
 Answer: It is generally observed that error terms follow a normal distribution with mean equal to zero.

Simple Linear Regression- Assumptions

 Assumption-3: Error terms are independent of each other.

Simple Linear Regression- Assumptions

 Assumption-4: Error terms have constant variance (called homoscedasticity).
   The variance should not increase (or decrease) as the error values change.
   The variance should not follow any pattern as the error terms change.

Hypothesis Testing in Simple LR
(significance of the derived beta coefficient)

 ŷ = β̂0 + β̂1x
 Is the beta coefficient significant?
 If you run a LR on a dataset in Python, it will fit a line on the data.
 If a line is fit, then it will have β̂0 & β̂1.
 At this stage, check whether β̂1 is significant or not.
 Start with a hypothesis: β1 is not significant (i.e., there is no relationship between y & x).

Hypothesis Testing in Simple LR
(significance of the derived beta coefficient)

 Null hypothesis, H0: β1 = 0
 Alternate hypothesis, H1: β1 ≠ 0
 Fail to reject the null hypothesis: β1 is zero (i.e., insignificant & of no use in the model).
 Reject the null hypothesis: β1 is not zero (i.e., the line fitted is a significant one).
 To perform the hypothesis test, derive the p-value for the given β̂1.
 If the p-value < 0.05, reject H0 (i.e., β̂1 is significant). A statsmodels sketch follows below.
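A hedged sketch of checking the significance of β1 with statsmodels; the toy x, y arrays are illustrative, not from the lecture.

```python
# Fit OLS and read the p-value of the slope coefficient.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

X = sm.add_constant(x)                       # adds the intercept column (beta_0)
results = sm.OLS(y, X).fit()

print(results.params)                        # [beta_0_hat, beta_1_hat]
print(results.pvalues)                       # p-values for each coefficient
if results.pvalues[1] < 0.05:
    print("Reject H0: beta_1 is significant")
```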


Multiple Linear Regression

Multiple Linear Regression

 The equation is given by
Y = β0 + β1X1 + ⋯ + βnXn + ε

 Interpretation of the coefficients (a sketch follows below):
   βi is the change in the expected value of Y, E(Y), when Xi increases by 1 unit, keeping the other predictors (independent variables) constant.
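Here is a minimal sketch of fitting an MLR model and reading its coefficients; the file and column names (TV, Radio, Newspaper, Sales) are assumed to mirror the advertising example used later in these slides.

```python
# Fit an MLR model and print one coefficient per predictor.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

model = LinearRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_):
    # Each coefficient: change in E(Sales) per unit change in that predictor,
    # holding the other predictors constant.
    print(f"{name}: {coef:.4f}")
print("intercept:", model.intercept_)
```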

Some Characteristics of MLR

 The model now fits a hyperplane instead of a line.
 Coefficients are still obtained by minimising the sum of squared errors, the least squares criterion.
 The assumptions on the error terms from simple linear regression still hold:
   Zero mean
   Independent
   Normally distributed
   Constant variance

Multiple Linear Regression

Example: Sales vs TV, Radio & Newspaper advertising

Independent Variables        R²
TV                           0.816
TV + Newspaper               0.836
TV + Radio                   0.910
TV + Radio + Newspaper       0.910

For demonstration, run the Python code (a hedged sketch follows below).
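A sketch of comparing R² across predictor combinations, as in the table above; the CSV and column names are assumptions for illustration.

```python
# Compare training R^2 for every combination of the three predictors.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
predictors, target = ["TV", "Radio", "Newspaper"], "Sales"

for k in range(1, len(predictors) + 1):
    for combo in combinations(predictors, k):
        model = LinearRegression().fit(df[list(combo)], df[target])
        r2 = model.score(df[list(combo)], df[target])   # R^2 on training data
        print(f"{' + '.join(combo):30s} R^2 = {r2:.3f}")
```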

Model Assessment and Comparison

 Once a MLR model is built:
   Assess it in terms of its predictive power.
   You may build more than 1 model, using different combinations of predictor variables.
 Then, compare all the models to find which one yields the optimal result(s).

Model Assessment and Comparison

 Suppose there is a dataset with 10 variables.
 We build 2 models with different sets of predictor variables, as given below:
   Model A: uses only 6 predictor variables (risks bias error)
   Model B: uses all 10 predictor variables (risks variance error)
 Their R² is almost the same.
 Which one to choose?

Model Assessment and Comparison

Bias-Variance Tradeoff

Bias
 Simple models with fewer predictor variables.
 Error due to wrong assumptions.
 High bias means the model misses relevant connections between the predictor variables & the output variable.
 Results in underfitting.

Variance
 Complex models with more predictor variables.
 The model is highly sensitive to slight fluctuations in the training set.
 High variance means the model will train on random noise as well.
 Results in overfitting.

Model Assessment and Comparison

Bias & Variance

(Figure: bias low, variance low vs. bias high, variance low)
Source: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Model Assessment and Comparison

Bias & Variance

(Figure: bias low, variance high vs. bias high, variance high)
Source: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Model Assessment and Comparison

 Tweak R² to accommodate the number of records (N) & the number of predictor variables (p):

Adjusted R² = 1 − (1 − R²)(N − 1) / (N − 1 − p)

 Model(s) with more variables & fewer records must be penalized over those with a large number of records & fewer variables:

AIC = n × log(RSS / n) + 2p

where,
n = sample size
p = number of predictor variables
R² = 1 − RSS/TSS (RSS = Residual Sum of Squares; TSS = Total Sum of Squares)

A lower AIC is better. A computation sketch of both formulas follows below.
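Here is a sketch of the adjusted-R² and AIC formulas from this slide written as plain functions; the example inputs (n = 50 records) are hypothetical.

```python
# Adjusted R^2 and AIC as defined on this slide.
import math

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - 1 - p)."""
    return 1 - (1 - r2) * (n - 1) / (n - 1 - p)

def aic(rss: float, n: int, p: int) -> float:
    """AIC = n * log(RSS / n) + 2p (lower is better)."""
    return n * math.log(rss / n) + 2 * p

# Example: same R^2, but the 3-predictor model is penalized slightly more
print(adjusted_r2(0.910, n=50, p=2))   # e.g., TV + Radio
print(adjusted_r2(0.910, n=50, p=3))   # e.g., TV + Radio + Newspaper
```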

New Considerations in MLR

Considerations:
 Overfitting: higher accuracy in training than in testing.
 Multicollinearity: inter-dependence of the predictor variables.
 Feature Selection: selecting the optimal features.

Considerations- Overfitting

 As you keep adding variables, the model may become too complex.

 It may end up memorising the training data and, consequently, fail to generalise.

 A model is generally said to overfit when the training accuracy is high while the test accuracy is very low.

Picture Source: Quora

Considerations- Multicollinearity

 This refers to associations between predictor (independent) variables.
 Thought experiment: if X1 & X2 are the same, is there any difference in the effects of the following combinations?
   5·X1 + 5·X2 will have the same effect on the output as
   2·X1 + 8·X2, or 13·X1 − 3·X2, etc.
   This makes it difficult to understand which of the variables X1 & X2 is contributing to the variation in the output.
 In real-world data, variables may not be exactly collinear, but there may still be a strong relationship between 2 or more variables.

Considerations- Multicollinearity contd…

Affects:
 Interpretation
   Does "change in Y when all others are held constant" apply?
   No, because some of the variables (Xi) are not independent.
 Inference
   Coefficients swing wildly; signs can invert.
   Therefore, p-values are not reliable.

Considerations- Multicollinearity contd…

Detection:
 Looking at pairwise correlations (for 2 variables):
   Look at the correlation between different pairs of (independent) variables.
   Use a scatter plot for visual inspection and the correlation matrix for quantitative inspection (a sketch follows below).
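A short sketch of the quantitative check, printing the pairwise correlation matrix over the predictor columns; the file and column names are assumptions.

```python
# Pairwise correlations between predictors; values near +/-1 suggest collinearity.
import pandas as pd

df = pd.read_csv("advertising.csv")          # hypothetical file
corr = df[["TV", "Radio", "Newspaper"]].corr()
print(corr)
```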
Images: from Google
Considerations- Multicollinearity contd…

Detection:
 Checking the Variance Inflation Factor (VIF) (for 2 or more correlated variables):

VIFᵢ = 1 / (1 − Rᵢ²)

 The idea behind VIF: suppose there are 4 predictor variables, say X1, X2, X3 & X4.
   Build a model taking X1 as the dependent variable and the remaining 3 as independent variables.
   If that model fits well (high Rᵢ²), then X1 is strongly correlated with X2, X3 & X4.
   This is repeated for X2, then X3, and so on…
 Rules of thumb (a computation sketch follows below):
   VIF > 10: the VIF value is high, & the variable should be eliminated.
   VIF > 5: can be okay, but it is worth inspecting.
   VIF < 5: good VIF value; no need to eliminate this variable.
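Here is a sketch of computing VIF for each predictor with statsmodels; the data frame and columns are assumed as before.

```python
# VIF per predictor; the constant column is added so each auxiliary
# regression includes an intercept.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("advertising.csv")          # hypothetical file
X = sm.add_constant(df[["TV", "Radio", "Newspaper"]])

for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(f"{col:10s} VIF = {variance_inflation_factor(X.values, i):.2f}")
```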

Considerations- Multicollinearity contd…

How to Deal:
 Dropping variables
   Drop the variable that is highly correlated with others.
   Keep the one which has business interpretability.
 Creating a new variable using the interactions of the older variables
   Add interaction features, i.e., features derived using some of the original features.
 Variable transformations
   Principal Component Analysis

Considerations- Feature Selection

 For building an optimal model, try all possible combinations of independent variables to get the best fit.
 Manual feature (variable) elimination:
   Build the model with all the features.
   Drop the features that are the least helpful in prediction (high p-value).
   Drop the features that are redundant (using correlations and VIF).
   Rebuild the model and repeat.
   Limitation: works when the number of variables is small; tedious when it is large (say in the 100s).
 Automated approaches to feature (variable) elimination:
   Recursive Feature Elimination (RFE): select the top "n" features (a sketch follows below).
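A hedged sketch of RFE with scikit-learn, ranking predictors and keeping the top n; the dataset and columns are assumptions, and n = 2 is arbitrary.

```python
# Recursive Feature Elimination around a linear regression estimator.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
for col, keep, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{col:10s} selected={keep} rank={rank}")
```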

Dealing with Categorical Variables

Dummy variable table:

Relationship Status    Single   In a Relationship   Married
Single                 1        0                   0
In a Relationship      0        1                   0
Married                0        0                   1

 Drop the Single column:

Relationship Status    In a Relationship   Married
Single                 0                   0
In a Relationship      1                   0
Married                0                   1

 For a categorical variable with N levels, N−1 dummy variables are enough to describe it (see the pandas sketch below).
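A small pandas sketch of this N−1 dummy encoding, assuming the same three levels. Note that pandas drops the first level in sorted order, which need not be "Single" as on the slide; which level is dropped does not change the information content.

```python
# Encode a 3-level categorical variable as N-1 = 2 dummy columns.
import pandas as pd

df = pd.DataFrame(
    {"Relationship Status": ["Single", "In a Relationship", "Married"]}
)

dummies = pd.get_dummies(df["Relationship Status"], drop_first=True)
print(dummies)                               # 2 dummy columns for 3 levels
```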
Feature Scaling

 A technique to standardize independent (continuous/numerical) variables within a fixed range at the pre-processing stage.
 Why?
   Because algorithms based on gradient descent tend to assign greater weightage to features (or variables) with larger magnitudes, & vice-versa.
   This will impact the performance of the model.

Feature Scaling Techniques

 Standardisation:
   Rescales the data to mean 0 and standard deviation 1.
   x′ = (x − μ) / σ

 Min-Max Scaling (Normalization):
   Brings all the data into the range of 0 to 1.
   x′ = (x − x_min) / (x_max − x_min)

A sketch of both techniques follows below.
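Here is a sketch of both scaling techniques with scikit-learn; the toy matrix is illustrative, with one small-magnitude and one large-magnitude column.

```python
# Apply standardisation and min-max scaling column-wise.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

print(StandardScaler().fit_transform(X))     # mean 0, std 1 per column
print(MinMaxScaler().fit_transform(X))       # range [0, 1] per column
```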

Reference

 Upgrad.com
 http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis
