Machine Learning +
Linear Regression
AN INTRODUCTION
Discussion

 Intro to Machine Learning (ML)
   Models: Regression, Classification, Clustering
   Broader Classification of ML Models: Supervised learning, Unsupervised learning
 Linear Regression
   Regression Line
   Best Fit Line
   Strength of a simple Linear Regression
   Assumptions
   Reading & Understanding Data
   Hypothesis Testing
   Building a linear model


Introduction to Machine Learning

ML Model Classification

Based on the task performed & the nature of the output

 Regression: the output variable to be predicted is a continuous variable, e.g., the score of a student in a subject.
 Classification: the output variable to be predicted is a categorical variable, e.g., classifying incoming emails as spam or ham.
 Clustering: no predefined notion of a label is allocated to the groups/clusters formed, e.g., customer segmentation.

Regression

 Previous Year: 12th Marks → BET marks (both known)
 Current Year: 12th Marks → BET marks = ??? (to be predicted)

Classification

Emails     SPAM or HAM

Email-1    SPAM
Email-2    SPAM
Email-3    SPAM
Email-4    HAM
Email-5    HAM
Email-6    SPAM
Email-7    SPAM
Email-8    ???

Clustering

Example: news articles are grouped into clusters such as Sports, Business, and Political.
ML Model Broader Classification

 Supervised Learning
   Regression: continuous labels (e.g., previous year's score of a student in BET)
   Classification: categorical labels (e.g., Spam or Ham)
 Unsupervised Learning
   Clustering: no labels; uses the data itself to create clusters

ML Model Classification contd…

Broader Classification
 Supervised learning: past data with labels is used for building the model (Regression, Classification).
 Unsupervised learning: no predefined labels are assigned to past data (Clustering).


Simple Linear Regression

Linear Regression

 Simple LR: model with 1 independent variable.
 Multiple LR: model with more than 1 independent variable.

Linear Regression

Build a simple LR model.

Objective: finding answers to questions like
1. Is there any relationship between Sales & Marketing?
2. What will be the sales for a future Marketing expenditure?
etc. (A fitting sketch follows below.)
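As an illustration, here is a minimal sketch of fitting such a model in Python. The file advertising.csv and the column names Marketing and Sales are assumptions for illustration, not from the lecture.

```python
# A minimal sketch of fitting Sales against Marketing spend with scikit-learn.
# The CSV file and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # assumed file with 'Marketing' & 'Sales'
X = df[["Marketing"]]                        # predictor (independent variable)
y = df["Sales"]                              # response (dependent variable)

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):    ", model.coef_[0])

# Predict sales for a future marketing expenditure, e.g., 250 units
print("Predicted sales:", model.predict(pd.DataFrame({"Marketing": [250]}))[0])
```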

Regression Line

Standard equation of the regression line: Y = β0 + β1X

 X: the independent variable, also called the predictor variable.
 Y: the dependent variable, also called the output variable.

Best Fit Line

 How to find the best-fit line?
 Answer: calculate the residuals
   Use Ordinary Least Squares (OLS)
   Find the Residual Sum of Squares (RSS)
   RSS is the cost function (it needs to be minimized); a worked sketch follows below.
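Below is a small sketch of ordinary least squares for one predictor, using the closed-form formulas; the x and y arrays are illustrative toy data.

```python
# Closed-form OLS for simple linear regression, plus the RSS cost.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)          # vertical distances from the line
rss = np.sum(residuals ** 2)                 # Residual Sum of Squares (the cost)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, RSS={rss:.3f}")
```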

Cost Function

Minimizing the RSS cost function, two routes:
 Differentiation (closed-form solution)
 Gradient Descent: start with initial parameters (β0 & β1) and update them iteratively (a sketch follows below)
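Here is a minimal sketch of gradient descent for the two parameters of simple LR; the learning rate and iteration count are arbitrary choices for illustration, and the toy data matches the OLS sketch above.

```python
# Gradient descent on the (mean) squared-error cost for simple LR.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

beta0, beta1 = 0.0, 0.0                      # start with initial parameters
lr, n = 0.01, x.size

for _ in range(5000):
    resid = y - (beta0 + beta1 * x)
    # Partial derivatives of the mean squared error (RSS/n) w.r.t. beta0, beta1
    grad0 = -2.0 * resid.sum() / n
    grad1 = -2.0 * (resid * x).sum() / n
    beta0 -= lr * grad0
    beta1 -= lr * grad1

print(f"beta0={beta0:.3f}, beta1={beta1:.3f}")   # converges to the OLS solution
```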

Excel Demo of Simple LR

Strength of Simple Linear Regression

 After determining the best-fit line, there are a few critical questions that need answers:
1. How well does the best-fit line represent the scatter plot?
2. How well does the best-fit line predict new data?

 Both are determined using R² (called the Coefficient of Determination).

RSS & TSS

 TSS gives us the deviation of all the points from the mean line: TSS = Σ(yᵢ − ȳ)².
 RSS gives the deviation of the points from the fitted line: RSS = Σ(yᵢ − ŷᵢ)².

Strength of Simple Linear Regression
R² (Coefficient of Determination)

 R² is a number which explains what portion of the variation in the given data is explained by the developed model.
 It always takes a value between 0 & 1.

R² = 1 − RSS/TSS

where,
RSS = Residual Sum of Squares
TSS = Total Sum of Squares (deviation of the data from the mean)

A computation sketch follows below.
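A quick sketch of computing R² = 1 − RSS/TSS for a fitted simple LR; it reuses the toy data above, with coefficients rounded from the OLS sketch.

```python
# R^2 from RSS and TSS for the toy data fitted earlier.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
beta0, beta1 = 0.27, 1.93                    # rounded OLS coefficients from above

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)               # deviation from the fitted line
tss = np.sum((y - y.mean()) ** 2)            # deviation from the mean line
print("R^2 =", 1 - rss / tss)                # close to 1 => line explains the data well
```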

Physical Significance of R²

R² = 1 − RSS/TSS

Simple Linear Regression

 Model formulation: Y = β0 + β1x + ε
 Best-fit line: minimize the Residual Sum of Squares (RSS)
 Assessing goodness of fit: R² = 1 − RSS/TSS
   Meaning: "how much variation in Y can be explained by the model"
 Make predictions: ŷ = β̂0 + β̂1x

 Question: Is assuming that the output variable depends linearly on the input variable enough to build our LR model?

Simple Linear Regression

Answer: No
 Because you are drawing inferences about the population using a sample.
 The population is much larger than the sample.
 This introduces errors into the model.
 Therefore, it is important to define broader assumptions for the model.

Simple Linear Regression- Assumptions

 Assumption-1: There is a linear relationship between X and Y.

Simple Linear Regression- Assumptions

 Assumption-2: Error terms are normally distributed with mean zero.

 If you just wish to fit a line and not make any further interpretations, then this assumption is not required.

 Question: Why the normal distribution? Why not some other distribution?
 Answer: It is generally observed that error terms follow a normal distribution with mean equal to zero.

Simple Linear Regression- Assumptions

 Assumption-3: Error terms are independent of each other.

Simple Linear Regression- Assumptions

 Assumption-4: Error terms have constant variance (called homoscedasticity).
   The variance should not increase (or decrease) as the error values change.
   The variance should not follow any pattern as the error terms change.

Hypothesis Testing in Simple LR
(significance of the derived beta coefficient)

 ŷ = β̂0 + β̂1x
 Is the beta coefficient significant?
 If you run a LR on a dataset in Python, it will fit a line on the data.
 If a line is fit, then it will have β̂0 & β̂1.
 At this stage, check whether β̂1 is significant or not.
 Start with a hypothesis: β1 is not significant (i.e., there is no relationship between y & x).

Hypothesis Testing in Simple LR
(significance of the derived beta coefficient)

 Null hypothesis, H0: β1 = 0
 Alternate hypothesis, H1: β1 ≠ 0
 Fail to reject the null hypothesis: β1 is zero (i.e., insignificant & of no use in the model).
 Reject the null hypothesis: β1 is not zero (i.e., the line fitted is a significant one).
 To perform the hypothesis test, derive the p-value for the given β̂1.
 If the p-value < 0.05, reject H0 (i.e., β̂1 is significant). A statsmodels sketch follows below.
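A hedged sketch of checking the significance of β1 with statsmodels; the toy x, y arrays are illustrative, not from the lecture.

```python
# Fit OLS and read the p-value of the slope coefficient.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

X = sm.add_constant(x)                       # adds the intercept column (beta_0)
results = sm.OLS(y, X).fit()

print(results.params)                        # [beta_0_hat, beta_1_hat]
print(results.pvalues)                       # p-values for each coefficient
if results.pvalues[1] < 0.05:
    print("Reject H0: beta_1 is significant")
```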


Multiple Linear Regression

Multiple Linear Regression

 The equation is given by
Y = β0 + β1X1 + ⋯ + βnXn + ε

 Interpretation of the coefficients (a sketch follows below):
   βi is the change in the expected value of Y, E(Y), when Xi increases by 1 unit, keeping the other predictors (independent variables) constant.
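Here is a minimal sketch of fitting an MLR model and reading its coefficients; the file and column names (TV, Radio, Newspaper, Sales) are assumed to mirror the advertising example used later in these slides.

```python
# Fit an MLR model and print one coefficient per predictor.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

model = LinearRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_):
    # Each coefficient: change in E(Sales) per unit change in that predictor,
    # holding the other predictors constant.
    print(f"{name}: {coef:.4f}")
print("intercept:", model.intercept_)
```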

Some Characteristics of MLR

 The model now fits a hyperplane instead of a line.
 Coefficients are still obtained by minimising the sum of squared errors, the least squares criterion.
 The assumptions on the error terms from simple linear regression still hold:
   Zero mean
   Independent
   Normally distributed
   Constant variance

Multiple Linear Regression

Example: Sales vs TV, Radio & Newspaper advertising

Independent Variables        R²
TV                           0.816
TV + Newspaper               0.836
TV + Radio                   0.910
TV + Radio + Newspaper       0.910

For demonstration, run the Python code (a hedged sketch follows below).
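A sketch of comparing R² across predictor combinations, as in the table above; the CSV and column names are assumptions for illustration.

```python
# Compare training R^2 for every combination of the three predictors.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
predictors, target = ["TV", "Radio", "Newspaper"], "Sales"

for k in range(1, len(predictors) + 1):
    for combo in combinations(predictors, k):
        model = LinearRegression().fit(df[list(combo)], df[target])
        r2 = model.score(df[list(combo)], df[target])   # R^2 on training data
        print(f"{' + '.join(combo):30s} R^2 = {r2:.3f}")
```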

Model Assessment and Comparison

 Once a MLR model is built:
   Assess it in terms of its predictive power.
   You may build more than 1 model, using different combinations of predictor variables.
 Then, compare all the models to find which one yields the optimal result(s).

Model Assessment and Comparison

 Suppose there is a dataset with 10 variables.
 We build 2 models with different sets of predictor variables, as given below:
   Model A: uses only 6 predictor variables (risks bias error)
   Model B: uses all 10 predictor variables (risks variance error)
 Their R² is almost the same.
 Which one to choose?

Model Assessment and Comparison

Bias-Variance Tradeoff

Bias
 Simple models with fewer predictor variables.
 Error due to wrong assumptions.
 High bias means the model misses relevant connections between the predictor variables & the output variable.
 Results in underfitting.

Variance
 Complex models with more predictor variables.
 The model is highly sensitive to slight fluctuations in the training set.
 High variance means the model will train on random noise as well.
 Results in overfitting.

Model Assessment and Comparison

Bias & Variance

(Figure: bias low, variance low vs. bias high, variance low)
Source: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Model Assessment and Comparison

Bias & Variance

(Figure: bias low, variance high vs. bias high, variance high)
Source: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Model Assessment and Comparison

 Tweak R² to accommodate the number of records (N) & the number of predictor variables (p):

Adjusted R² = 1 − (1 − R²)(N − 1) / (N − 1 − p)

 Model(s) with more variables & fewer records must be penalized over those with a large number of records & fewer variables:

AIC = n × log(RSS / n) + 2p

where,
n = sample size
p = number of predictor variables
R² = 1 − RSS/TSS (RSS = Residual Sum of Squares; TSS = Total Sum of Squares)

A lower AIC is better. A computation sketch of both formulas follows below.
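Here is a sketch of the adjusted-R² and AIC formulas from this slide written as plain functions; the example inputs (n = 50 records) are hypothetical.

```python
# Adjusted R^2 and AIC as defined on this slide.
import math

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - 1 - p)."""
    return 1 - (1 - r2) * (n - 1) / (n - 1 - p)

def aic(rss: float, n: int, p: int) -> float:
    """AIC = n * log(RSS / n) + 2p (lower is better)."""
    return n * math.log(rss / n) + 2 * p

# Example: same R^2, but the 3-predictor model is penalized slightly more
print(adjusted_r2(0.910, n=50, p=2))   # e.g., TV + Radio
print(adjusted_r2(0.910, n=50, p=3))   # e.g., TV + Radio + Newspaper
```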

New Considerations in MLR

Considerations:
 Overfitting: higher accuracy in training than in testing.
 Multicollinearity: inter-dependence of the predictor variables.
 Feature Selection: selecting the optimal features.

Considerations- Overfitting

 As you keep adding variables, the model may become too complex.

 It may end up memorising the training data and, consequently, fail to generalise.

 A model is generally said to overfit when the training accuracy is high while the test accuracy is very low.

Picture Source: Quora

Considerations- Multicollinearity

 This refers to associations between predictor (independent) variables.
 Thought experiment: if X1 & X2 are the same, is there any difference in the effects of the following combinations?
   5·X1 + 5·X2 will have the same effect on the output as
   2·X1 + 8·X2, or 13·X1 − 3·X2, etc.
   This makes it difficult to understand which of the variables X1 & X2 is contributing to the variation in the output.
 In real-world data, variables may not be exactly collinear, but there may still be a strong relationship between 2 or more variables.

Considerations- Multicollinearity contd…

Affects:
 Interpretation
   Does "change in Y when all others are held constant" apply?
   No, because some of the variables (Xi) are not independent.
 Inference
   Coefficients swing wildly; signs can invert.
   Therefore, p-values are not reliable.

Considerations- Multicollinearity contd…

Detection:
 Looking at pairwise correlations (for 2 variables):
   Look at the correlation between different pairs of (independent) variables.
   Use a scatter plot for visual inspection and the correlation matrix for quantitative inspection (a sketch follows below).
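A short sketch of the quantitative check, printing the pairwise correlation matrix over the predictor columns; the file and column names are assumptions.

```python
# Pairwise correlations between predictors; values near +/-1 suggest collinearity.
import pandas as pd

df = pd.read_csv("advertising.csv")          # hypothetical file
corr = df[["TV", "Radio", "Newspaper"]].corr()
print(corr)
```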
Images: from Google
Considerations- Multicollinearity contd…

Detection:
 Checking the Variance Inflation Factor (VIF) (for 2 or more correlated variables):

VIFᵢ = 1 / (1 − Rᵢ²)

 The idea behind VIF: suppose there are 4 predictor variables, say X1, X2, X3 & X4.
   Build a model taking X1 as the dependent variable and the remaining 3 as independent variables.
   If that model fits well (high Rᵢ²), then X1 is strongly correlated with X2, X3 & X4.
   This is repeated for X2, then X3, and so on…
 Rules of thumb (a computation sketch follows below):
   VIF > 10: the VIF value is high, & the variable should be eliminated.
   VIF > 5: can be okay, but it is worth inspecting.
   VIF < 5: good VIF value; no need to eliminate this variable.
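Here is a sketch of computing VIF for each predictor with statsmodels; the data frame and columns are assumed as before.

```python
# VIF per predictor; the constant column is added so each auxiliary
# regression includes an intercept.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("advertising.csv")          # hypothetical file
X = sm.add_constant(df[["TV", "Radio", "Newspaper"]])

for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(f"{col:10s} VIF = {variance_inflation_factor(X.values, i):.2f}")
```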

Considerations- Multicollinearity contd…

How to Deal:
 Dropping variables
   Drop the variable that is highly correlated with others.
   Keep the one which has business interpretability.
 Creating a new variable using the interactions of the older variables
   Add interaction features, i.e., features derived using some of the original features.
 Variable transformations
   Principal Component Analysis

Considerations- Feature Selection

 For building an optimal model, try all possible combinations of independent variables to get the best fit.
 Manual feature (variable) elimination:
   Build the model with all the features.
   Drop the features that are the least helpful in prediction (high p-value).
   Drop the features that are redundant (using correlations and VIF).
   Rebuild the model and repeat.
   Limitation: works when the number of variables is small; tedious when it is large (say in the 100s).
 Automated approaches to feature (variable) elimination:
   Recursive Feature Elimination (RFE): select the top "n" features (a sketch follows below).
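A hedged sketch of RFE with scikit-learn, ranking predictors and keeping the top n; the dataset and columns are assumptions, and n = 2 is arbitrary.

```python
# Recursive Feature Elimination around a linear regression estimator.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")          # hypothetical file
X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
for col, keep, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{col:10s} selected={keep} rank={rank}")
```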

Dealing with Categorical Variables

Dummy variable table:

Relationship Status    Single   In a Relationship   Married
Single                 1        0                   0
In a Relationship      0        1                   0
Married                0        0                   1

 Drop the Single column:

Relationship Status    In a Relationship   Married
Single                 0                   0
In a Relationship      1                   0
Married                0                   1

 For a categorical variable with N levels, N−1 dummy variables are enough to describe it (see the pandas sketch below).
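A small pandas sketch of this N−1 dummy encoding, assuming the same three levels. Note that pandas drops the first level in sorted order, which need not be "Single" as on the slide; which level is dropped does not change the information content.

```python
# Encode a 3-level categorical variable as N-1 = 2 dummy columns.
import pandas as pd

df = pd.DataFrame(
    {"Relationship Status": ["Single", "In a Relationship", "Married"]}
)

dummies = pd.get_dummies(df["Relationship Status"], drop_first=True)
print(dummies)                               # 2 dummy columns for 3 levels
```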
Feature Scaling

 A technique to standardize independent (continuous/numerical) variables within a fixed range at the pre-processing stage.
 Why?
   Because algorithms based on gradient descent tend to assign greater weightage to features (or variables) with larger magnitudes, & vice-versa.
   This will impact the performance of the model.

Feature Scaling Techniques

 Standardisation:
   Rescales the data to mean 0 and standard deviation 1.
   x′ = (x − μ) / σ

 Min-Max Scaling (Normalization):
   Brings all the data into the range of 0 to 1.
   x′ = (x − x_min) / (x_max − x_min)

A sketch of both techniques follows below.
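Here is a sketch of both scaling techniques with scikit-learn; the toy matrix is illustrative, with one small-magnitude and one large-magnitude column.

```python
# Apply standardisation and min-max scaling column-wise.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

print(StandardScaler().fit_transform(X))     # mean 0, std 1 per column
print(MinMaxScaler().fit_transform(X))       # range [0, 1] per column
```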

Reference

 Upgrad.com
 http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis
