Machine Learning +
Linear Regression
AN INTRODUCTION
Discussion
Based on Task Performed & Nature of the Output
Regression: The output variable to be predicted is a continuous variable, e.g., the score of a student in a subject.
Classification: The output variable to be predicted is a categorical variable, e.g., classifying incoming emails as spam or ham.
Clustering: No predefined notion of a label is allocated to the groups/clusters formed, e.g., customer segmentation.
News articles: Sports, Business, Political
Data Analytics [ELE 4077] 05-12-2021
ML Model Broader Classification
Supervised Learning: Classification; Categorical Labels; e.g., Spam or Ham
Unsupervised Learning: Clustering; No Labels; Uses data to create clusters
Broader Classification
Supervised learning
Unsupervised learning
LR
Simple LR: Model with 1 independent variable
Multiple LR: Model with more than 1 independent variable
Standard Equation of regression line: $Y = \beta_0 + \beta_1 x$
x: Independent variable, also called predictor variable
Y: Dependent variable, also called output variable
RSS Minimization
1. Differentiation
2. Gradient Descent [start with initial parameters ($\beta_0$ & $\beta_1$)]
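The two routes to minimizing RSS can be compared with a minimal sketch. It assumes a simple LR with one predictor; the data, the learning rate, and the step count are hypothetical values chosen only for illustration, and the function names are not from the slides.

```python
def fit_by_differentiation(xs, ys):
    """Closed-form estimates obtained by setting the RSS derivatives to zero."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def fit_by_gradient_descent(xs, ys, learning_rate=0.01, steps=20000):
    """Iteratively reduce RSS/n, starting from initial parameters (0, 0)."""
    n = len(xs)
    beta0, beta1 = 0.0, 0.0  # initial parameters (assumed starting point)
    for _ in range(steps):
        residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of RSS/n with respect to beta0 and beta1.
        grad0 = -2.0 * sum(residuals) / n
        grad1 = -2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1


# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

exact = fit_by_differentiation(xs, ys)
approx = fit_by_gradient_descent(xs, ys)
print("differentiation :", exact)
print("gradient descent:", approx)
```

On this small dataset both routes should land on essentially the same line. In general the learning rate has to be small enough for gradient descent to converge.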
After determining the best fit line, there are a few critical questions that
need answers:
1. How well does the best fit line represent the scatter plot?
2. How well does the best fit line predict the new data?
$R^2 = 1 - \frac{RSS}{TSS}$
where, RSS = Residual Sum of Errors
TSS = Total Sum of Errors of the Data from the Mean
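As a small illustration of the R² formula, the sketch below computes RSS and TSS directly from observed and predicted values. The function name and the numbers are hypothetical, chosen to show the two boundary cases.

```python
def r_squared(ys, predictions):
    """R^2 = 1 - RSS / TSS for observed ys and model predictions."""
    y_mean = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # residual sum of errors
    tss = sum((y - y_mean) ** 2 for y in ys)                  # total sum of errors from the mean
    return 1.0 - rss / tss


# Hypothetical observations, not taken from the slides.
ys = [2.0, 4.0, 6.0, 8.0]

perfect = r_squared(ys, ys)                      # predictions equal the data
mean_only = r_squared(ys, [5.0, 5.0, 5.0, 5.0])  # always predict the mean of ys
print(perfect, mean_only)
```

A model that reproduces the data exactly has RSS = 0, so R² is 1. A model that always predicts the mean has RSS = TSS, so R² is 0.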
Model Formulation: $Y = \beta_0 + \beta_1 x + \epsilon$
Best-Fit Line: minimize Residual Sum of Errors (RSS)
Assessing goodness of Fit: $R^2 = 1 - \frac{RSS}{TSS}$
Meaning: “how much variation in Y can be explained by the model”
Make predictions: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Answer: No
Because you are drawing inferences on the population using a sample.
The population is much larger than the sample.
This introduces errors in the model.
Therefore, it is important to define broader assumptions for the model.
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Is the beta coefficient significant?
If you run an LR on a dataset in Python, it will fit a line to the data.
Any fitted line will have $\hat{\beta}_0$ & $\hat{\beta}_1$.
At this stage, check whether $\hat{\beta}_1$ is significant.
Start with a null hypothesis: $\beta_1 = 0$, i.e., the slope is not significant (no relationship between y & x).
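One common way to test this hypothesis is a t-test on the slope. The sketch below assumes the usual simple-LR setup, where the standard error of the slope is sqrt((RSS / (n - 2)) / Sxx). The data and function name are hypothetical. It stops at the t-statistic, because the p-value needs a t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (beta1_hat, standard_error, t_statistic) for a simple LR fit."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Residual variance uses n - 2 because two parameters were estimated.
    standard_error = math.sqrt((rss / (n - 2)) / sxx)
    return beta1, standard_error, beta1 / standard_error


# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta1, se, t_stat = slope_t_statistic(xs, ys)
print(f"beta1_hat={beta1:.3f}  se={se:.4f}  t={t_stat:.2f}")
```

A large |t| is evidence against the hypothesis of no relationship. In practice, a statistics library such as SciPy (`scipy.stats.linregress`) reports the corresponding p-value directly.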
Equation is given by
$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon$
Independent Variable(s)    R²
TV                         0.816
TV + Newspaper             0.836
TV + Radio                 0.910
TV + Radio + Newspaper     0.910
Then, compare all the models to find which one yields optimal result(s).
Bias-Variance Tradeoff
Bias:
Simple models with fewer predictor variables.
Error due to wrong assumptions.
High bias means: the model misses relevant connections b/w predictor variables & output variable.
Variance:
Complex models with more predictor variables.
Model is highly sensitive to slight fluctuations in the training set.
High variance means: the model will train on random noise as well.
Considerations
Overfitting: Higher accuracy in training than in testing
Multicollinearity: Inter-dependence of predictor variables
Feature Selection: Selecting optimal features
Affects:
Interpretation
Does “change in Y when all others are held constant” apply?
No, because some of the variables (Xi) are not independent.
Inference
Coefficients swing wildly, signs can invert.
Therefore, p-values are not reliable.
Detection:
Looking at pairwise correlations (for 2 variables): the correlation between different pairs of (independent) variables
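A pairwise check can be sketched with the Pearson correlation coefficient. The column names `x1`, `x2`, `x3` and their values are hypothetical, with `x2` deliberately built to be close to twice `x1`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(columns):
    """Correlation for every distinct pair of named predictor columns."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical predictors: x2 is roughly 2 * x1, x3 is only loosely related.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
    "x3": [3.0, 3.0, 1.0, 1.0, 5.0, 5.0],
}

pairs = pairwise_correlations(columns)
for pair, r in pairs.items():
    print(pair, round(r, 3))
```

A correlation close to +1 or -1 for a pair of predictors flags that pair for inspection. Dependence that involves three or more variables at once may not show up in any single pair.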
Detection:
Checking Variance Inflation Factor (VIF) (for 2 or more correlated variables):
$VIF_i = \frac{1}{1 - R_i^2}$
Idea in developing VIF: if there are 4 predictor variables, say X1, X2, X3 & X4, can we build a model taking X1 as the dependent variable and the remaining 3 as independent variables?
If such a model fits well (high $R_i^2$), then X1 is strongly correlated with X2, X3 & X4.
This is repeated for X2, then X3 and so on.
If VIF is
> 10: VIF value is high, & the variable should be eliminated.
> 5: Can be okay, but it is worth inspecting.
< 5: Good VIF value. No need to eliminate this variable.
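The VIF procedure can be sketched as follows: regress each predictor on the remaining ones, take the R² of that fit, and apply VIF = 1 / (1 - R²). This is a minimal standard-library sketch that solves the least-squares normal equations by Gaussian elimination. The helper names and the data are hypothetical. A numerical library would normally do the regression step.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared_of_fit(target, predictors):
    """R^2 from regressing target on the predictors, with an intercept."""
    n = len(target)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(rows, target)) for a in range(k)]
    beta = solve_linear(xtx, xty)
    preds = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean = sum(target) / n
    rss = sum((y - p) ** 2 for y, p in zip(target, preds))
    tss = sum((y - mean) ** 2 for y in target)
    return 1.0 - rss / tss


def vif(columns):
    """VIF_i = 1 / (1 - R_i^2) for each named predictor column."""
    result = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared_of_fit(values, others))
    return result


# Hypothetical predictors: x2 is roughly 2 * x1, x3 is only loosely related.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
    "x3": [3.0, 3.0, 1.0, 1.0, 5.0, 5.0],
}

vifs = vif(columns)
for name, value in vifs.items():
    print(name, round(value, 2))
```

With this data, `x1` and `x2` should come out well above the threshold of 10, while `x3` stays below 5.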
How to Deal:
Dropping variables
Drop the variable that is highly correlated with others.
Among correlated variables, keep the one which has business interpretability.
Variable transformations
Principal component analysis
Relationship Status (original column): Single, In a Relationship, Complicated

Dummy table
Relationship Status    Single    In a Relationship    Married
Single                 1         0                    0
In a relationship      0         1                    0
Married                0         0                    1
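The dummy table can be reproduced with a minimal one-hot encoding sketch. The function name is hypothetical, and the category order is taken from the table's columns.

```python
def one_hot(values, categories):
    """Encode each value as a row of 0/1 flags, one flag per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Category order follows the dummy table's columns.
categories = ["Single", "In a relationship", "Married"]
statuses = ["Single", "In a relationship", "Married"]

encoded = one_hot(statuses, categories)
for status, row in zip(statuses, encoded):
    print(f"{status:<20}{row}")
```

Each row contains exactly one 1, in the column that matches the original category.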
Standardisation:
Rescales all the data to have mean 0 and standard deviation 1
$x' = \frac{x - \mu}{\sigma}$
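A minimal sketch of standardisation, assuming that the mean and standard deviation are computed from the same column being rescaled and that the population standard deviation is used. The values and the function name are hypothetical.

```python
import statistics


def standardise(values):
    """Apply x' = (x - mu) / sigma to every value in the column."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column, not taken from the slides.
raw = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = standardise(raw)
print([round(v, 3) for v in scaled])
```

After rescaling, the column has mean 0 and standard deviation 1. The shape of its distribution is unchanged.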
Upgrad.com
http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis