
Unit 1 - Week 2

Linear Regression, Bias, Variance, Underfitting and Overfitting, Curse of Dimensionality, and ROC

Regression
• In regression, a single dependent or outcome variable is predicted with
the help of one or more independent variables.
• It is a type of supervised learning: given a training set (X, Y), we need to
learn a function f so that for an unknown (test) input X it predicts the
value of Y.
• In regression the output is continuous.
• Many models could be used – the simplest is linear regression
– Fit the data with the best hyperplane which "goes through" the points

Figure: y (dependent variable, output) plotted against x (independent variable, input).
Linear regression
• Given an input x, compute an output y.
• For example:
– Predict height from age
– Predict house price from house area
– Predict distance from a wall from sensor readings

Some fits to the data: which is best?

Figure: the red line is the fitted target curve; M is the degree of the fitted equation (here M = 0).
Simple Linear Regression Equation
E(y) = β0 + β1x

The regression line has intercept β0 and slope β1.
Linear Regression Model

• The relationship between the variables is a linear function:

Y = β0 + β1X + ε

where β0 is the population Y-intercept, β1 the population slope, and ε the random error.
Types of Regression Models

Regression models split by the number of features and by the form of the relationship:
– Simple (1 feature): Linear or Non-Linear
– Multiple (2+ features): Linear or Non-Linear
Sample of 15 houses from the region:

House Number   Y: Actual Selling Price   X: House Size (100s ft²)
 1                      89.5                     20.0
 2                      79.9                     14.8
 3                      83.1                     20.5
 4                      56.9                     12.5
 5                      66.6                     18.0
 6                      82.5                     14.3
 7                     126.3                     27.5
 8                      79.3                     16.5
 9                     119.9                     24.3
10                      87.6                     20.2
11                     112.6                     22.0
12                     120.8                     19.0
13                      78.5                     12.3
14                      74.3                     14.0
15                      74.8                     16.7
Averages                88.84                    18.17
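As a quick sketch (Python with matplotlib assumed; the listing and variable names are illustrative, not from the slides), the table can be plotted to reproduce the price-vs-size scatter shown on the next slide:

```python
import matplotlib.pyplot as plt

# House size (100s of ft²) and actual selling price, from the table above
size = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5,
        24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3,
         119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8]

plt.scatter(size, price)
plt.xlabel("House size (100s ft²)")
plt.ylabel("Actual selling price")
plt.title("House price vs size")
plt.show()
```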
House price vs size

Linear Regression – Multiple Variables

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

• β0 is the intercept (i.e. the average value of Y if all
the X’s are zero); βj is the slope for the jth variable Xj.
Criterion for choosing what line to draw:
method of least squares
• Choose the coefficients β0 and β1 to minimize the sum of squared errors (SSE):

SSE = Σᵢ (yᵢ − (β0 + β1xᵢ))²

The regression line
The least-squares regression line is the unique line such that
the sum of the squared vertical (y) distances between the data
points and the line is the smallest possible.

How do we "learn" parameters?
• For the 2-D problem, to find the values of the coefficients which minimize the
objective function (the SSE, sum of squared errors), we take the partial derivatives of the
objective function with respect to the coefficients, set these to 0, and solve:

∂SSE/∂β0 = −2 Σᵢ (yᵢ − β0 − β1xᵢ) = 0
∂SSE/∂β1 = −2 Σᵢ xᵢ (yᵢ − β0 − β1xᵢ) = 0

which gives β1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² and β0 = ȳ − β1 x̄.
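A minimal Python sketch of this procedure (numpy assumed; not part of the original slides), applying the resulting closed-form solution to the house data from the earlier table:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least squares for y = b0 + b1*x.

    Setting dSSE/db0 = 0 and dSSE/db1 = 0 and solving gives:
        b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
        b0 = mean(y) - b1 * mean(x)
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# House data from the earlier slide (size in 100s ft², selling price)
size = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5,
        24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3,
         119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8]
b0, b1 = fit_simple_linear_regression(size, price)
print(f"price ≈ {b0:.2f} + {b1:.2f} * size")
```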

Multiple Linear Regression

• In matrix form, with design matrix X (a first column of ones for the intercept) and output vector y, minimizing the SSE gives the normal equations, whose solution is

β̂ = (XᵀX)⁻¹ Xᵀ y
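An illustrative sketch (numpy assumed; the function name is hypothetical): rather than forming the inverse explicitly, np.linalg.lstsq solves the same least-squares problem more stably:

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    A column of ones is prepended for the intercept; the solver then
    minimizes ||X_aug @ beta - y||², i.e. the SSE of the normal equations.
    """
    X_aug = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    beta, *_ = np.linalg.lstsq(X_aug, np.asarray(y, float), rcond=None)
    return beta  # beta[0] is the intercept b0, beta[1:] are the slopes

# Example with two input variables: data generated from y = 1 + x1 + 2*x2
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [6.0, 5.0, 12.0, 11.0]
print(fit_multiple_linear_regression(X, y))  # ≈ [1., 1., 2.]
```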


Defining the line
We try to minimize the distance between each data point and the line that we fit.
✔ We can measure the distance between a point and a line.
✔ We can then minimize an error function that measures the sum of all these
distances: minimize the sum-of-squares of the errors (least-squares optimization).
✔ We choose the parameters β in order to minimize the squared difference between
the prediction and the actual data value, summed over all of the data points. Given
the input matrix X, the prediction is Xβ.

For two variables (intercept and slope), this yields the β0 and β1 solution derived earlier.

Turning classification problems into regression problems
• This can be done in two ways:
– First, by introducing an indicator variable, which simply says which class
each data point belongs to. The problem is now to use the data to predict
the indicator variable, which is a regression problem.
– The second approach is to do repeated regression, once for each class,
with the indicator value being 1 for examples in the class and 0 for all of
the others.

In both cases we are making a prediction about an unknown value y
(such as the indicator variable for classes or a
future value of some data) by computing some
function of known values xᵢ. With a straight-line
model, the output y is a sum of the xᵢ
values, each multiplied by a constant parameter:

y = Σᵢ βᵢ xᵢ
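A minimal sketch of the second (one-regression-per-class) approach, with numpy assumed and all names illustrative; a test point is assigned to the class whose regression predicts the largest indicator value:

```python
import numpy as np

def fit_indicator_regressions(X, labels, n_classes):
    """One least-squares regression per class on a 0/1 indicator target."""
    labels = np.asarray(labels)
    X1 = np.column_stack([np.ones(len(X)), X])      # add intercept column
    betas = []
    for c in range(n_classes):
        t = (labels == c).astype(float)             # indicator: 1 in class c, else 0
        beta, *_ = np.linalg.lstsq(X1, t, rcond=None)
        betas.append(beta)
    return np.array(betas)

def predict_class(betas, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    scores = X1 @ betas.T                           # one indicator prediction per class
    return scores.argmax(axis=1)                    # pick the largest
```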

Bias
• Data bias in machine learning is a type of error in which certain elements
of a dataset are more heavily weighted and/or represented than others.
• A biased dataset does not accurately represent a model's use case,
resulting in skewed outcomes, low accuracy levels, and analytical errors.
• From Elite Data Science, bias is: “Bias occurs when an algorithm has
limited flexibility to learn the true signal from the dataset.”
• Wikipedia states, “… bias is an error from erroneous assumptions in the
learning algorithm. High bias can cause an algorithm to miss the relevant
relations between features and target outputs (underfitting).”
• “Bias is the algorithm’s tendency to consistently learn the wrong thing by
not taking into account all the information in the data (underfitting).”
• A high bias means the prediction will be inaccurate.
• Bias is the difference between the average prediction of our model and
the correct value which we are trying to predict. Model with high bias
pays very little attention to the training data and oversimplifies the model.
It always leads to high error on training and test data.

• parametric algorithms are prone to high bias. A parametric
algorithm is defined as, “A learning model that summarizes
data with a set of parameters of fixed size (independent of
the number of training examples) is called a parametric
model. No matter how much data you throw at a
parametric model, it won’t change its mind about how
many parameters it needs.”
• Linear regression is an example of a parametric algorithm.
Parametric models are easy to understand but not flexible enough to
learn the underlying signal in the data. Thus, they are inaccurate for
complex datasets.
• Examples of high-bias algorithms include Linear Regression,
Linear Discriminant Analysis, and Logistic Regression.

Variance
• From EliteDataScience: “Variance refers to an algorithm’s
sensitivity to specific sets of the training data.”
• Wikipedia states, “… variance is an error from sensitivity to small
fluctuations in the training set. High variance can cause an algorithm to
model the random noise in the training data, rather than the intended
outputs (overfitting).”
• Variance is the algorithm’s tendency to learn random things irrespective of
the real signal, by fitting highly flexible models that follow the error/noise in
the data too closely (overfitting).
• Variance is the variability of model prediction for a given data point, which
tells us the spread of our predictions. A model with high variance pays a lot
of attention to the training data and does not generalize to data it
hasn’t seen before. As a result, such models perform very well on training
data but have high error rates on test data.
• Variance leads to overfitting, in which small fluctuations in the training set
are magnified. A model with high variance may reflect random noise
in the training data set instead of the target function.

• Variance is the difference between many models’
predictions.
• It arises when we implement complicated models: any ‘noise’
in the dataset may be captured by the model. High variance tends to
occur when we use complicated models that can overfit our training sets.
• For example: a complicated model might treat people’s
names as a good predictor of the target.
• However, names are random and should not have any
predictive power.
• In one dataset, people with the name ‘Alex’ might appear
likely to be criminals.
• In another dataset, people with the name ‘Alex’ might appear
likely to be graduates. Hence, names should not be
used as a predictive variable.

What is the TRADE-OFF?
• If you have a simple model, you might conclude that everyone named “Alex”
is an amazing person.
• This presents a high-bias, low-variance problem:
your dataset is ‘biased’ towards people with the name Alex, so
most predictions will be similar, since you believe people named
‘Alex’ act a certain way.
• You attempt to fix the model. However, now the model is too
complicated: it gives different results for different groups. Thus, Alex
can be a wonderful person, a criminal, an athlete, and a scholar.
• You must find a balance! Helpfully, if you do cross-validation,
you can train on many datasets and average their predictions.
Unfortunately, you cannot simultaneously minimize both bias and variance.

Low Bias — High Variance:
A low bias and high variance problem is overfitting. Different models pick up
insights specific to their respective datasets, so the models will predict
differently. However, if we average the results, we will get a pretty accurate
prediction.
High Bias — Low Variance:
The predictions will be similar to one another, but on average they are
inaccurate.

Figure: bulls-eye diagram of the four bias/variance combinations.
Lessons From Andrew Ng’s Course:

If you have a HIGH VARIANCE problem:
• You can get more training examples, because with a larger dataset the model is
more likely to find the signal rather than the noise.
• Try smaller sets of features (because you are overfitting).
• Try increasing lambda, so you cannot overfit the training set as much. The
higher the lambda, the more regularization applies, for linear
regression with regularization (see the sketch below this list).
If you have a HIGH BIAS problem:
• Try getting additional features; you are over-generalizing from the datasets.
• Try adding polynomial features, making the model more complicated.
• Try decreasing lambda, so you can try to fit the data better. The lower the
lambda, the less regularization applies, for linear regression with
regularization.
Reminders:
• If a learning algorithm is suffering from high variance, getting more
training data helps a lot. High variance and low bias means overfitting;
this is caused by fitting the data too well. With more data, the model will
find the signal and not the noise.
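An illustrative sketch of the lambda effect (scikit-learn assumed, where lambda is exposed as the alpha parameter of Ridge): increasing it shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + 0.1 * rng.normal(size=30)   # only the first feature carries signal

# Increasing alpha (lambda) applies more regularization and shrinks the
# coefficients, which combats high variance (overfitting).
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>6}: sum of |coefficients| = {np.abs(model.coef_).sum():.3f}")
```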

Bias and Variance: bulls-eye diagram
• The goal of supervised learning is to learn the target
function which can best determine the target variable
from the set of input variables.
• In the bulls-eye diagram, bias corresponds to how far the average
prediction lands from the centre (the true value), and variance to how
widely the predictions scatter around their average.

Underfitting and overfitting
In supervised learning, underfitting:
• happens when a model is unable to capture the underlying pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have very little data to build an accurate model, or when we try to fit a linear model to
nonlinear data.
• Also, these kinds of models are too simple to capture the complex patterns in the data, e.g. linear and logistic regression.
In supervised learning, overfitting:
• happens when our model captures the noise along with the underlying pattern in the data.
• It happens when we train our model a lot over a noisy dataset.
• These models have low bias and high variance.
• These models are very complex, like decision trees, which are prone to overfitting.
Underfitting
Goal: the goal of supervised learning is to derive the target function which can best determine
the target variable from the set of input variables.
The fitness of the target function found by the learning algorithm determines how correctly it is able to
classify a set of data it has never seen.
Underfitting:
• If the target function is kept too simple, it may not be able to capture the essential nuances
(subtleties) and represent the underlying data well. Underfitting happens when a model is unable to capture
the underlying pattern of the data:
– when we have too little training data to build an accurate model, or
– when we try to represent nonlinear data with a linear model.
• Also, these kinds of models are too simple to capture the complex patterns in the data, like
linear and logistic regression; such models usually have high bias and low variance.
• Underfitting results in poor performance on both the training and test data.
Can be avoided by:
– Using more training data
– Reducing features by effective feature selection.

Overfitting

• This refers to a situation where the model has been designed in such a way
that it emulates the training data too closely.
• This occurs due to trying to fit an excessively “complex model” to loosely match
the training data.
• The target function tries to fit every training data point exactly; this happens when our model
captures the noise along with the underlying pattern in the data.
• Any specific deviation in the training data, like noise or outliers, gets embedded
in the model and affects the performance of the model on the test data.
• It happens when we train our model a lot over a noisy dataset.
• These models have low bias and high variance.
• These models give good performance on the training set but poor generalization.
To avoid overfitting (a sketch of point 1 follows this list):
1. Use resampling techniques like cross-validation.
2. Remove nodes which have little or no predictive power for the machine to
learn.
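A hedged sketch of remedy 1 using scikit-learn (assumed available): cross-validation scores show whether a more complex model actually generalizes. Capping the tree depth, as in the second setting, is one way of removing low-value nodes (remedy 2):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A depth-limited tree is less able to memorize noise than an unlimited one;
# 5-fold cross-validation estimates how well each setting generalizes.
for depth in [None, 3]:
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```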

How to use Learning Curves to Diagnose Machine
Learning Model Performance
• A learning curve is a plot of model learning performance
over experience or time.
• Learning curves are a widely used diagnostic tool in
machine learning for algorithms that learn from a training
dataset incrementally.
• Learning Curve: line plot of learning (y-axis) over
experience (x-axis).
• Train Learning Curve: learning curve
calculated from the training dataset that gives an idea of
how well the model is learning.
• Validation Learning Curve: learning curve calculated from
a hold-out validation dataset that gives an idea of how well
the model is generalizing.
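A minimal sketch (scikit-learn assumed) that computes both curves at increasing training-set sizes; plotting the two score columns against the sizes gives the train and validation learning curves described above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Train and validation scores at increasing training-set sizes (5-fold CV)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```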

Underfit Learning Curves
• Underfitting refers to a model that cannot learn the
training dataset.
• A plot of learning curves shows underfitting if either:
– The training loss remains flat regardless of training, or
– The training loss continues to decrease until the end of training.

Overfit Learning Curves
• Overfitting refers to a model that has learned the training dataset too well,
including the statistical noise or random fluctuations in the training dataset.
• A plot of learning curves shows overfitting if:
– The plot of training loss continues to decrease with experience.
– The plot of validation loss decreases to a point and begins increasing again.
• The inflection point in validation loss may be the point at which training
could be halted as experience after that point shows the dynamics of
overfitting.

Good Fit Learning Curves
• A good fit is the goal of the learning algorithm and exists between an overfit and
underfit model.
• A good fit is identified by a training and validation loss that decreases to a point
of stability with a minimal gap between the two final loss values.
• The loss of the model will almost always be lower on the training dataset than
the validation dataset. This means that we should expect some gap between the
train and validation loss learning curves. This gap is referred to as the
“generalization gap.”
• A plot of learning curves shows a good fit if:
– The plot of training loss decreases to a point of stability.
– The plot of validation loss decreases to a point of stability and has a small gap
with the training loss.


The Curse of Dimensionality
This refers to the set of problems that arise when working with high-dimensional data.
• The dimension of a dataset corresponds to the number of features/attributes present in the
dataset.
• Datasets with a large number of features, generally on the order of 100 or more, are referred to
as high-dimensional data.
• When high-dimensional data is presented to a model during analysis (training or visualization) to
identify patterns, the model tends to get confused rather than find them.
• The difficulties related to training machine learning models on high-dimensional data are referred
to as the “curse of dimensionality”.

• As the number of features grows, the dimension grows, and the amount of data we need to
generalize accurately also grows exponentially.
• The term exponential is bad, as in computer science it is associated with prohibitive computational
complexity and time.
• The two facets of the curse of dimensionality are (a sketch of the second follows this list):
1. Data sparsity – leading to high variance (overfitting).
2. Distance concentration – all the pairwise distances between different
sample points in the space converge to the same value as the dimensionality of the data
increases.
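A small numpy sketch (assumed, not in the slides) of distance concentration: the relative gap between the smallest and largest pairwise distances among random points shrinks as the dimensionality d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For random points in [0,1]^d, the relative gap between the nearest and
# farthest pairwise distances shrinks as d grows (distance concentration).
for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists = dists[np.triu_indices(100, k=1)]       # unique pairs only
    print(f"d={d:5d}: (max - min) / min = {(dists.max() - dists.min()) / dists.min():.3f}")
```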
Data sparsity

In Table 1, the target value depends on
gender and age group only. For these two
variables, we need eight training
samples to cover all the combinations.
If the target depends on a third attribute,
let’s say body type, the number of
training samples required to cover all the
combinations increases phenomenally:
for three variables, we need 24 samples.

• The essence of the curse is the realization that as the
number of dimensions increases, the volume of the unit
hypersphere does not increase with it.

Figure: as the dimensionality increases, the
classifier’s performance increases until the
optimal number of features is reached; further
increasing the dimensionality without
increasing the number of training samples
results in a decrease in classifier performance.

The volume of the unit hypersphere tends to zero as the dimensionality
increases.
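This can be checked directly from the standard formula for the volume of the unit hypersphere, V(d) = π^(d/2) / Γ(d/2 + 1) (a plain Python sketch):

```python
import math

# Volume of the unit hypersphere in d dimensions:
#   V(d) = pi**(d/2) / Gamma(d/2 + 1)
for d in [1, 2, 3, 5, 10, 20, 50]:
    v = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    print(f"d={d:2d}: V = {v:.6f}")   # peaks near d=5 (~5.26), then falls toward zero
```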
• In reality, some combinations of attributes occur more often than
others; because of this, the training sample available for building the model may
not capture all possible combinations.
• This aspect, where the training samples do not capture all combinations, is
referred to as data sparsity.
• Training a model with sparse data can lead to a high-variance or overfitting
condition.
• This is because, while training, the model learns from the
frequently occurring combinations of the attributes and can predict those outcomes
accurately.
• At prediction time, when less frequently occurring combinations are fed to the
model, it may not predict the outcome accurately.
• For example, for a 64x64 RGB image the dimensionality of the data is 64×64×3 = 12288, and most of
the possible combinations of pixels make no sense for any kind of
classification; they are simply noise.

Classification Error and Noise
• For classification: the Confusion Matrix is a square matrix
containing all possible classes in both the horizontal and vertical
directions.
• List the classes along the top of the table as the predicted outputs, and
down the left side as the targets.
• Each element (i, j) of the matrix tells us how many input patterns
belong to class i in the target but were put into class j by the algorithm.
The diagonal elements (c1→c1, c2→c2, c3→c3) are correct classifications.
• For class c3 in the example below, two patterns were misclassified as c1.

            Predicted
           c1   c2   c3
Target c1   5    1    0
       c2   1    4    1
       c3   2    0    4
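A minimal sketch (numpy assumed; the helper name is illustrative) of building such a matrix from target and predicted labels, with rows as targets and columns as predictions to match the table above:

```python
import numpy as np

def confusion_matrix(targets, predictions, n_classes):
    """cm[i, j] = how many patterns of target class i were put into class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(targets, predictions):
        cm[t, p] += 1
    return cm

# Tiny example: one c3 (index 2) pattern misclassified as c1 (index 0)
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 1, 2, 0]
print(confusion_matrix(targets, predictions, 3))
```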

There are two primary types of errors:
– Type 1 errors (false positives): rejection of a true null
hypothesis.
– Type 2 errors (false negatives): non-rejection of a
false null hypothesis.
• TP (true positive): an observation correctly put
into class 1.
• FP (false positive): an observation incorrectly put
into class 1.

• Accuracy: Overall, how often is the classifier correct?
– (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate / Error Rate: Overall, how often is it
wrong?
– (FP+FN)/total = (10+5)/165 = 0.09 (= 1 − Accuracy)
• True Positive Rate or "Sensitivity" or "Recall": When
it's actually yes, how often does it predict yes?
– TP/actual yes = 100/105 = 0.95

Example 2×2 confusion matrix:

                Actual yes   Actual no
Predicted yes      100           10
Predicted no         5           50

✔ Class yes (positive) = 105
✔ Class no (negative) = 60
✔ Total = 165
✔ Correctly classified = 150 (100 positive cases and 50 negative cases)
✔ Wrongly classified = 15
✔ Actually true but classified as false = 5
✔ Actually false but predicted true = 10
• False Positive Rate: When it's actually no, how often does it predict
yes?
– FP/actual no = 10/60 = 0.17
• True Negative Rate / Specificity: When it's actually no, how often
does it predict no?
– TN/actual no = 50/60 = 0.83 (= 1 − False Positive Rate)
• Precision: When it predicts yes, how often is it correct? The proportion
of positive predictions which are truly positive.
– TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our
sample?
– actual yes/total = 105/165 = 0.64
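The same quantities as a plain Python sketch, using the four cells of the 2×2 matrix on the previous slide:

```python
TP, FP, FN, TN = 100, 10, 5, 50     # cells of the example matrix above
total = TP + FP + FN + TN           # 165

accuracy    = (TP + TN) / total     # 0.91
error_rate  = (FP + FN) / total     # 0.09  (= 1 - accuracy)
sensitivity = TP / (TP + FN)        # recall / true positive rate = 0.95
fpr         = FP / (FP + TN)        # false positive rate = 0.17
specificity = TN / (FP + TN)        # 0.83  (= 1 - fpr)
precision   = TP / (TP + FP)        # 0.91
prevalence  = (TP + FN) / total     # 0.64

print(accuracy, sensitivity, specificity, precision)
```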

Example: medical disease prediction (benign or malignant); prediction of
malignant is the class of interest.
• Sensitivity: the proportion of tumours that are actually
malignant and are predicted as malignant.
• Specificity: indicates how well the model balances being
excessively conservative against excessively
aggressive; the portion of benign tumours which are
correctly classified.
• Precision: the proportion of positive predictions which are
truly positive. It indicates the reliability of the model
in predicting the class of interest.
• A model with high specificity and sensitivity is often more
desirable than one judged on accuracy alone.

F-score: combines precision and recall by taking their harmonic
mean:
• F = (2 × precision × recall) / (precision + recall)
• For example, with precision 0.91 and recall 0.95, F = (2 × 0.91 × 0.95)/(0.91 + 0.95) ≈ 0.93.
• Different models can be compared with the F-score.
Receiver Operating Characteristic (ROC):
• Visualization is an easy and effective way to understand model
performance; it also helps in comparing the efficiency of two models.
• A ROC curve is constructed by plotting the true positive rate (TPR)
against the false positive rate (FPR).
• This is a plot of the percentage of true positives on the y-axis against
false positives on the x-axis.
• The true positive rate is the proportion of observations that were
correctly predicted to be positive out of all positive observations
(TP/(TP + FN)).
• Similarly, the false positive rate is the proportion of observations
that are incorrectly predicted to be positive out of all negative
observations (FP/(TN + FP)).

• The ROC curve shows the trade-off between sensitivity (or TPR)
and specificity (1 – FPR).
• Classifiers that give curves closer to the top-left corner indicate a
better performance.
• As a baseline, a random classifier is expected to give points lying
along the diagonal (FPR = TPR).
• The closer the curve comes to the 45-degree diagonal of the ROC
space, the less accurate the test.

Area under curve (AUC)
• To compare different classifiers, it can be useful to
summarize the performance of each classifier into a single
measure.
• One common approach is to calculate the area under the
ROC curve, which is abbreviated to AUC.
• It is equivalent to the probability that a randomly chosen
positive instance is ranked higher than a randomly chosen
negative instance
• A classifier with a high AUC can occasionally score worse in a
specific region than another classifier with a lower AUC.
• But in practice, the AUC performs well as a general measure
of predictive accuracy.
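A short sketch (scikit-learn assumed; the scores are made-up illustrative values) computing the ROC curve points and the AUC:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])                  # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3, 0.6, 0.5])  # model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                # area under it
print(f"AUC = {auc:.3f}")                           # plot tpr vs fpr for the curve
```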

MACHINE LEARNING PROCESS
• Data Collection and Preparation: Machine learning algorithms need significant amounts of data,
preferably without too much noise; but with increased dataset size comes increased
computational cost, and the sweet spot at which there is enough data without excessive
computational overhead is generally impossible to predict.
• Feature Selection: This consists of identifying the features that are most useful for the problem
under examination. It invariably requires prior knowledge of the problem and the data; our
common sense was used in the coins example above to identify some potentially useful features
and to exclude others.
• Algorithm Choice: Given the dataset, choose an appropriate algorithm.
• Parameter and Model Selection: For many of the algorithms there are parameters that have to be
set manually, or that require experimentation to identify appropriate values.
• Training: Training should be simply the use of computational resources in order to build a model of
the data.
• Evaluation: Before a system can be deployed it needs to be tested and evaluated for accuracy on
data that it was not trained on. This can often include a comparison with human experts in the
field, and the selection of appropriate metrics for this comparison.

