Dr.S.Malarvizhi-Prof-ECE-SRM IST-18ECE307J 1
Regression
• In regression, a single dependent or outcome variable is predicted with
the help of one or more independent variables.
• It is a type of supervised learning: given a training set (X, Y), we
need to learn a function so that, given an unseen (test) X, it
predicts the value of Y.
• In regression the output is continuous.
• Many models could be used – Simplest is linear regression
– Fit data with the best hyper-plane which "goes through" the points
[Figure: y axis is the dependent variable (output); x axis is the independent variable (input)]
Linear regression
• Given an input x, compute an output y
• For example:
- Predict height from age
- Predict house price from house area
- Predict distance from wall from sensors
Some fits to the data: which is best?
[Figure: regression line with intercept β0 and slope β1]
Linear Regression Model
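A minimal sketch of the model this heading refers to, assuming the standard simple linear regression form and the β0 (intercept) and β1 (slope) labels used in these slides; the error term ε is an added assumption:

```latex
\[
y = \beta_0 + \beta_1 x + \varepsilon
\]
```

Here y is the dependent variable (output) and x is the independent variable (input).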
Types of Regression Models
Regression Models
– Simple (1 feature): Linear or Non-Linear
– Multiple (2+ features): Linear or Non-Linear
Sample of 15 houses from the region.
House Number   Y: Actual Selling Price   X: House Size (100s ft²)
1              89.5                      20.0
2              79.9                      14.8
3              83.1                      20.5
4              56.9                      12.5
5              66.6                      18.0
6              82.5                      14.3
7              126.3                     27.5
8              79.3                      16.5
9              119.9                     24.3
10             87.6                      20.2
11             112.6                     22.0
12             120.8                     19.0
13             78.5                      12.3
14             74.3                      14.0
15             74.8                      16.7
Averages       88.84                     18.17
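As a rough illustration of fitting a line to this sample, the sketch below computes the least-squares intercept and slope in plain Python. The function name `fit_simple_linear_regression` is hypothetical, and the size of house 12 is assumed to be 19.0, the value that reproduces the stated column average of 18.17.

```python
# House data from the table: selling price and size (100s of ft^2).
# Assumption: house 12's size is read as 19.0, the value that reproduces
# the stated column average of 18.17.
prices = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3,
          119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8]
sizes = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5,
         24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]


def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = s_xy / s_xx               # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


beta0, beta1 = fit_simple_linear_regression(sizes, prices)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(sizes, prices)]
print(f"intercept beta0 = {beta0:.2f}, slope beta1 = {beta1:.2f}")
print(f"sum of squared residuals = {sum(r * r for r in residuals):.2f}")
```

The fitted line always passes through the point of averages, so the residuals sum to zero up to rounding.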
House price vs size
Linear Regression – Multiple Variables
Linear Regression
Criterion for choosing what line to draw:
method of least squares
The regression line
The least-squares regression line is the unique line such that
the sum of the squared vertical (y) distances between the data
points and the line is the smallest possible.
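Written out for a line y = β0 + β1 x fitted to n points (x_i, y_i), the criterion and its standard closed-form minimizer can be sketched as follows. The symbols n, SSE, x̄, and ȳ are notation assumed here, not taken from the slides.

```latex
\[
\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \beta_1 x_i \bigr)^2
\]
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```

The minimizer is unique as long as the x values are not all identical.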
How do we "learn" parameters
• For the 2-D problem
Multiple Linear Regression
Problem
Defining the line
✔ We try to minimize the distance between each datapoint and the line that we fit.
✔ We can measure the distance between a point and a line.
✔ Now we can try to minimize an error function that measures the sum of all these
distances. Minimizing the sum of squares of the errors is least-squares optimization.
✔ We choose the parameters β to minimize the squared difference between
the prediction and the actual data value, summed over all of the datapoints. Given
an input vector x, the prediction is xβ.
Two variables:
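In matrix form, the least-squares problem and its standard solution can be sketched as below. This assumes a design matrix X (one row per datapoint, with a leading column of ones for the intercept) and a target vector y, neither of which is named this way in the slides.

```latex
\[
\hat{\beta} = \arg\min_{\beta} \, (y - X\beta)^{\top} (y - X\beta) = (X^{\top} X)^{-1} X^{\top} y
\]
\[
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad \text{(two input variables)}
\]
```

The closed form assumes that XᵀX is invertible; with two input variables the fitted model is a plane rather than a line.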
Turning classification problems into regression problems
• This can be done in two ways
– first by introducing an indicator variable, which simply says which class
each data point belongs to. The problem is now to use the data to predict
the indicator variable, which is a regression problem.
– The second approach is to do repeated regression, once for each class,
with the indicator value being 1 for examples in the class and 0 for all of
the others.
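The second approach can be sketched as follows: build one 0/1 indicator target per class, each of which would then be used as the output of a separate regression. The function name `indicator_targets` and the labels are hypothetical, chosen only for illustration.

```python
def indicator_targets(labels):
    """Build one 0/1 indicator target list per class (one regression per class)."""
    classes = sorted(set(labels))
    return {c: [1 if label == c else 0 for label in labels] for c in classes}


# Hypothetical class labels for six datapoints.
labels = ["c1", "c2", "c1", "c3", "c2", "c1"]
targets = indicator_targets(labels)
for cls, indicator in targets.items():
    print(cls, indicator)
```

One common choice, not stated in the slides, is to assign a new point to the class whose regression gives the largest output.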
Bias
• Data bias in machine learning is a type of error in which certain elements
of a dataset are more heavily weighted and/or represented than others.
• A biased dataset does not accurately represent a model's use case,
resulting in skewed outcomes, low accuracy levels, and analytical errors.
• From Elite Data Science, bias is: “Bias occurs when an algorithm has
limited flexibility to learn the true signal from the dataset.”
• Wikipedia states, “… bias is an error from erroneous assumptions in the
learning algorithm. High bias can cause an algorithm to miss the relevant
relations between features and target outputs (underfitting).”
• “Bias is the algorithm’s tendency to consistently learn the wrong thing by
not taking into account all the information in the data (underfitting).”
• A high bias means the prediction will be inaccurate.
• Bias is the difference between the average prediction of our model and
the correct value which we are trying to predict. A model with high bias
pays very little attention to the training data and oversimplifies the model.
This leads to high error on both training and test data.
• Parametric algorithms are prone to high bias. A parametric
algorithm is defined as, “A learning model that summarizes
data with a set of parameters of fixed size (independent of
the number of training examples) is called a parametric
model. No matter how much data you throw at a
parametric model, it won’t change its mind about how
many parameters it needs.”
• Linear regression is an example of a parametric algorithm.
Such algorithms are easy to understand but not flexible enough to learn the
underlying signal of the data. Thus, they are inaccurate for
complex datasets.
• Examples of high-bias algorithms include Linear Regression,
Linear Discriminant Analysis, and Logistic Regression.
Variance
• From EliteDataScience, the variance is: “Variance refers to an algorithm’s
sensitivity to specific sets of the training set.”
• Wikipedia states, “… variance is an error from sensitivity to small
fluctuations in the training set. High variance can cause an algorithm to
model the random noise in the training data, rather than the intended
outputs (overfitting).”
• “Variance is the algorithm’s tendency to learn random things irrespective of
the real signal by fitting highly flexible models that follow the error/noise in
the data too closely (overfitting).”
• Variance is the variability of model prediction for a given data point, or a
value which tells us the spread of our data. A model with high variance pays a lot
of attention to the training data and does not generalize to data which it
hasn’t seen before. As a result, such models perform very well on training
data but have high error rates on test data.
• Variance leads to overfitting, in which small fluctuations in the training set
are magnified. A model with high variance may reflect random noise
in the training data set instead of the target function.
• Variance is the difference between many models’
predictions.
• High variance tends to occur when we use complicated
models that can overfit our training sets: any ‘noise’ in the
dataset might be captured by the model.
• For example, a complicated model might treat people’s
names as a good predictor.
• However, names are random and should not have any
predictive power.
• In one dataset, the name ‘Alex’ can indicate that people
are likely to be criminals.
• In another dataset, the name ‘Alex’ can indicate that people
are likely to be graduates. Hence, names should not be
used as a predictive variable.
What is the TRADE-OFF?
• If you have a simple model, you might conclude that every “Alex”
is an amazing person.
• This presents a High Bias and Low Variance problem.
• Your dataset is ‘biased’ towards people with the name Alex. Thus,
most predictions will be similar, since you believe people with
‘Alex’ act a certain way.
• You attempt to fix the model. However, the model is too
complicated.
• Your model has different results for different groups. Thus, Alex
can be a wonderful person, a criminal, an athlete, and a scholar.
• You must find a balance! The good thing is that, if you do cross-validation,
you can train on many datasets and average their predictions.
Unfortunately, you cannot minimize both bias and variance at the same time.
Low Bias — High Variance:
A low-bias, high-variance problem is overfitting. Models trained on different
datasets capture insights specific to their respective dataset. Hence, the models will predict
differently. However, if we average the results, we will have a pretty accurate
prediction.
High Bias — Low Variance:
The predictions will be similar to one another but on average, they are
inaccurate.
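These two cases are often summarized by the standard decomposition of expected squared error. The sketch below assumes y = f(x) + ε with zero-mean noise of variance σ², and a model f̂ learned from a randomly drawn training set; this notation is not in the slides.

```latex
\[
\mathbb{E}\bigl[(y - \hat{f}(x))^2\bigr]
= \bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2
+ \mathbb{E}\bigl[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\bigr]
+ \sigma^2
\]
```

The first term is the squared bias, the second is the variance, and σ² is the irreducible noise.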
bulls-eye diagram
Lessons From Andrew Ng’s Course:
Bias and Variance-bulls-eye diagram
• The goal of supervised learning is to learn the target
function, which can best determine the target variable
from the set of input variables.
• Variance is the variability of model prediction for a
given data point, or a value which tells us the spread of our
data. A model with high variance pays a lot of attention
to the training data and does not generalize to data
which it hasn’t seen before. As a result, such models
perform very well on training data but have high error
rates on test data.
underfitting and overfitting
In supervised learning, underfitting:
• happens when a model is unable to capture the underlying pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have too little data to build an accurate model, or when we try to fit a linear model to
nonlinear data.
• Also, these kinds of models, such as linear and logistic regression, are too simple to capture complex patterns in the data.
In supervised learning, overfitting:
• happens when our model captures the noise along with the underlying pattern in the data.
• It happens when we train our model a lot on a noisy dataset.
• These models have low bias and high variance.
• These models are very complex, like decision trees, which are prone to overfitting.
Underfitting
Goal: supervised learning aims to derive the target function that best determines
the target variable from the set of input variables.
The fitness of a target function learned by an algorithm determines how correctly it is able to
classify a set of data it has never seen.
Underfitting:
• If the target function is kept too simple, it may not be able to capture the essential (subtle) nuances
and represent the underlying data well. This happens when a model is unable to capture
the underlying pattern of the data.
• It happens when we have too little training data to build an accurate model,
• or when we try to represent non-linear data with a linear model.
• Also, these kinds of models, such as linear and logistic regression, are too simple to capture
complex patterns in the data.
• Underfitting results in poor performance on both test and training data.
These models usually have high bias and low variance.
Can be avoided by:
– Using more training data
– Reducing features by effective feature selection.
Overfitting
• This refers to a situation where the model has been designed in such a way
that it emulates the training data too closely.
• This occurs when trying to fit an excessively complex model to closely match
the training data.
• The target function tries to fit all of the training data, so the model
captures the noise along with the underlying pattern in the data.
• Any specific deviation in the training data, like noise or outliers, gets embedded
in the model and affects the performance of the model on the test data.
• It happens when we train our model a lot on a noisy dataset.
• These models have low bias and high variance.
• These models provide good performance on the training set but poor generalization.
To avoid overfitting:
1. Use resampling techniques like cross-validation.
2. Remove nodes which have little or no predictive power for a machine to
learn.
How to use Learning Curves to Diagnose Machine
Learning Model Performance
• A learning curve is a plot of model learning performance
over experience or time.
• Learning curves are a widely used diagnostic tool in
machine learning for algorithms that learn from a training
dataset incrementally.
• Learning Curve: Line plot of learning (y-axis) over
experience (x-axis).
• Train Learning Curve: Learning curve
calculated from the training dataset that gives an idea of
how well the model is learning.
• Validation Learning Curve: Learning curve calculated from
a hold-out validation dataset that gives an idea of how well
the model is generalizing.
Underfit Learning Curves
• Underfitting refers to a model that cannot learn the
training dataset.
• A plot of learning curves shows underfitting if:
– The training loss remains flat regardless of training, or
– The training loss continues to decrease until the end of training.
Overfit Learning Curves
• Overfitting refers to a model that has learned the training dataset too well,
including the statistical noise or random fluctuations in the training dataset.
• A plot of learning curves shows overfitting if:
– The plot of training loss continues to decrease with experience.
– The plot of validation loss decreases to a point and begins increasing again.
• The inflection point in validation loss may be the point at which training
could be halted as experience after that point shows the dynamics of
overfitting.
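The halting point described here can be located by finding where the validation loss is lowest. The sketch below uses invented loss values and a hypothetical helper `best_stopping_epoch`; it is one simple way to read the curves, not a method prescribed by the slides.

```python
def best_stopping_epoch(validation_losses):
    """Return the 0-based epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


# Hypothetical loss values, invented for illustration only.
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23]
val_loss = [0.95, 0.78, 0.66, 0.60, 0.58, 0.61, 0.67, 0.74]

stop = best_stopping_epoch(val_loss)
# Overfitting pattern: training loss keeps falling while validation loss rises.
overfitting = val_loss[-1] > val_loss[stop] and train_loss[-1] < train_loss[stop]
print(f"lowest validation loss at epoch {stop}; overfitting after that: {overfitting}")
```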
Good Fit Learning Curves
• A good fit is the goal of the learning algorithm and exists between an overfit and
underfit model.
• A good fit is identified by a training and validation loss that decreases to a point
of stability with a minimal gap between the two final loss values.
• The loss of the model will almost always be lower on the training dataset than
the validation dataset. This means that we should expect some gap between the
train and validation loss learning curves. This gap is referred to as the
“generalization gap.”
• A plot of learning curves shows a good fit if:
• The plot of training loss decreases to a point of stability.
• The plot of validation loss decreases to a point of stability and has a small gap
with the training loss.
The Curse of dimensionality
This refers to a set of problems that arise when working with high-dimensional data.
• The dimension of a dataset corresponds to the number of features/attributes that are present in the
dataset.
• A dataset with a large number of features, generally of the order of 100 or more, is referred to as
high-dimensional data.
• Presenting high-dimensional data to a model during analysis (training or visualization) can
confuse the model rather than help it identify patterns.
• The difficulties of training machine learning models on high-dimensional data are referred to
as the “curse of dimensionality”.
• As the number of features (and so the dimension) grows, the amount of data we need to
generalize accurately grows exponentially.
• Exponential growth is bad because, in computer science, it translates into computational complexity and
time.
• Two facets of the curse of dimensionality are:
• 1. Data sparsity: leads to high variance (or) overfitting.
• 2. Distance concentration: the pairwise distances between different
sample points in the space converge to the same value as the dimensionality of the data
increases.
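Distance concentration can be observed with a small simulation. The sketch below draws uniformly random points in the unit hypercube and reports how much the farthest and nearest distances from one query point differ, relative to the nearest; the function name `relative_contrast`, the point count, and the seed are assumptions made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from one random query point to the rest."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

As the dimension grows the contrast shrinks, meaning the nearest and farthest points become almost equally far away.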
Data sparsity
• The essence of the curse is the realization that as the
number of dimensions increases, the volume of the unit
hypersphere does not increase with it.
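This can be checked numerically with the closed-form volume of the radius-1 ball in d dimensions, π^(d/2) / Γ(d/2 + 1). The helper name `unit_hypersphere_volume` is hypothetical; the formula itself is standard.

```python
import math


def unit_hypersphere_volume(d):
    """Volume of the radius-1 ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (1, 2, 3, 5, 10, 20):
    print(f"d={d:2d}  volume={unit_hypersphere_volume(d):.4f}")
```

The volume peaks around d = 5 and then shrinks toward zero, so in high dimensions almost none of the enclosing hypercube lies inside the hypersphere.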
Figure: As the dimensionality increases, the
classifier’s performance increases until the
optimal number of features is reached. Further
increasing the dimensionality without
increasing the number of training samples
results in a decrease in classifier performance.
Classification Error and noise
• For classification, the confusion matrix is a square matrix
containing all possible classes in both the horizontal and vertical
directions.
• List the classes along the top of the table as the predicted outputs and
down the left side as the targets.
• Each element (i, j) of the matrix tells us how many input patterns
were put into class i in the targets but class j by the algorithm.
The diagonal elements (c1c1, c2c2, c3c3) are the correct classifications.
• For example, two patterns of class c3 are misclassified as c1.
            Predicted outputs
Targets     c1   c2   c3
c1           5    1    0
c2           1    4    1
c3           2    0    4
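As a small worked example, the overall accuracy can be read off this matrix by dividing the diagonal total by the grand total. The sketch assumes rows are targets and columns are predicted outputs, as the bullets describe.

```python
# Confusion matrix from the slide: rows are targets, columns are predictions.
classes = ["c1", "c2", "c3"]
confusion = [
    [5, 1, 0],
    [1, 4, 1],
    [2, 0, 4],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total
print(f"correct = {correct}, total = {total}, accuracy = {accuracy:.3f}")

# Per-class view: diagonal entry divided by its row (target) total.
for i, name in enumerate(classes):
    print(f"{name}: {confusion[i][i]}/{sum(confusion[i])} targets classified correctly")
```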
Two primary types of errors.
– Type 1 errors (false positives) - rejection of a true null
hypothesis
– Type 2 errors (false negatives)- the non-rejection of a
false null hypothesis
• TP true positive - is an observation correctly put
into class 1.
• FP false positive -is an observation incorrectly put
into class 1
formulas
• Accuracy: Overall, how often is the classifier correct?
– (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate/Error Rate: Overall, how often it
is wrong?
– (FP+FN)/total = (10+5)/165 = 0.09 (1-Accuracy)
• True Positive Rate or "Sensitivity" or "Recall": When
it's actually yes, how often does it predict yes?
– TP/actual yes = 100/105 = 0.95
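The arithmetic above can be reproduced directly from the four counts it implies (TP = 100, TN = 50, FP = 10, FN = 5, for a total of 165). The variable names are chosen here for illustration.

```python
# Counts taken from the worked arithmetic in the bullets.
tp, tn, fp, fn = 100, 50, 10, 5
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
recall = tp / (tp + fn)  # true positive rate / sensitivity

print(f"accuracy   = {accuracy:.2f}")
print(f"error rate = {error_rate:.2f}")
print(f"recall     = {recall:.2f}")
```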
Example: medical disease prediction (benign or malignant), where the prediction of
malignant is the class of interest.
• Sensitivity: the proportion of tumours that are actually
malignant and are predicted as malignant.
• Specificity: the proportion of benign tumours which are
correctly classified as benign. It indicates whether a model
strikes a good balance between being excessively conservative
and excessively aggressive.
• Precision: the proportion of positive predictions which are
truly positive. It indicates the reliability of the model
in predicting the class of interest.
• A model with high specificity and sensitivity is
more desirable than one with high accuracy alone.
F-score: it combines precision and recall by taking their harmonic
mean.
• F = (2 × precision × recall) / (precision + recall)
• Different models can be compared with F-score.
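A minimal sketch of the F-score computation follows. The counts TP = 100, FP = 10, FN = 5 are assumed for illustration (they match the counts implied by the accuracy example in these slides), and the helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative counts (assumed): TP = 100, FP = 10, FN = 5.
tp, fp, fn = 100, 10, 5
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, "
      f"F = {f_score(precision, recall):.3f}")
```

Because it is a harmonic mean, the F-score stays low unless both precision and recall are high.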
Receiver operating characteristics ROC:
• Visualization is an easy and effective way to understand model
performance; it also helps in comparing the performance of two models.
• A ROC curve is constructed by plotting the true positive rate (TPR)
against the false positive rate (FPR).
• This is a plot of the percentage of true positives on the y axis against
false positives on the x axis
• The true positive rate is the proportion of observations that were
correctly predicted to be positive out of all positive observations
(TP/(TP + FN)).
• Similarly, the false positive rate is the proportion of observations
that are incorrectly predicted to be positive out of all negative
observations (FP/(TN + FP)).
• The ROC curve shows the trade-off between sensitivity (or TPR)
and specificity (1 – FPR).
• Classifiers that give curves closer to the top-left corner indicate a
better performance.
• As a baseline, a random classifier is expected to give points lying
along the diagonal (FPR = TPR).
• The closer the curve comes to the 45-degree diagonal of the ROC
space, the less accurate the test.
Area under curve (AUC)
• To compare different classifiers, it can be useful to
summarize the performance of each classifier into a single
measure.
• One common approach is to calculate the area under the
ROC curve, which is abbreviated to AUC.
• It is equivalent to the probability that a randomly chosen
positive instance is ranked higher than a randomly chosen
negative instance
• A classifier with high AUC can occasionally score worse in a
specific region than another classifier with lower AUC.
• But in practice, the AUC performs well as a general measure
of predictive accuracy.
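The ranking interpretation of AUC can be computed directly by comparing every positive instance with every negative instance. The sketch below uses invented scores and a hypothetical helper `auc_by_ranking`; counting ties as half a win is a common convention assumed here.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(f"AUC = {auc_by_ranking(scores, labels):.4f}")
```

Here 13 of the 16 positive-negative pairs are ordered correctly, giving an AUC of 0.8125; a random ranking would be expected to give about 0.5.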
MACHINE LEARNING PROCESS
• Data Collection and Preparation: Machine learning algorithms need significant amounts of data,
preferably without too much noise, but with increased dataset size comes increased
computational cost, and the sweet spot at which there is enough data without excessive
computational overhead is generally impossible to predict.
• Feature Selection: It consists of identifying the features that are most useful for the problem
under examination. This invariably requires prior knowledge of the problem and the data; our
common sense was used in the coins example above to identify some potentially useful features
and to exclude others.
• Algorithm Choice: Given the dataset, an appropriate algorithm has to be chosen.
• Parameter and Model Selection: For many of the algorithms there are parameters that have to be
set manually, or that require experimentation to identify appropriate values.
• Training: Training should be simply the use of computational resources in order to build a model of
the data.
• Evaluation: Before a system can be deployed it needs to be tested and evaluated for accuracy on
data that it was not trained on. This can often include a comparison with human experts in the
field, and the selection of appropriate metrics for this comparison.