
Program: B.Tech VII Semester


CSL0777: Machine Learning
Unit No. 1
More about datasets

Lecture No. 08

Mr. Praveen Gupta


Assistant Professor, CSA/SOET
Outline
• Training Data vs. Validation Data vs. Test Data
• Splitting of data
• Bias-Variance Trade off
• Under fitting vs over fitting
• References
Student Effective Learning Outcomes (SELO)
01: Ability to understand subject related concepts clearly along with contemporary
issues.
02: Ability to use updated tools, techniques and skills for effective domain specific
practices.
03: Understanding available tools and products and the ability to use them effectively.
Training Data vs. Validation Data vs. Test Data
Effective machine learning (ML) algorithms require quality training and
testing data — and often lots of it — to make accurate predictions. Different
datasets serve different purposes in preparing an algorithm to make
predictions and decisions based on real-world data.
Training data, test data, and validation data are all typically split from
one large dataset, and each typically has its own distinct use in ML modeling.
Training Data vs. Validation Data vs. Test Data
Training data. This type of data builds up the machine learning algorithm. The data
scientist feeds the algorithm input data, which corresponds to an expected output. The
model evaluates the data repeatedly to learn more about the data’s behavior and then adjusts
itself to serve its intended purpose.
Validation data. During training, validation data introduces data into the model that it
hasn’t evaluated before. Validation data provides the first test against unseen data, allowing
data scientists to evaluate how well the model makes predictions on new data. Not all data
scientists use validation data, but it can provide helpful information for optimizing
hyperparameters, which influence how the model assesses data.
Test data. After the model is built, test data validates once more that it can make accurate
predictions. While training and validation data include labels so the model’s performance
metrics can be monitored, test labels are withheld from the model and used only to score the
final result. Test data provides a final, real-world check against an unseen dataset to confirm
that the ML algorithm was trained effectively.
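As a rough sketch of how one dataset can be carved into these three subsets (pure Python; the 70/15/15 fractions and the function name are illustrative choices, not a fixed rule):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve out test, validation and training subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]               # final, unseen check
    val = items[n_test:n_test + n_val]  # hyperparameter tuning
    train = items[n_test + n_val:]      # model fitting
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (say, by class), an unshuffled split would put unrepresentative samples in each subset.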
Training Data vs. Validation Data vs. Test Data
Training data vs. validation data:
•ML algorithms require training data to achieve an objective. The algorithm will analyze this
training dataset, classify the inputs and outputs, then analyze it again. Trained enough, an
algorithm will essentially memorize all of the inputs and outputs in a training dataset — this
becomes a problem when it needs to consider data from other sources, such as real-world
customers.
•Here is where validation data is useful. Validation data provides an initial check that the model
can return useful predictions in a real-world setting, which training data cannot do. The ML
algorithm can assess training data and validation data at the same time.
•Validation data is an entirely separate segment of data, though a data scientist might carve out
part of the training dataset for validation — as long as the datasets are kept separate throughout
the entirety of training and testing.
Training Data vs. Validation Data vs. Test Data
Training data vs. validation data:
•For example, let’s say an ML algorithm is supposed to analyze a picture of a vertebrate and
provide its scientific classification. The training dataset would include lots of pictures of
mammals, but not all pictures of all mammals, let alone all pictures of all vertebrates.
•So, when the validation data provides a picture of a squirrel, an animal the model hasn’t seen
before, the data scientist can assess how well the algorithm performs in that task. This is a check
against an entirely different dataset than the one it was trained on.
•Based on the accuracy of the predictions after the validation stage, data scientists can adjust
hyperparameters such as learning rate, input features and hidden layers. These adjustments
prevent overfitting, in which the algorithm can make excellent determinations on the training
data, but can't effectively adjust predictions for additional data.
•The opposite problem, underfitting, occurs when the model isn’t complex enough to make
accurate predictions against either training data or new data.
Training Data vs. Validation Data vs. Test Data
Validation data vs. testing data
•Not all data scientists rely on both validation data and testing data. To some degree, both
datasets serve the same purpose: make sure the model works on real data.
•However, there are some practical differences between validation data and testing data. If you
opt to include a separate stage for validation data analysis, this dataset is typically labeled so the
data scientist can collect metrics that they can use to better train the model.
•In this sense, validation data occurs as part of the model training process. Conversely, the model
acts as a black box when you run testing data through it. Thus, validation data tunes the model,
whereas testing data simply confirms that it works.
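As a toy illustration of this division of labour, the sketch below tunes the k of a tiny 1-D nearest-neighbour classifier on a validation set, then confirms the choice once on a held-out test set (all data points and helper names are invented for illustration):

```python
def knn_predict(train_pts, x, k):
    """1-D k-nearest-neighbour majority vote; train_pts is [(x, label), ...]."""
    nearest = sorted(train_pts, key=lambda p: abs(p[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

train_pts = [(0.1, 0), (0.2, 0), (0.35, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
val_pts = [(0.25, 0), (0.5, 1), (0.8, 1)]
test_pts = [(0.15, 0), (0.65, 1)]

def val_accuracy(k):
    return sum(knn_predict(train_pts, x, k) == y for x, y in val_pts) / len(val_pts)

# Tune the hyperparameter k on the validation set...
best_k = max((1, 3, 5), key=val_accuracy)
# ...then the test set is used only once, to confirm the tuned model works
test_acc = sum(knn_predict(train_pts, x, best_k) == y
               for x, y in test_pts) / len(test_pts)
print(best_k, test_acc)  # 1 1.0
```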
Training Data vs. Validation Data vs. Test Data
Splitting the Data into Training and Evaluation Data
•Evaluating the model with the same data that was used for training is not useful,
because it rewards models that can “remember” the training data, as opposed to
generalizing from it.
•A common strategy is to take all available labeled data, and split it into training and
evaluation subsets, usually with a ratio of 70-80 percent for training and 20-30
percent for evaluation.
•The ML system uses the training data to train models to see patterns, and uses the
evaluation data to evaluate the predictive quality of the trained model.
•The ML system evaluates predictive performance by comparing predictions on the
evaluation data set with true values (known as ground truth) using a variety of
metrics. Usually, you use the “best” model on the evaluation subset to make
predictions on future instances for which you do not know the target answer.
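For a classification task, comparing predictions on the evaluation subset with the ground truth can be as simple as computing accuracy; a minimal sketch:

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the true (ground-truth) values."""
    matches = sum(p == t for p, t in zip(predictions, ground_truth))
    return matches / len(ground_truth)

# Model predictions on the evaluation subset vs. the known true labels
preds = [1, 0, 1, 1, 0]
truth = [1, 0, 0, 1, 0]
print(accuracy(preds, truth))  # 0.8
```

Other metrics (precision, recall, mean squared error for regression) follow the same pattern of comparing predicted values against the held-out ground truth.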
Bias-Variance Trade off
It is important to understand prediction errors (bias and variance) when it
comes to accuracy in any machine learning algorithm. There is a tradeoff
between a model’s ability to minimize bias and variance; finding the right
balance, for example when choosing the value of a regularization constant, gives the best solution.
Proper understanding of these errors would help to avoid the overfitting and
underfitting of a data set while training the algorithm.
Bias-Variance Trade off
Bias:
•Bias is the difference between the values predicted by the ML model and the
correct values. High bias gives a large error on both training and testing data.
It is recommended that an algorithm be low-biased to avoid the problem of underfitting.
•With high bias, the predictions fall along a straight line that does not fit the
data in the dataset accurately. Such fitting is known as underfitting of data.
It happens when the hypothesis is too simple or linear in nature.
Bias-Variance Trade off

[Figure: High Bias]
Bias-Variance Trade off
Variance:
•The variability of model predictions for a given data point, which tells us the
spread of our predictions, is called the variance of the model. A model with high
variance fits the training data with a very complex curve and thus cannot fit
accurately data it hasn’t seen before.
•As a result, such models perform very well on training data but have high
error rates on test data. When a model has high variance, it is said
to overfit the data.
•Overfitting means fitting the training set accurately via a complex, high-order
hypothesis, but it is not the solution because the error on unseen data is high.
While training a model, variance should be kept low.
Bias-Variance Trade off

[Figure: High Variance]
Bias-Variance Trade off
•If the algorithm is too simple (a hypothesis with a linear equation), it may have
high bias and low variance and thus be error-prone.
•If the algorithm fits too complex a hypothesis (one with a high-degree equation),
it may have high variance and low bias; in that case it will not perform well on
new data.
•There is a balance between these two conditions, known as the trade-off, or
bias-variance trade-off.
•This tradeoff in complexity is why there is a tradeoff between bias and
variance: an algorithm can’t be more complex and less complex at the same
time.
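The trade-off can be observed empirically, for example by fitting polynomials of increasing degree to noisy data (a sketch assuming NumPy is available; the synthetic sine target and the particular degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 60)  # noisy non-linear target
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def mse(degree):
    """Train/test mean squared error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    tr = float(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    te = float(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    return tr, te

for degree in (1, 3, 15):  # too simple, about right, too complex
    tr, te = mse(degree)
    print(f"degree={degree:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

The degree-1 fit underfits (high bias, high error everywhere), while the degree-15 fit drives training error down but generalizes worse on the test split (high variance).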
Bias-Variance Trade off

The best fit is given by the hypothesis at the trade-off point.


Bias-Variance Trade off

[Figure: The error-vs-complexity graph showing the trade-off]

The trade-off point is the best point to choose when training the algorithm, as it gives
low error on both training and testing data.
Underfitting and Overfitting in Machine Learning
Let us consider that we are designing a machine learning model. A model is
said to be a good machine learning model if it generalizes to any new input data
from the problem domain in a proper way. This helps us make predictions
on future data that the model has never seen.
Now, suppose we want to check how well our machine learning model learns
and generalizes to new data. For that we examine overfitting and underfitting,
which are chiefly responsible for the poor performance of machine
learning algorithms.
Underfitting and Overfitting in Machine Learning
Let’s understand two important terms:
Bias – Assumptions made by a model to make the target function easier to learn.
Variance – If you train a model on training data and obtain a very low error,
but on changing the data and training the same model you experience high
error, that is variance.
Underfitting and Overfitting in Machine Learning
Underfitting:
•A statistical model or a machine learning algorithm is said to have underfitting when
it cannot capture the underlying trend of the data. (It’s just like trying to fit undersized
pants!) Underfitting destroys the accuracy of our machine learning model.
•Its occurrence simply means that our model or algorithm does not fit the data well
enough. It usually happens when we have too little data to build an accurate model, or
when we try to fit a linear model to non-linear data.
•In such cases the rules of the machine learning model are too simple and inflexible to
capture the data, and the model will probably make a lot of wrong predictions.
•Underfitting can be avoided by using more data and by increasing the model’s
complexity, for example through feature engineering.
Underfitting and Overfitting in Machine Learning
Underfitting – High bias and low variance
Techniques to reduce underfitting:
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get
better results.
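For instance, adding a hand-crafted feature can lift a too-simple linear model out of underfitting (a sketch assuming NumPy; the quadratic target is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 50)
y = x ** 2 + rng.normal(0, 0.1, 50)  # quadratic target

# Underfit: a straight line cannot capture the curve
A_lin = np.column_stack([x, np.ones_like(x)])
w_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
mse_lin = float(np.mean((A_lin @ w_lin - y) ** 2))

# Feature engineering: add x**2 as an extra input feature
A_feat = np.column_stack([x ** 2, x, np.ones_like(x)])
w_feat, *_ = np.linalg.lstsq(A_feat, y, rcond=None)
mse_feat = float(np.mean((A_feat @ w_feat - y) ** 2))

print(f"linear MSE={mse_lin:.3f}  with x^2 feature MSE={mse_feat:.3f}")
```

The model is still linear in its parameters; the extra feature gives it enough complexity to capture the underlying trend.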
Underfitting and Overfitting in Machine Learning
Overfitting:
A statistical model is said to be overfitted when we train it with a lot of data (just like
fitting ourselves in oversized pants!). When a model gets trained with so much data,
it starts learning from the noise and inaccurate entries in the data set. The model then
fails to categorize the data correctly because of too many details and noise.
•Overfitting is often caused by non-parametric and non-linear methods, because
these types of machine learning algorithms have more freedom in building the model
from the dataset and can therefore build unrealistic models.
•A solution to avoid overfitting is to use a linear algorithm if we have linear data, or
to tune parameters such as the maximal depth if we are using decision trees.
Underfitting and Overfitting in Machine Learning
Overfitting – High variance and low bias
Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Stop early during the training phase (keep an eye on the loss over the
training period; as soon as the validation loss begins to increase, stop training).
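Early stopping can be sketched as a loop that tracks the validation loss and halts once it has failed to improve for a few epochs (the loss values and the `patience` setting here are illustrative):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs.

    `val_losses` is a sequence of per-epoch validation losses (here simulated);
    in a real setup each value would come from evaluating the model on the
    validation set after one epoch of training.
    """
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving
    return best_epoch, best

# Simulated validation loss: decreases, then rises as the model overfits
losses = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55, 0.63]
print(train_with_early_stopping(losses))  # (3, 0.4)
```

The model state saved at the best epoch (epoch 3 here) is the one kept, just before the rising validation loss signals overfitting.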
Underfitting and Overfitting in Machine Learning
Good Fit in a Statistical Model:
•Ideally, a model that makes predictions with zero error is said to have a good fit on
the data. This situation is achievable at a spot between overfitting and underfitting.
To understand it, we have to look at the performance of our model over time as it
learns from the training dataset.
•Over time, our model keeps learning, and the error on the training and testing data
keeps decreasing. If it learns for too long, the model becomes more prone to
overfitting, due to the presence of noise and less useful details, and its performance
decreases.
•To get a good fit, we stop at a point just before the error starts increasing. At this
point the model is said to perform well on the training dataset as well as on our
unseen testing dataset.
Learning Outcomes
The students have learned and understood the following:
•Training Data vs. Validation Data vs. Test Data
•Splitting of data
•Bias-Variance Trade off
•Under fitting vs over fitting
•References
References
1. Machine Learning for Absolute Beginners by Oliver Theobald. 2019
2. http://noracook.io/Books/Python/introductiontomachinelearningwithpython.pdf
3. https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_tutorial.pdf
Thank you
