
Unit 4: Validating Machine Learning Models

Validating Machine Learning: Checking Out-of-Sample Errors, Getting to Know the Limits of Bias, Keeping Model Complexity in Mind and Solutions Balanced, Training, Validating, and Testing, Resorting to Cross-Validation, Looking for Alternatives in Validation, Optimizing Cross-Validation Choices, Avoiding Sample Bias and Leakage Traps. Simplest Learning Strategies to Learn from Data: Discovering the Incredible Perceptron, Growing Greedy Classification Trees, Taking a Probabilistic Turn

1. Checking Out-of-Sample Errors


Statistics expects that the future won’t differ too much from the past. Thus, you can
base future predictions on past data by employing random sampling theory. If you
select examples randomly, without a criterion, you have a good chance of choosing
a selection of examples that won’t differ much from future examples; in statistical
terms, you can expect that the distribution of your present sample will closely
resemble the distribution of future samples.
As an example, if you receive sales data from just one shop, or only from the shops in a
single region (which is actually a specific sample), the algorithm may not learn how to
forecast the future sales of all the shops in all the regions. The specific sample causes
problems because other shops may be different and follow different rules from the
ones you’re observing.
Ensuring that your algorithm is learning correctly from data is the reason you should
always check what the algorithm has learned from in-sample data (the data used for
training) by testing your hypothesis on some out-of-sample data. Out-of-sample data
is data you didn’t have at learning time, and it should represent the kind of data you
need to create forecasts.
Looking for generalization
Generalization is the capability to learn from the data at hand the general rules that
you can apply to all other data. Out-of-sample data therefore becomes essential to
figuring out whether learning from data is possible, and to what extent.
Classical examples of selection bias, such as these, show that if the selection process
biases a sample, the learning process will have the same bias. However, bias is sometimes
unavoidable and difficult to spot. As an example, when you go fishing with a net,
you can see only the fish you catch, which are the ones that didn’t pass through the net itself.
Another example comes from World War II. At that time, designers constantly
improved U.S. war planes by adding extra armor plating to the parts that took the
most hits upon returning from bombing runs. It took the reasoning of the
mathematician Abraham Wald to point out that designers actually needed to
reinforce the places that didn’t have bullet holes on returning planes. These
locations were likely so critical that a plane hit there didn’t return home, and
consequently no one could observe its damage (a kind of survivorship bias, in which
the survivors skew the data). Survivorship bias is still a problem today.

2. Getting to Know the Limits of Bias


Bias is the difference between the average prediction of our model and the correct
value. A model with high bias pays very little attention to the training data and
oversimplifies the relationship it is trying to learn, which leads to high error on both
training and test data. Regardless of the training sample, or its size, such a model
produces consistent errors.

Figure: High Bias

⚫ Variance is the variability of model predictions for a given data point, which tells
us the spread of our predictions. A model with high variance pays a lot of attention to
training data and does not generalize to data it hasn’t seen before. As a result, such
models perform very well on training data but have high error rates on test data.
Different samples of training data yield different model fits.

Figure: High Variance

⚫ Under-fitting:
A statistical model or a machine learning algorithm is said to underfit when it
cannot capture the underlying trend of the data.
⚫ Under-fitting destroys the accuracy of our machine learning model.
⚫ Its occurrence simply means that our model or algorithm does not fit the data well
enough.
⚫ It usually happens when we have too little data to build an accurate model, or when
we try to fit a linear model to non-linear data. In such cases, the rules of the
machine learning model are too simple and inflexible for such data, and the model
will probably make a lot of wrong predictions.
⚫ Under-fitting can be avoided by using more data and by giving the model more
capacity, for example through feature engineering.
⚫ Techniques to reduce under-fitting:
⚫ Increase model complexity.
⚫ Increase the number of features, performing feature engineering.
⚫ Remove noise from the data.
⚫ Increase the number of epochs, or the duration of training, to get better results.
⚫ Overfitting:
A statistical model is said to be overfitted when we train it with a lot of data. When a
model gets trained with so much data, it starts learning from the noise and inaccurate
data entries in the data set. The model then fails to categorize the data correctly,
because of too many details and noise.
⚫ Overfitting is often caused by non-parametric and non-linear methods, because these
types of machine learning algorithms have more freedom in building the model from
the dataset and can therefore build unrealistic models.

⚫ A solution to avoid overfitting is to use a linear algorithm if we have linear data, or
to use parameters such as the maximal depth if we are using decision trees.
⚫ Techniques to reduce overfitting (see the sketch after this list):
⚫ Increase training data.
⚫ Reduce model complexity.
⚫ Early stopping during the training phase (monitor the loss during training and stop
as soon as it begins to increase).
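To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available (the synthetic sine data and the chosen polynomial degrees are illustrative assumptions): a degree-1 model underfits the nonlinear data (high bias), while a degree-15 model overfits it (high variance).

```python
# Under- vs overfitting: fit polynomials of increasing degree to noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy nonlinear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")
```

Typically, the degree-1 model scores poorly on both sets, while the degree-15 model scores much better on training than on test data, the signature of overfitting.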
3. Training, Testing and Validation
Training Set
This is the actual dataset from which a model learns, i.e., the model sees and learns from this
data to predict the outcome or to make the right decisions. Most training data is collected
from several sources and then pre-processed and organized to support proper performance of
the model. The type of training data largely determines the model's ability to generalize:
the better the quality and diversity of the training data, the better the performance of the
model. This data is usually more than 60% of the total data available for the project.

Testing Set
This dataset is independent of the training set but has a similar probability
distribution of classes, and it is used as a benchmark to evaluate the model, only after
training is complete. The testing set is usually a properly organized dataset containing all
kinds of data for scenarios that the model will probably face when used in the real
world. Often the validation and testing sets are combined and used as a single testing set,
which is not considered good practice. If the accuracy of the model on training data is
greater than that on testing data, the model is said to be overfitting. This data is
approximately 20-25% of the total data available for the project.
Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered
part of the training process. The model only sees this data for evaluation and does not learn
from it, so it provides an objective, unbiased evaluation of the model. The validation set can
also be used for regularization, via early stopping: training is interrupted when the loss on
the validation set becomes greater than the loss on the training set, which helps balance bias
and variance. This data is approximately 10-15% of the total data available for the project,
but this can change depending on the number of hyperparameters: if the model has many
hyperparameters, a larger validation set will give better results. Whenever the accuracy of
the model on validation data is greater than that on training data, the model is said to have
generalized well.
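As a concrete illustration of these proportions, here is a minimal sketch, assuming scikit-learn (the Iris dataset is just a stand-in for your own data), that produces a 60/20/20 train/validation/test split by splitting twice:

```python
# Split the data into 60% train, 20% validation, and 20% test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
# Then carve 20% of the total (0.25 of the remaining 80%) out as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30 on Iris (150 rows)
```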

4. Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the dataset and
then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.

Methods of Cross Validation


Validation
In this method, we perform training on 50% of the given dataset and use the remaining 50%
for testing. The major drawback of this method is that because we train on only 50% of the
dataset, the remaining 50% may contain important information that we are leaving out while
training our model, i.e., higher bias.

LOOCV (Leave One Out Cross Validation)

In this method, we perform training on the whole dataset but leave out only one data point,
and we iterate this for each data point. It has some advantages as well as disadvantages.
An advantage of using this method is that we make use of all data points, and hence the
estimate has low bias.
The major drawback of this method is that it leads to higher variance in the testing
estimate, because we test against a single data point each time. If that data point is an
outlier, it can lead to higher variance.
Another drawback is that it takes a lot of execution time, as it iterates as many times as
there are data points.
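A minimal LOOCV sketch, assuming scikit-learn (the Iris data and logistic regression model are illustrative stand-ins):

```python
# LOOCV: the model is trained n times, each time testing on one left-out point.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # 150 single-point tests, then their average
```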

K-Fold Cross Validation


In this method, we split the dataset into k subsets (known as folds), then we perform
training on k-1 of the subsets and leave one subset for the evaluation of the trained
model. We iterate k times, with a different subset reserved for testing each time.
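A minimal k-fold sketch along the same lines, again assuming scikit-learn, with k = 5 (the dataset and classifier are illustrative choices):

```python
# K-fold: each fold serves once as the test set; the other k-1 folds train.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores, scores.mean())  # five per-fold accuracies and their average
```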

Advantages of train/test split:


1. A single train/test split runs K times faster than K-fold cross-validation, because
K-fold repeats the train/test procedure K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and testing.
5. Hyperparameters
A hyperparameter is a parameter that is set before the learning process begins. These
parameters are tunable and can directly affect how well a model trains. Some examples of
hyperparameters in machine learning:
1. Learning Rate
2. Number of Epochs
3. Momentum
4. Regularization constant
5. Number of branches in a decision tree
6. Number of clusters in a clustering algorithm (like k-means)

Optimizing Hyperparameters


Hyperparameters can have a direct impact on the training of machine learning algorithms.
Thus, in order to achieve maximal performance, it is important to understand how to optimize
them. Here are some common strategies for optimizing hyperparameters:
1. Grid Search: Search a set of manually predefined hyperparameter values for the best
performing combination, and use that value. (This is the traditional method.)
2. Random Search: Similar to grid search, but replaces the exhaustive search with random
sampling. This can outperform grid search when only a small number of hyperparameters
actually matter for the algorithm's performance.
3. Bayesian Optimization: Builds a probabilistic model of the function mapping from
hyperparameter values to the objective evaluated on a validation set.
4. Gradient-Based Optimization: Computes the gradient of the validation objective with
respect to the hyperparameters and then optimizes them using gradient descent.
5. Evolutionary Optimization: Uses evolutionary algorithms (e.g., genetic algorithms) to
search the space of possible hyperparameters.
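As a small illustration of strategy 2, here is a sketch, assuming scikit-learn and SciPy are available (the SVM classifier, Iris data, and sampling ranges are illustrative assumptions), that scores 20 random hyperparameter combinations instead of exhausting a grid:

```python
# Random search: sample hyperparameter values from distributions, score by CV.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```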

6. Discovering the Incredible Perceptron

The perceptron is an iterative algorithm that strives to determine, by successive and reiterative
approximations, the best set of values for a vector w, also called the coefficient vector.
Vector w can help predict the class of an example when you multiply it by the matrix of
features X (containing the information in numeric values) and then add a constant term,
called the bias. The output is a prediction in the sense that these operations yield a
number whose sign should predict the class of each example exactly.

Connectionism is the approach to machine learning that is based on neuroscience as well as the
example of biologically interconnected networks. You can retrace the roots of connectionism to
the perceptron.

The perceptron model is a more general computational model than the McCulloch-Pitts neuron. It
takes an input, aggregates it (as a weighted sum), and returns 1 only if the aggregated sum is
more than some threshold, else it returns 0. Rewriting the threshold as a constant input with a
variable weight (the bias), we end up with the model shown in the following figure.

Figure: Perceptron
A single perceptron can only be used to implement linearly separable functions. It
takes both real and boolean inputs and associates a set of weights with them, along with a bias
(the threshold mentioned above). We learn the weights, and we get the function. Let's use a
perceptron to learn an OR function, as in the sketch below.
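Here is a minimal from-scratch sketch of that idea (the learning rate and epoch count are arbitrary illustrative choices): the perceptron rule nudges w and the bias only when a prediction is wrong, and for the linearly separable OR function it converges quickly.

```python
# A from-scratch perceptron that learns the boolean OR function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # boolean inputs
y = np.array([0, 1, 1, 1])                      # OR targets

w = np.zeros(2)   # coefficient vector
b = 0.0           # bias (the rewritten threshold)
lr = 0.1          # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = int(np.dot(w, xi) + b > 0)   # step activation on the weighted sum
        w += lr * (target - pred) * xi      # update weights only on mistakes
        b += lr * (target - pred)           # update bias only on mistakes

print([int(np.dot(w, xi) + b > 0) for xi in X])  # [0, 1, 1, 1]
```

After a few epochs the learned weights classify all four OR inputs correctly.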

7. Exploring the Space of Hyperparameters


The possible combinations of values that hyperparameters may form make it hard to decide where
to look for optimizations. As in gradient descent, an optimization space may contain value
combinations that perform better or worse. Even after you find a good combination, you’re not
assured that it’s the best option.
As a practical way of solving this problem, the best way to verify hyperparameters for an
algorithm applied to specific data is to test them all by cross-validation and pick the best
combination. This simple approach, called grid search, offers indisputable advantages: it
allows you to sample the range of possible values to input into the algorithm systematically
and to spot when the general minimum happens. On the other hand, grid search also has
serious drawbacks: it’s computationally intensive (although you can easily perform this task in
parallel on modern multicore computers) and quite time consuming. Moreover, systematic and
intensive tests increase the possibility of incurring error, because some good but fake
validation results can be caused by noise present in the dataset.
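A minimal grid-search sketch, assuming scikit-learn (the SVM, Iris data, and grid values are illustrative): every combination is scored by cross-validation, and n_jobs=-1 exploits the parallelism mentioned above.

```python
# Grid search: exhaustively score every combination in a predefined grid by CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)  # folds run in parallel
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```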

8. Avoiding Sample Bias and Leakage Traps


One remedy for sample bias, called ensembling of predictors, works well when your training
sample is not completely distorted and its distribution differs from the out-of-sample
distribution, but not in an irremediable way, such as when all your classes are present but
not in the right proportions (as an example). In such cases, your results are affected by a
certain variance of the estimates that you can possibly stabilize in one of several ways: by
resampling, as in bootstrapping; by subsampling (taking a sample of the sample); or by using
smaller samples (which increases bias). In most cases, this approach proves to be correct and
improves your machine learning predictions a lot. When your problem is bias and not variance,
using ensembling really doesn’t cause harm unless you subsample too few examples. A good rule
of thumb for subsampling is to take a sample of 70 to 90 percent of the original in-sample
data. If you want to make ensembling work, you should do the following:
1. Iterate a large number of times through your data and models (from a minimum of three
iterations to ideally hundreds).
2. Every time you iterate, subsample (or bootstrap) your in-sample data.
3. Fit the machine learning model on the resampled data and predict the out-of-sample
results. Store those results for later use.
4. At the end of the iterations, for every out-of-sample case you want to predict, take all
its predictions and average them if you are doing a regression; take the most frequent class
if you are doing a classification.
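Here is a minimal sketch of this recipe, assuming scikit-learn and NumPy (the synthetic data, decision-tree learner, 50 iterations, and 80 percent subsample are illustrative choices):

```python
# Subsampling ensemble: train many models on subsamples, then majority-vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, random_state=0)

votes = []
rng = np.random.RandomState(0)
for i in range(50):                                    # step 1: many iterations
    idx = rng.choice(len(X_in), size=int(0.8 * len(X_in)),
                     replace=False)                    # step 2: subsample
    model = DecisionTreeClassifier(random_state=i).fit(X_in[idx], y_in[idx])
    votes.append(model.predict(X_out))                 # step 3: store predictions

majority = (np.mean(votes, axis=0) > 0.5).astype(int)  # step 4: most frequent class
print("ensemble accuracy:", (majority == y_out).mean())
```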

Snooping happens when you observe the out-of-sample data too much and adapt to it too often.
In short, snooping is a kind of overfitting, and not just on the training data but also on the
test data, making the overfitting problem itself harder to detect until you get fresh data.
Usually, you realize that the problem is snooping when you have already applied the machine
learning algorithm to your business or to a public service, making the problem an issue that
everyone can see.

When operating on the data, take care to neatly separate training, validation, and test data.
Also, when processing, never take any information from validation or test data, not even the
simplest and most innocent-looking examples. Worse still is to apply a complex transformation
using all the data. In finance, for instance, it is well known that calculating the mean and
the standard deviation (which can actually tell you a lot about market conditions and risk)
from all training and testing data can leak precious information into your models. When leakage
happens, machine learning algorithms seem to perform well on the test set but fail on truly
out-of-sample data from the markets, which means that they didn’t work at all, thereby causing
a loss of money.
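One practical safeguard, sketched below assuming scikit-learn (the dataset and model are illustrative), is to put preprocessing inside a pipeline so that scaling statistics are always computed on the training folds only:

```python
# Leakage-safe preprocessing: the scaler's mean and standard deviation are
# learned from the training folds only, never from validation or test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Scaling X with statistics computed on ALL data before splitting would leak
# test information into training; the pipeline instead refits the scaler
# inside each cross-validation split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())
```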

9. Discovering the Incredible Perceptron

If you can’t divide two classes spread across two or more dimensions by any line or plane,
they’re nonlinearly separable. Overcoming nonlinear separability is one of the challenges that
machine learning has to meet in order to become effective against complex problems based on
real data, not just on artificial data created for academic purposes.
10. Growing Greedy Classification Trees
Decision Tree Induction
A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree that helps us in decision-making. The decision tree creates
classification or regression models as a tree structure. It separates a dataset into smaller
subsets, and at the same time the decision tree is steadily developed. The final tree has
decision nodes and leaf nodes. A decision node has at least two branches. The leaf nodes
represent a classification or decision; no further splits are possible on leaf nodes. The
uppermost decision node in a tree, which corresponds to the best predictor, is called the
root node. Decision trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy is a common way to measure impurity. In a decision tree, it measures the
randomness or impurity in the dataset.
Information Gain:
Information gain is the decline in entropy after the dataset is split on an attribute. It is
also called entropy reduction. Building a decision tree is all about discovering the
attributes that return the highest information gain.
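A minimal sketch of both quantities in plain Python (the class counts below are invented for illustration):

```python
# Entropy of a label distribution and information gain of a candidate split.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 9 + ["no"] * 5     # e.g., 9 Play / 5 Don't Play
left   = ["yes"] * 6 + ["no"] * 1     # a hypothetical split into two subsets
right  = ["yes"] * 3 + ["no"] * 4
print(round(entropy(parent), 3))                        # ~0.940
print(round(information_gain(parent, [left, right]), 3))
```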
Advantages of using decision trees:
A decision tree does not need scaling of the data.
Missing values in the data do not influence the process of building a decision tree to any
considerable extent.
A decision tree model is intuitive and simple to explain to the technical team as well as to
stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during
pre-processing.
A decision tree does not require normalization of the data.
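A minimal sketch, assuming scikit-learn (the Iris data and max_depth=3 are illustrative choices), of growing such a tree greedily; criterion="entropy" makes each split chase the highest information gain:

```python
# Grow a greedy classification tree whose splits maximize information gain.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```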

11. Taking a Probabilistic Turn

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

    P(A|B) = P(B|A) · P(A) / P(B)

Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of the Naïve Bayes' Classifier:
The working of the Naïve Bayes' classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play on a particular day according to
the weather conditions. To solve this problem, we follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
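A minimal sketch of these three steps in plain Python, on a small hypothetical weather dataset (the ten observations are invented for illustration):

```python
# Frequency table -> likelihoods -> posterior via Bayes' theorem.
from collections import Counter

weather = ["Sunny", "Rainy", "Sunny", "Overcast", "Rainy",
           "Sunny", "Overcast", "Rainy", "Sunny", "Overcast"]
play    = ["No", "No", "Yes", "Yes", "No",
           "Yes", "Yes", "No", "Yes", "Yes"]

# Step 1: frequency tables of (feature, class) pairs and of classes.
freq = Counter(zip(weather, play))
cls = Counter(play)

# Steps 2-3: likelihood P(Sunny|Yes), prior P(Yes), evidence P(Sunny),
# then posterior P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny).
likelihood = freq[("Sunny", "Yes")] / cls["Yes"]
prior = cls["Yes"] / len(play)
evidence = weather.count("Sunny") / len(weather)
posterior = likelihood * prior / evidence
print(f"P(Yes | Sunny) = {posterior:.2f}")  # (3/6) * (6/10) / (4/10) = 0.75
```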
