⚫ Variance is the variability of a model's prediction for a given data point, or a value that tells
us the spread of our data. A model with high variance pays a lot of attention to the training
data and does not generalize to data it has not seen before. As a result, such models perform
very well on training data but have high error rates on test data. Different samples of training
data yield different model fits.
⚫ Under-fitting:
A statistical model or a machine learning algorithm is said to under-fit when it cannot
capture the underlying trend of the data.
⚫ Under-fitting destroys the accuracy of our machine learning model.
⚫ Its occurrence simply means that our model or algorithm does not fit the data well enough.
⚫ It usually happens when we have too little data to build an accurate model, or when we try
to fit a linear model to non-linear data. In such cases the rules of the machine learning model
are too simple and flexible to be applied to such minimal data, and therefore the model will
probably make a lot of wrong predictions.
⚫ Under-fitting can be avoided by using more data and by reducing the features through
feature selection.
⚫ Techniques to reduce under-fitting:
⚫ Increase model complexity
⚫ Increase the number of features by performing feature engineering.
⚫ Remove noise from the data.
⚫ Increase the number of epochs or increase the duration of training to get better results.
⚫ Overfitting:
A statistical model is said to be over-fitted when we train it with a lot of data. When a
model gets trained with so much data, it starts learning from the noise and inaccurate
data entries in our data set. The model then fails to categorize the data correctly, because
of too many details and noise.
⚫ The causes of overfitting are the non-parametric and non-linear methods, because these
types of machine learning algorithms have more freedom in building the model based on
the dataset and can therefore build quite unrealistic models.
⚫ A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to limit
parameters such as the maximal depth if we are using decision trees.
⚫ Techniques to reduce overfitting:
⚫ Increase training data.
⚫ Reduce model complexity.
⚫ Early stopping during the training phase (keep an eye on the loss over the training
period; as soon as the loss begins to increase, stop training), as sketched below.
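As an illustrative sketch (not part of the original notes), scikit-learn's MLPClassifier can apply early stopping by holding out part of the training data and stopping once the validation score stops improving; the dataset and network size below are placeholders:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# early_stopping=True holds out validation_fraction of the training data and
# stops once the validation score has not improved for n_iter_no_change epochs.
clf = MLPClassifier(hidden_layer_sizes=(32,), early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=5,
                    max_iter=500, random_state=0)
clf.fit(X, y)
print("epochs actually run:", clf.n_iter_)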
3. Training, Testing and Validation
Training Set
This is the actual dataset on which the model trains, i.e. the model sees and learns from this
data to predict the outcome or to make the right decisions. Most training data is collected from
several sources and then pre-processed and organized to ensure proper performance of the
model. The type of training data largely determines the model's ability to generalize, i.e. the
better the quality and diversity of the training data, the better the performance of the model.
This data is usually more than 60% of the total data available for the project.
Testing Set
This dataset is independent of the training set but has a broadly similar class distribution, and
it is used as a benchmark to evaluate the model only after training is complete. The testing set
is usually a properly organized dataset containing data for all kinds of scenarios that the model
is likely to face when used in the real world. Often the validation and testing sets are combined
and used as a single testing set, which is not considered good practice. If the accuracy of the
model on the training data is greater than that on the testing data, the model is said to be
overfitting. This data is approximately 20-25% of the total data available for the project.
Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered a
part of the training of the model. The model only sees this data for evaluation and does not learn
from it, which provides an objective, unbiased evaluation of the model. The validation dataset
can also be used for regularization, by interrupting training when the loss on the validation set
becomes greater than the loss on the training set, i.e. balancing bias and variance. This data is
approximately 10-15% of the total data available for the project, but this can change depending
on the number of hyperparameters, i.e. if the model has many hyperparameters, a larger
validation set will give better results. Whenever the accuracy of the model on the validation
data is close to (or greater than) its accuracy on the training data, the model is said to have
generalized well.
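A minimal split sketch, assuming scikit-learn is available; the 60/20/20 proportions and the Iris data are purely illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set (20% of the data)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# ...then split the remainder into training and validation (60% / 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 90 / 30 / 30 samples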
4. Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the data-set and
then evaluate it using the complementary subset of the data-set.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the data-set.
2. Train the model using the rest of the data-set.
3. Test the model using the reserved portion of the data-set.
In the leave-one-out variant of this method, we perform training on the whole data-set except
a single data point, test on that point, and then repeat for each data point. It has advantages
as well as disadvantages.
An advantage of using this method is that we make use of all data points, and hence it has low
bias.
The major drawback of this method is that it leads to higher variation in the test estimate, as
we are testing against a single data point each time. If that data point is an outlier, it can lead
to even higher variation.
Another drawback is that it takes a lot of execution time, as it iterates as many times as there
are data points.
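A hedged leave-one-out sketch, assuming scikit-learn; the logistic-regression classifier and the Iris data are only placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fit per data point: train on n-1 samples, test on the single held-out one.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean(), "over", len(scores), "fits")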
The perceptron is an iterative algorithm that strives to determine, by successive
approximations, the best set of values for a vector w, which is also called the coefficient vector.
Vector w can help predict the class of an example when you multiply it by the matrix of features
X (containing the information in numeric values) and then add a constant term, called the
bias. The output is a prediction in the sense that these operations produce a number whose sign
should be able to predict the class of each example exactly.
Connectionism is the approach to machine learning that is based on neuroscience as well as the
example of biologically interconnected networks. You can trace the roots of connectionism back
to the perceptron.
The perceptron model is a more general computational model than the McCulloch-Pitts neuron.
It takes an input, aggregates it (weighted sum) and returns 1 only if the aggregated sum is more
than some threshold, else it returns 0. Rewriting the threshold as shown above and making it a
constant input with a variable weight, we end up with something like the following.
Figure: Perceptron
A single perceptron can only be used to implement linearly separable functions. It takes both
real and boolean inputs and associates a set of weights with them, along with a bias (the
threshold mentioned above). Once we learn the weights, we get the function. Let's use a
perceptron to learn an OR function.
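As a hedged illustration (not the original figure or worked example), a minimal NumPy perceptron trained on the OR truth table might look like this:

import numpy as np

# Perceptron learning of the boolean OR function (illustrative sketch).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
y = np.array([0, 1, 1, 1])                      # OR targets

w = np.zeros(2)  # weights
b = 0.0          # bias (plays the role of the threshold)
lr = 0.1         # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0  # step activation
        error = target - pred
        w += lr * error * xi               # perceptron update rule
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [1 if xi @ w + b > 0 else 0 for xi in X])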
Snooping consists of observing the out-of-sample data too much and adapting to it too often. In
short, snooping is a kind of overfitting, not just on the training data but also on the test data,
which makes the overfitting problem itself harder to detect until you get fresh data. Usually you
realize that the problem is snooping when you have already applied the machine learning
algorithm to your business or to a service for the public, making the problem an issue that
everyone can see.
When operating on the data, take care to neatly separate training, validation, and test data.
Also, when processing, never take any information from the validation or test sets, not even the
simplest and most innocent-looking statistics. Worse still is to apply a complex transformation
using all the data. In finance, for instance, it is well known that calculating the mean and the
standard deviation (which can actually tell you a lot about market conditions and risk) from the
combined training and testing data can leak precious information into your models. When
leakage happens, machine learning algorithms appear to perform well on the test set but not on
truly out-of-sample data from the markets, which means that they didn't work at all, thereby
causing a loss of money.
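As an illustrative sketch of keeping the splits separate (standardization is just one example of such a transformation), statistics like the mean and standard deviation should be computed on the training split only:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Correct: learn the mean and standard deviation from the training data only...
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # ...and only apply them to the test data.

# Leaky (avoid): StandardScaler().fit(X) on all the data lets test-set
# statistics influence the transformation applied during training.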
If you can't divide two classes spread over two or more dimensions by any line or plane, they are
nonlinearly separable. Overcoming nonlinear separability is one of the challenges that machine
learning has to meet in order to become effective against complex problems based on real data,
not just on artificial data created for academic purposes.
10. Growing Greedy Classification Trees
Decision Tree Induction
A decision tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree that helps us in decision-making. The decision tree creates
classification or regression models as a tree structure: it separates a data set into smaller and
smaller subsets while, at the same time, the tree is steadily developed. The final tree is a tree
with decision nodes and leaf nodes. A decision node has at least two branches, and a leaf node
shows a classification or decision; we cannot split leaf nodes any further. The uppermost
decision node in a tree, which corresponds to the best predictor, is called the root node. Decision
trees can deal with both categorical and numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called
entropy reduction. Building a decision tree is all about discovering the attributes that return
the highest information gain.
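A small illustrative computation (not part of the original notes) of entropy and information gain for a single binary split, using NumPy; the label counts are made up:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array (a measure of impurity).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Decline in entropy after splitting `parent` into `left` and `right`.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Example: 14 labels (9 yes, 5 no) split by some attribute into two branches.
parent = np.array([1] * 9 + [0] * 5)  # entropy is about 0.940
left = np.array([1] * 6 + [0] * 1)
right = np.array([1] * 3 + [0] * 4)
print(round(information_gain(parent, left, right), 3))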
Advantages of using decision trees:
A decision tree does not need scaling of the data.
Missing values in the data also do not influence the process of building a decision tree to any
considerable extent.
A decision tree model is intuitive and simple to explain to the technical team as well as to
stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-
processing.
A decision tree does not require standardization of the data.
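A minimal sketch (assuming scikit-learn) of growing a greedy classification tree; criterion="entropy" ties each split to the information gain described above, and max_depth caps tree growth:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Each split greedily maximizes information gain; limiting max_depth also
# helps against overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=list(data.feature_names)))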
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether or not we should play on a particular day
according to the weather conditions. To solve this problem, we need to follow the steps below
(a short sketch follows the list):
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
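As a hedged sketch of these three steps (the tiny outlook/Play table below is made up for illustration), a hand-rolled Naive Bayes calculation might look like this:

from collections import Counter

# Step 1: frequencies of (Outlook, Play) observations (illustrative data).
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

play_counts = Counter(play for _, play in data)                    # for P(Play)
joint_counts = Counter((outlook, play) for outlook, play in data)  # frequency table

def posterior(outlook, play):
    # Steps 2 and 3: likelihood * prior; the shared evidence P(outlook) cancels
    # when comparing the two classes, so it is omitted here.
    prior = play_counts[play] / len(data)
    likelihood = joint_counts[(outlook, play)] / play_counts[play]
    return likelihood * prior

for play in ("Yes", "No"):
    print(f"P({play} | Sunny) is proportional to", round(posterior("Sunny", play), 3))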