
Module-III

1. What is Cross-Validation?


Cross validation is a technique used in machine learning to evaluate the performance of a model on
unseen data. It involves dividing the available data into multiple folds or subsets, using one of these
folds as a validation set, and training the model on the remaining folds. This process is repeated
multiple times, each time using a different fold as the validation set. Finally, the results from each
validation step are averaged to produce a more robust estimate of the model’s performance. Cross
validation is an important step in the machine learning process and helps to ensure that the model
selected for deployment is robust and generalizes well to new data.

What is cross-validation used for?

The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the model’s
generalization performance, i.e., its ability to perform well on new, unseen data.

Types of Cross-Validation

There are several types of cross-validation techniques, including holdout validation, leave-one-out cross-validation (LOOCV), stratified cross-validation, and k-fold cross-validation. The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modelling problem.

1. Holdout Validation

In holdout validation, we train the model on 50% of the given dataset and use the remaining 50% for testing. It is a simple and quick way to evaluate a model. The major drawback of this method is that, because training uses only 50% of the dataset, the held-out 50% may contain important information that the model never sees during training, i.e., higher bias.

2. LOOCV (Leave One Out Cross Validation)

In this method, we train on the whole dataset except for a single held-out data point, and we iterate this for each data point. In LOOCV, the model is trained on n-1 samples and tested on the one omitted sample, repeating this process for each of the n data points in the dataset. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, hence it has low bias.
The major drawback is that it leads to higher variance in the test estimate, since each test is made against a single data point; if that data point is an outlier, it can lead to high variation. Another drawback is execution time, because the model must be trained as many times as there are data points.

3. Stratified Cross-Validation

It is a technique used in machine learning to ensure that each fold of the cross-validation process
maintains the same class distribution as the entire dataset. This is particularly important when
dealing with imbalanced datasets, where certain classes may be underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each
fold.
2. During each iteration, one fold is used for testing, and the remaining folds are used for training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well to unseen
data.
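As a minimal sketch (assuming scikit-learn is available; the synthetic dataset and parameter values below are illustrative only), scikit-learn's StratifiedKFold performs exactly these steps while preserving the class ratio in every fold:

```python
# Illustrative sketch: stratified 5-fold split on an imbalanced synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Roughly 80/20 class imbalance (illustrative values).
X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold preserves approximately the same class proportions as y.
    print(f"Fold {fold}: test-fold class counts =", np.bincount(y[test_idx]))
```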

4. K-Fold Cross Validation

In k-fold cross-validation, we split the dataset into k subsets (known as folds), train the model on k-1 of the folds, and hold out the remaining fold for evaluating the trained model. We iterate this k times, reserving a different fold for testing each time.

Note: A value of k = 10 is often suggested; a very low value of k behaves more like a simple holdout split, while a very high value of k approaches the LOOCV method.

Example of K Fold Cross Validation


The listing below shows an example of the training subsets and evaluation subsets generated in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances 0-4 for testing and 5-24 for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances 5-9 for testing and 0-4 plus 10-24 for training), and so on.

Total instances: 25
Value of k: 5

Iteration   Training set observations                                Testing set observations

1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]

2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]

3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]


4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]

5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
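The split listed above can be reproduced with scikit-learn's KFold; this is a minimal sketch assuming 25 instances indexed 0-24 and no shuffling.

```python
# Reproduce the 5-fold split over 25 instances (indices 0-24) listed above.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)          # stand-in for 25 training instances
kf = KFold(n_splits=5)     # no shuffling, so each fold is a consecutive block
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```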

2. Write a short note on:

a. Gini Impurity
b. Entropy
c. Information Gain

a. Gini Impurity:
Definition:
Gini Impurity is a measure of how often a randomly chosen element from a set would be
incorrectly labeled if it was randomly labeled according to the distribution of labels in the set.
It is commonly used in decision tree algorithms for classification problems.
Formula:

\( \text{Gini}(D) = 1 - \sum_{i=1}^{c} p_i^2 \)

Where:
- D is the dataset.
- c is the number of classes.
- pi is the probability of choosing an element of class i.

Interpretation:
A Gini score of 0 indicates perfect purity, meaning all elements in the set belong to the same class, while higher scores indicate greater impurity, approaching the maximum when the classes are evenly mixed.
b. Entropy:
Definition:
Entropy is a measure of disorder or uncertainty in a set of data. In the context of decision
trees, it is used to quantify the amount of information contained in the dataset. It is often
employed as a criterion for splitting nodes in decision trees.
Formula:

\( \text{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i \)

Where:
- D is the dataset.
- c is the number of classes.
- pi is the probability of choosing an element of class i.
Interpretation:
A lower entropy indicates a more ordered and pure dataset, while higher entropy suggests
greater disorder and uncertainty.

c. Information Gain:
Definition:
Information Gain is a metric used in decision tree algorithms to determine the effectiveness of
a particular attribute in classifying the data. It measures the reduction in entropy or Gini
impurity achieved by splitting the data based on a given attribute.

Formula (for Entropy):

\( \text{Gain}(D, A) = \text{Entropy}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|}\, \text{Entropy}(D_v) \)

Where:
- D is the dataset.
- A is the attribute based on which the dataset is split.
- V is the number of values of attribute (A).
- Dv is the subset of (D) for which attribute (A) has the v-th value.

Interpretation:
Higher Information Gain suggests that the attribute is more effective in reducing uncertainty
and better at classifying the data.
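A minimal NumPy sketch of the three measures above (the helper functions and the toy label arrays are illustrative, not part of any library API):

```python
# Sketch of Gini impurity, entropy, and information gain using plain NumPy.
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, subsets):
    """Gain = Entropy(D) - weighted average entropy of the subsets D_v."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])                 # 3 positives, 5 negatives
split = [np.array([1, 1, 1, 0]), np.array([0, 0, 0, 0])]    # split by some attribute
print(gini(parent), entropy(parent), information_gain(parent, split))
```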
3. Define Decision Tree. Explain training and visualizing a Decision Tree.
Decision Tree:
Definition:
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It recursively splits the dataset into subsets based on the most significant
attribute at each node. The goal is to create a tree-like model where each internal node
represents a decision based on an attribute, each branch represents the outcome of the
decision, and each leaf node represents the final prediction or classification.
Training a Decision Tree:
1. Selecting Attributes:
- Choose the most appropriate attribute to split the data. The selection is often based on
metrics like Information Gain or Gini Impurity for classification problems and variance
reduction for regression problems.
2. Splitting:
- Split the dataset into subsets based on the chosen attribute. Each subset corresponds to a
unique value of the chosen attribute.

3. Recursive Process:
- Repeat the process recursively for each subset until a stopping condition is met. Stopping
conditions may include reaching a maximum depth, having a minimum number of samples in
a node, or achieving perfect purity.
4. Labeling Leaves:
- Assign a class label (for classification) or a predicted value (for regression) to each leaf
node based on the majority class or average value of the samples in that leaf.
Visualizing a Decision Tree:
1. Graphical Representation:
- Decision Trees can be visualized graphically, making it easy to interpret the decision-
making process. Each node in the tree represents a decision, and each branch represents an
outcome.
2. Node Representation:
- Nodes are annotated with the attribute and the threshold (for numerical attributes) or the
value (for categorical attributes) used for splitting.
3. Leaf Nodes:
- Leaf nodes display the predicted class or value. The size of the leaf node may be
proportional to the number of samples it represents.
4. Visualization Tools:
- Python libraries like `scikit-learn` provide functions to visualize decision trees using tools like Graphviz (a minimal sketch follows after this list). Visualization helps in understanding the structure of the tree, identifying important features, and explaining the decision-making process to stakeholders.
5. Interpretability:
- Decision Trees are inherently interpretable, and visualizing them enhances their
interpretability. It allows users to trace the decision path from the root to a leaf and understand
the criteria for classification or prediction.
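As a minimal sketch of training and visualizing a decision tree with scikit-learn (the iris dataset and the depth limit are illustrative choices, not requirements):

```python
# Train a decision tree on the iris dataset and visualize it.
# Assumes scikit-learn and matplotlib are installed.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # limit depth for readability
clf.fit(iris.data, iris.target)

plt.figure(figsize=(10, 6))
plot_tree(clf,
          feature_names=iris.feature_names,   # split attribute and threshold per node
          class_names=iris.target_names,      # predicted class shown in each leaf
          filled=True)
plt.show()
```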
4. What is Boosting? Explain AdaBoost in detail.
Boosting:
Boosting is an ensemble learning technique in machine learning that combines the predictions of
multiple weak learners (models that are slightly better than random chance) to create a strong learner.
The key idea behind boosting is to sequentially train weak models and give more weight to the
instances that are misclassified or have higher errors in the previous models. This helps in focusing on
the difficult-to-classify examples, improving overall model performance.

AdaBoost (Adaptive Boosting):


AdaBoost is one of the most popular boosting algorithms. It was introduced by Yoav Freund and
Robert Schapire in 1996. AdaBoost works by iteratively training a series of weak learners on the
dataset. After each iteration, it assigns higher weights to the misclassified instances, making them
more influential in the subsequent model training. The final strong model is a weighted sum of the
weak learners' predictions.
AdaBoost Algorithm:
1. Initialize Weights:
- Assign equal weights to all instances in the training set.
2. Iterative Training:
- For each iteration, train a weak learner (e.g., a decision tree) on the current weighted dataset.
3. Compute Error:
- Compute the error of the weak learner by summing the weights of the misclassified instances.
4. Compute Model Weight:
- Compute the weight of the weak learner based on its error. More accurate models get higher
weights.
5. Update Weights:
- Increase the weights of the misclassified instances, making them more influential for the next
iteration.
6. Repeat:
- Repeat steps 2-5 for a predefined number of iterations or until a satisfactory performance is
achieved.
7. Final Model:
- Combine the weak learners into a strong model by assigning weights to their predictions based on
their individual performance.
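A minimal sketch of the procedure above using scikit-learn's AdaBoostClassifier with decision stumps as weak learners (the synthetic dataset and hyperparameter values are illustrative; note that in older scikit-learn versions the weak-learner argument is named base_estimator rather than estimator):

```python
# Sketch of AdaBoost with depth-1 decision trees ("stumps") as weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner
    n_estimators=100,       # number of boosting iterations (steps 2-5 above)
    learning_rate=0.5,      # scales each learner's weight in the final vote
    random_state=0,
)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```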
Predictions with AdaBoost:
To make predictions using AdaBoost, the predictions from each weak learner are combined, and the
final output is determined by a weighted majority vote. The weights are assigned based on the
accuracy of each weak learner.
Key Advantages of AdaBoost:
1. Adaptability:
- AdaBoost adapts over iterations by assigning more weight to misclassified instances, focusing on
difficult-to-classify examples.
2. Robustness:
- It is less prone to overfitting compared to individual weak learners.
3. Versatility:
- AdaBoost can be used with various base learners, making it versatile for different types of weak
models.
4. High Accuracy:
- AdaBoost often achieves high accuracy by combining the strengths of multiple weak learners.
However, AdaBoost can be sensitive to noisy data and outliers, as it tries to fit them during the
training process. Despite this, it remains a widely used and effective algorithm in practice.

5. a. Describe Gradient Boosting.

b. Consider a dataset with only one (categorical) attribute. Suppose there are 10 unordered values in this attribute; how many possible combinations must be considered to find the best split point for building the decision tree classifier (considering only binary splits)?

a. Gradient Boosting is another powerful ensemble learning technique that, like AdaBoost, combines
the predictions of multiple weak learners to create a strong learner. However, Gradient Boosting
builds the ensemble in a different way. It focuses on minimizing the error of the overall ensemble by
sequentially adding weak learners, each one correcting the errors of its predecessors.

Here is an overview of the Gradient Boosting process:

1. Initialize the Model:


- The algorithm starts with an initial model, often a simple one like the mean of the target variable
for regression problems or a class with the highest frequency for classification problems.
2. Compute Residuals:
- For each sample in the training set, compute the difference between the actual target and the
predicted value from the current ensemble. These differences are called residuals.
3. Train Weak Learner:
- Fit a weak learner (e.g., decision tree) to predict the residuals. The goal is to correct the errors
made by the current ensemble.
4. Compute Learning Rate:
- Introduce a learning rate parameter (commonly denoted as \(\eta\)) to control the contribution of
each weak learner. It scales the contribution of each tree, helping to prevent overfitting.
5. Update Ensemble:
- Update the ensemble by adding the weak learner's prediction scaled by the learning rate. This
adjusts the overall prediction closer to the true target values.
6. Repeat:
- Repeat steps 2-5 for a predefined number of iterations or until a stopping criterion is met.
7. Final Model:
- The final model is the sum of all the weak learners' predictions.
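A hand-rolled sketch of these steps for squared-error regression (the data, tree depth, and learning rate are illustrative), which makes the residual-fitting idea explicit:

```python
# Start from the mean prediction, then repeatedly fit small trees to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # step 1: initialize with the mean
trees = []
for _ in range(100):                     # steps 2-6: iterate
    residuals = y - prediction           # step 2: residuals of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step 3: fit weak learner
    prediction += learning_rate * tree.predict(X)                # steps 4-5: scaled update
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```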
Key Features of Gradient Boosting:
1. Sequential Training:
- Unlike AdaBoost, which assigns weights to instances, Gradient Boosting builds the ensemble
sequentially, with each new weak learner focusing on correcting the errors of the existing ensemble.
2. Gradient Descent:
- The name "Gradient Boosting" comes from the fact that the algorithm uses gradient descent
optimization to minimize the overall error.
3. Flexible:
- Gradient Boosting is flexible and can be used for both regression and classification tasks.
4. Robust to Outliers:
- It is less sensitive to outliers compared to other algorithms like AdaBoost.
5. High Predictive Accuracy:
- Gradient Boosting often achieves high predictive accuracy and is widely used in practice for
various machine learning problems.

Popular implementations of Gradient Boosting include the Gradient Boosting Machines (GBM),
XGBoost, LightGBM, and CatBoost, each with its optimizations and enhancements for efficiency and
performance.
b. With q unordered values, the total number of possible binary splits is \(2^{q-1} - 1\).
Therefore, for q = 10: \(2^{10-1} - 1 = 2^{9} - 1 = 511\).
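A brute-force check of this count (a throwaway sketch; the value labels are arbitrary):

```python
# Count every way to send 10 unordered values into two non-empty groups,
# treating {A, B} and {B, A} as the same split.
from itertools import combinations

values = list(range(10))
splits = set()
for r in range(1, len(values)):
    for subset in combinations(values, r):
        left = frozenset(subset)
        right = frozenset(values) - left
        splits.add(frozenset([left, right]))

print(len(splits))  # 511 = 2**(10 - 1) - 1
```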

6. Which methods prevent overfitting in decision trees?


Several methods can be employed to prevent overfitting in decision trees, ensuring that the model
generalizes well to unseen data. Here are some common techniques:

1. Pruning:
- Pruning involves removing parts of the tree that do not provide significant power in predicting
target values. There are two main types of pruning:
- Pre-Pruning (Early Stopping): Stop growing the tree before it reaches a certain depth or node
size.
- Post-Pruning: Allow the tree to grow fully and then prune it by removing nodes that do not add
much predictive power.

2. Limiting Tree Depth:


- Restricting the maximum depth of the tree can prevent it from becoming too complex and
overfitting the training data. This is a form of pre-pruning.

3. Minimum Samples per Leaf:


- Set a minimum number of samples required to be present in a leaf node. This helps prevent the
creation of nodes that only fit a small number of instances, which might be noise in the data.

4. Minimum Samples per Split:


- Specify a minimum number of samples required to split an internal node. This ensures that nodes
with very few samples are not split, preventing overfitting.

5. Maximum Features:
- Limit the number of features considered for splitting at each node. This is particularly useful in
datasets with a large number of features to prevent the model from fitting the noise in the data.

6. Cross-Validation:
- Utilize cross-validation techniques to assess the model's performance on different subsets of the
training data. This helps in identifying whether the model is overfitting or not.
7. Ensemble Methods:
- Instead of relying on a single decision tree, use ensemble methods like Random Forest or Gradient
Boosting, which combine multiple trees to reduce overfitting.

8. Tuning Hyperparameters:
- Experiment with hyperparameters like the learning rate, minimum samples per split, or maximum
depth to find the best configuration that prevents overfitting.

9. Feature Engineering:
- Carefully select and preprocess features to provide a cleaner input to the decision tree. Removing
irrelevant or redundant features can help prevent overfitting.

10. Regularization Techniques:


- Some decision tree algorithms provide regularization parameters that penalize overly complex
trees. For example, scikit-learn's DecisionTreeClassifier has the `ccp_alpha` parameter for cost-
complexity pruning.
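As a minimal sketch of cost-complexity pruning with scikit-learn (the dataset and the chosen ccp_alpha value are illustrative; in practice the alpha is usually tuned with cross-validation):

```python
# Compare an unpruned tree with a cost-complexity-pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree typically overfits; a positive ccp_alpha prunes weak branches.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("Unpruned:", full.get_n_leaves(), "leaves, test acc", full.score(X_test, y_test))
print("Pruned:  ", pruned.get_n_leaves(), "leaves, test acc", pruned.score(X_test, y_test))
```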

The choice of which method or combination of methods to use depends on the specific characteristics
of the dataset and the problem at hand. It's often beneficial to try multiple approaches and evaluate
their impact on the model's performance using validation or test datasets.
7. Differentiate between Batch Gradient Descent and Stochastic Gradient Descent

Batch Gradient Descent vs. Stochastic Gradient Descent (SGD):

1. Batch GD computes the gradient using the whole training set; SGD computes the gradient using a single training sample.
2. Batch GD is a slow and computationally expensive algorithm; SGD is faster and less computationally expensive than batch GD.
3. Batch GD is not suggested for huge training sets; SGD can be used for large training sets.
4. Batch GD is deterministic in nature; SGD is stochastic in nature.
5. Batch GD gives the optimal solution given sufficient time to converge; SGD gives a good solution, but not necessarily the optimal one.
6. Batch GD requires no random shuffling of points; for SGD the samples should be processed in random order, which is why the training set is shuffled at every epoch.
7. Batch GD cannot escape shallow local minima easily; SGD can escape shallow local minima more easily.
8. Batch GD converges slowly; SGD reaches convergence much faster.
9. Batch GD updates the model parameters only after processing the entire training set; SGD updates the parameters after each individual data point.
10. In batch GD the learning rate is fixed and cannot be changed during training; in SGD the learning rate can be adjusted dynamically.
11. Batch GD typically converges to the global minimum for convex loss functions; SGD may converge to a local minimum or saddle point.
12. Batch GD may suffer from overfitting if the model is too complex for the dataset; SGD can help reduce overfitting by updating the model parameters more frequently.
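A minimal NumPy sketch contrasting the two update rules on a linear-regression problem (the synthetic data, learning rates, and iteration counts are illustrative):

```python
# Side-by-side sketch of the two update rules for linear regression (MSE loss).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Batch gradient descent: one update per pass over the whole training set.
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.05 * grad

# Stochastic gradient descent: one update per individual (shuffled) sample.
w_sgd = np.zeros(3)
for _ in range(20):                       # epochs
    for i in rng.permutation(len(y)):     # shuffle the order every epoch
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.01 * grad_i

print("batch GD:", w.round(3), " SGD:", w_sgd.round(3))
```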

8. Explain the Naive Bayes theorem with an example


Naive Bayes Theorem:
Naive Bayes is a probabilistic algorithm based on Bayes' theorem, which is a fundamental theorem in
probability. It makes the assumption that the features used to describe an observation are conditionally
independent given the class label. This assumption simplifies the computation and leads to a
computationally efficient and simple classification algorithm.

The Naive Bayes classifier applies Bayes' theorem, under the conditional-independence assumption, as:

\( P(Y \mid X_1, \ldots, X_n) = \frac{P(Y) \prod_{i=1}^{n} P(X_i \mid Y)}{P(X_1, \ldots, X_n)} \)

Example: Spam Classification using Naive Bayes:


Let's consider a simple example of classifying emails as spam or non-spam based on the presence of
certain words. Assume we have two features: X1 is the presence of the word "lottery," and X2 is the
presence of the word "buy."
- Training Data:
- We have a dataset with labeled emails:
- X1=1 if "lottery" is present, 0 otherwise

- X2=1 if "buy" is present, 0 otherwise

- Y = 1 for spam, 0 for non-spam


- Prior Probabilities:
- P(Y = 1) is the prior probability of spam.
- P(Y = 0) is the prior probability of non-spam.

- Likelihood:
- P(X1 | Y) is the likelihood of observing the word "lottery" given the email is spam.
- P(X2 | Y) is the likelihood of observing the word "buy" given the email is spam.

- Posterior Probability:
- P(Y = 1 | X1, X2) is the posterior probability of the email being spam given the presence of
"lottery" and "buy."
- Classification Decision:
- If P(Y = 1 | X1, X2) > P(Y = 0 | X1, X2), classify the email as spam; otherwise, classify it as non-spam.
In this example, the Naive Bayes classifier calculates the probabilities based on the training data, and
when a new email with features X1, X2 is encountered, it applies Bayes' theorem to determine the
probability of it being spam or non-spam. The classification decision is made based on comparing
these probabilities.
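A tiny sketch of this example with scikit-learn's BernoulliNB (the eight labelled emails below are made-up illustrative data):

```python
# Spam example with two binary features: X1 = "lottery" present, X2 = "buy" present.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: [X1, X2]; labels: 1 = spam, 0 = non-spam (toy training data)
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])

clf = BernoulliNB().fit(X, y)
new_email = np.array([[1, 0]])            # contains "lottery" but not "buy"
print("P(non-spam), P(spam):", clf.predict_proba(new_email)[0])
print("Predicted class:", clf.predict(new_email)[0])
```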
