
Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


First Semester 2022-2023

COMPREHENSIVE EXAMINATION
(EC-3 Regular)

Course No. : BA ZG512
Course Title : PREDICTIVE ANALYTICS
Nature of Exam : Open Book
Weightage : 45%
Duration : 2½ Hours
Date of Exam : Saturday, 26/11/2022 (FN)
No. of Pages = 1
No. of Questions = 6
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made, if any, should be stated clearly at the beginning of your answer.

Q.1 Explain what "margin" is in the Support Vector Machine (SVM) algorithm. Explain what
Support Vectors are. (5 marks)

Margin in Support Vector Machine (SVM) algorithm:

The margin in SVM is defined as the distance between the separating hyperplane
(decision boundary) and the training samples that are closest to this hyperplane.
These samples are called support vectors. The objective of SVM is to find the
hyperplane with the maximum possible margin, i.e. the greatest distance between the
hyperplane and the nearest points in the training set, which reduces the risk of
error on unseen or test data.

Support Vectors:

Support vectors are the data points that lie closest to the decision surface (or
hyperplane). They are the data points most difficult to classify and have a direct
bearing on the optimum location of the decision surface. They are the points that,
if removed, would alter the position of the dividing hyperplane. Because of this,
they can be considered the critical elements of a dataset.
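
A minimal sketch of these ideas (assuming scikit-learn and NumPy are available; the
toy data is illustrative, not from the question): fit a linear SVM, then inspect the
support vectors and the margin width.

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable blobs (illustrative data)
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3)  # a large C approximates a hard margin
clf.fit(X, y)

# The training points closest to the decision boundary
print("Support vectors:\n", clf.support_vectors_)

# For a linear SVM, the distance from the hyperplane to the nearest
# point on either side is 1 / ||w||, so the full margin is 2 / ||w||.
w = clf.coef_[0]
print("Margin width:", 2.0 / np.linalg.norm(w))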
Q.2 What is Cross-Validation in Machine Learning? Explain the process. (5 marks)

Cross-validation is a resampling procedure used to evaluate machine learning
models on a limited data sample. The goal of cross-validation is to test the model's
ability to predict new data that was not used in estimating it, in order to flag
problems like overfitting or selection bias and to give insight into how the model
will generalize to an independent dataset.

The Process of Cross-Validation:

1. Dataset Splitting: In the cross-validation process, the dataset is randomly split
into 'k' subsets or folds. If k = 5, the dataset is split into 5 subsets; this is
often called 5-fold cross-validation.
2. Model Training and Validation: The model is trained on k-1 folds with one
held back for testing. For example, if we have 5 folds, the model will be trained
on 4 folds and the remaining fold will be used for testing.
3. Performance Assessment: The model's performance is recorded against the
held-back fold. The performance measure could be accuracy, recall, precision,
F1-score, or any other metric that is important to the problem at hand.
4. Repeat Steps 2 and 3: This process is repeated k times, each time with a
different fold held back for testing.
5. Average Performance: Once all iterations are complete, the performance across
the folds is averaged, giving a more robust and less biased estimate of the
model's performance.
6. Model Adjustment: If the model’s performance isn’t satisfactory, you could try
a different algorithm, introduce new features, or perform feature selection.

Cross-validation is a powerful preventative measure against overfitting. The idea is
to use most of our data for training, and a portion that is not involved in training
for testing or validation.
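
A minimal sketch of steps 1-5 (assuming scikit-learn; the iris dataset and logistic
regression model are illustrative choices): cross_val_score handles the splitting,
the k rounds of training and testing, and returns the per-fold scores to average.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, test on the held-back fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())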

Q.3 What do you understand by Type I vs Type II error? (5 marks)

Type I and Type II errors are concepts related to statistical hypothesis testing, and
they represent two types of incorrect conclusions that can be made in a test.

Type I Error: A Type I error, also known as a "false positive", occurs when we reject a
true null hypothesis. In other words, we conclude that there is an effect or
relationship when in fact there isn't. For example, if we are testing a new drug and we
conclude that it works when it actually does not, we have made a Type I error.
Type II Error: A Type II error, also known as a "false negative", occurs when we fail to
reject a false null hypothesis. In other words, we conclude that there is no effect or
relationship when in fact there is. For example, if we conclude that a new drug does
not work when it actually does, we have made a Type II error.

The potential for Type I and Type II errors needs to be considered and balanced
when designing studies and interpreting data. The levels at which these errors are
acceptable are often determined by the potential consequences of making such errors.
For instance, in medical testing, making a Type I error (falsely diagnosing a patient
with a disease) might be seen as more acceptable than making a Type II error (failing
to diagnose a disease that is present), because the consequences of the latter could
be more severe.
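
A small simulation sketch (assuming NumPy and SciPy; the one-sample t-test and
alpha = 0.05 are illustrative choices): when the null hypothesis is actually true, a
test at significance level alpha rejects it, i.e. commits a Type I error, in roughly
a fraction alpha of repeated experiments.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000

type_i = 0
for _ in range(n_experiments):
    # The null hypothesis (mean = 0) is true for this sample
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:  # rejecting a true null is a Type I error
        type_i += 1

print("Observed Type I error rate:", type_i / n_experiments)  # about 0.05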

Q.4 Explain the steps of algorithm for K-Means clustering. (10 Marks)

K-Means clustering is a type of unsupervised learning algorithm used to group items
into clusters based on their similarity. The algorithm proceeds as follows:

1. Initialization: Choose the number of clusters (K) you want to divide your data
into. Then randomly initialize 'K' centroids. Centroids are the center points of
clusters.
2. Assignment: Assign each data point to the nearest centroid. The measure of
'nearness' is usually Euclidean distance, but it can also be Manhattan distance
or other similar metrics. This forms K clusters.
3. Update: Calculate the new centroids as the mean (average) of all data points
in a cluster. The mean becomes the new centroid of the cluster.
4. Iteration: Repeat the assignment and update steps iteratively until the
centroids do not change significantly, or a certain number of iterations is
reached. This means that the clusters have become stable.
5. Termination: The algorithm terminates when either the maximum number of
iterations is reached, or the centroids do not change significantly in two
consecutive iterations, or the change in the objective function is below a
certain threshold.
6. Evaluation: Evaluate the quality of the clusters using some evaluation metrics
like Silhouette Score, Dunn Index, etc.

The goal of K-means clustering is to minimize the distance between the points within
a cluster and maximize the distance between different clusters. It's important to note
that the initial selection of centroids is random, so K-means can produce different
results on different runs of the algorithm. To mitigate this, the algorithm is often run
multiple times with different initial centroids, and the run with the best objective
value (e.g. the lowest within-cluster sum of squares) is chosen.
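
A from-scratch sketch of steps 1-5 (assuming NumPy; the function, variable names, and
toy data are illustrative, and empty clusters are not handled for brevity):

import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point joins the cluster of its nearest
        #    centroid, using Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: new centroid = mean of the points in each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4/5. Iterate until the centroids barely move, then terminate
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Usage on two well-separated toy blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("Centroids:\n", centroids)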
Q.5 What is Ensemble learning? What are the different types? In which category does
Random Forest algorithm fall and why? (10 Marks)

Ensemble learning is a machine learning paradigm where multiple models (often
called "base learners") are trained to solve the same problem and combined to get
better results. The main hypothesis behind ensemble methods is that when weak
models are correctly combined, we can obtain more accurate and/or robust models.

Different types of ensemble learning methods:

1. Bagging: Bagging stands for bootstrap aggregation. It combines multiple
learners, each trained on a bootstrap sample of the data, in a way that reduces
the variance of the estimates. For example, Random Forest is a bagging algorithm.
2. Boosting: Boosting is a sequential process, where each subsequent model
attempts to correct the errors of the previous model, so the succeeding models
are dependent on the previous ones. Examples include AdaBoost, Gradient
Boosting, and XGBoost.
3. Stacking: In this approach, we combine models of different types (for instance,
a KNN classifier with a logistic regression model) or models trained on different
training sets. The base-level models are trained on the complete training set,
and a meta-model is then fitted on the outputs of the base-level models.

Random Forest falls into the Bagging category of ensemble learning. The Random
Forest algorithm builds a set of decision trees, each from a randomly selected
(bootstrap) subset of the training set, and then aggregates the votes from the
different decision trees to decide the final class of the test object. This method of
combining multiple models helps to overcome the overfitting problem. The reason it
falls into the Bagging category is that it involves creating multiple bootstrap
subsets of the original dataset, fitting a model to each subset, and then combining
the predictions; in addition, it considers a random subset of features at each split,
which decorrelates the trees further.
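
A sketch (assuming scikit-learn; the dataset and hyperparameters are illustrative)
contrasting plain bagging of decision trees with Random Forest, which adds the
per-split feature randomness:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: one decision tree per bootstrap sample of the data
# (BaggingClassifier uses a decision tree as its default base learner)
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random Forest: bagging plus a random subset of features at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("Bagging", bagging), ("Random Forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))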

Q.6 What problem occurs when you use accuracy as a metric to evaluate a model built
over a highly imbalanced dataset? Explain. How can we handle this? (10 Marks)

When a model is evaluated on a highly imbalanced dataset using accuracy as a
metric, it might give misleading results. This is because accuracy simply calculates
the ratio of correct predictions to total predictions. On a highly imbalanced dataset,
the model might almost always predict the majority class and still achieve high accuracy.

For example, in a dataset where 95% of the instances are of the 'negative' class, a
model that always predicts 'negative' will be 95% accurate, despite failing to correctly
identify any 'positive' instances.

However, in many cases, correctly identifying the minority class is more important. In
a fraud detection scenario, for example, the number of non-fraud cases (negative
class) greatly outnumbers the fraud cases (positive class). A model that always
predicts 'non-fraud' could be highly accurate, but useless for the purpose of
detecting fraud.

To handle this, we can use other evaluation metrics that give us a better picture of
the model's performance:

1. Precision: It is the ratio of correctly predicted positive observations to the
total predicted positives.
2. Recall (Sensitivity): It is the ratio of correctly predicted positive observations
to all observations in the actual positive class.
3. F1 Score: The F1 score is the harmonic mean of precision and recall. It tries to
find the balance between precision and recall.
4. Area Under the ROC curve (AUC-ROC): The ROC curve is a plot of recall
(sensitivity) against 1 - specificity (the false positive rate). The area under
this curve gives a single-valued summary of the model's performance.
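
A sketch (assuming scikit-learn; the 95/5 class split mirrors the example above)
showing how accuracy hides a useless majority-class predictor while precision,
recall, and F1 expose it:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# 95% 'negative' (0), 5% 'positive' (1)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a model that always predicts 'negative'

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall   :", recall_score(y_true, y_pred))                      # 0.0
print("F1 score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0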

Additionally, techniques like oversampling the minority class, undersampling the
majority class, or generating synthetic minority samples (e.g. SMOTE) can be used to
handle imbalanced datasets. Using class weights to give higher importance to the
minority class can also help, as sketched below.
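
A sketch of the class-weight approach (assuming scikit-learn; the synthetic dataset
and logistic regression model are illustrative): class_weight="balanced" scales each
class's contribution to the loss inversely to its frequency, so minority-class
mistakes cost more.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 95/5 imbalance, mirroring the fraud-detection example
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in [None, "balanced"]:
    model = LogisticRegression(class_weight=cw, max_iter=1000)
    model.fit(X_tr, y_tr)
    rec = recall_score(y_te, model.predict(X_te))
    print("class_weight =", cw, "-> minority-class recall:", round(rec, 2))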

********
