COMPREHENSIVE EXAMINATION
(EC-3 Regular)
Q.1 Explain what the “margin” is in the Support Vector Machine (SVM) algorithm. Explain what
Support Vectors are. (5 marks)
The margin in SVM is defined as the distance between the separating hyperplane
(decision boundary) and the training samples that are closest to this hyperplane.
These samples are called support vectors. The objective of SVM is to find a
hyperplane with the maximum possible margin between the hyperplane and any
point within the training set, reducing the risk of error on the unseen or test data.
Support Vectors:
Support vectors are the data points that lie closest to the decision surface (or
hyperplane). They are the data points that are most difficult to classify and have a
direct bearing on the optimum location of the decision surface. They are the points
that, if removed, would alter the position of the dividing hyperplane. Because of this,
they can be considered the critical elements of a dataset.
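The relationship between the margin, the separating hyperplane, and the support vectors can be illustrated with a short sketch. The snippet below is illustrative only; it assumes scikit-learn and NumPy are available and uses a made-up toy dataset:

# A minimal sketch (assumes scikit-learn and NumPy are installed) showing how
# the support vectors of a fitted linear SVM can be inspected.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class dataset; all parameters here are purely illustrative.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The training points closest to the separating hyperplane:
print("Support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)

# For a linear kernel, the margin width is 2 / ||w||.
w = clf.coef_[0]
print("Margin width:", 2.0 / np.linalg.norm(w))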
Q.2 What are Type I and Type II errors in statistical hypothesis testing? Explain with examples. (5 marks)
Type I and Type II errors are concepts related to statistical hypothesis testing, and
they represent two types of incorrect conclusions that can be made in a test.
Type I Error: A Type I error, also known as a "false positive", occurs when we reject a
true null hypothesis. In other words, we conclude that there is an effect or
relationship when in fact there isn't. For example, if we are testing a new drug and we
conclude that it works when it actually does not, we have made a Type I error.
Type II Error: A Type II error, also known as a "false negative", occurs when we fail to
reject a false null hypothesis. That is, we conclude that there is no effect or relationship
when in fact there is. For example, if we conclude that a new drug does not work when it
actually does, we have made a Type II error.
The potential for Type I and Type II errors needs to be considered and balanced
when designing studies and interpreting data. The levels at which these errors are
acceptable are often determined by the potential consequences of making them.
For instance, in medical testing, making a Type I error (falsely diagnosing a patient
with a disease) might be seen as more acceptable than making a Type II error (failing
to diagnose a disease that is present), because the consequences of the latter could
be more severe.
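As a small illustration (assuming scikit-learn is available; the labels below are hypothetical), Type I and Type II errors correspond to the false-positive and false-negative cells of a binary confusion matrix:

# Relating a confusion matrix to Type I and Type II errors (illustrative sketch).
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = "effect present").
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positives (Type I errors): ", fp)  # claimed an effect that is not there
print("False negatives (Type II errors):", fn)  # missed an effect that is there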
Q.4 Explain the steps of the K-Means clustering algorithm. (10 Marks)
1. Initialization: Choose the number of clusters (K) you want to divide your data
into. Then randomly initialize 'K' centroids. Centroids are the center points of
clusters.
2. Assignment: Assign each data point to the nearest centroid. The measure of
'nearness' is usually Euclidean distance, but it can also be Manhattan distance
or other similar metrics. This forms K clusters.
3. Update: Calculate the new centroids as the mean (average) of all data points
in a cluster. The mean becomes the new centroid of the cluster.
4. Iteration: Repeat the assignment and update steps iteratively until the
centroids no longer change significantly or a set maximum number of iterations
is reached. This means that the clusters have become stable.
5. Termination: The algorithm terminates when either the maximum number of
iterations is reached, or the centroids do not change significantly in two
consecutive iterations, or the change in the objective function is below a
certain threshold.
6. Evaluation: Evaluate the quality of the clusters using some evaluation metrics
like Silhouette Score, Dunn Index, etc.
The goal of K-means clustering is to minimize the distance between points within a
cluster and maximize the distance between different clusters. It is important to note
that the initial selection of centroids is random, so K-means can produce different
results on different runs. To mitigate this, the algorithm is often run multiple times
with different initializations, and the run with the lowest within-cluster sum of
squares (inertia) is chosen.
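The steps above can be sketched directly in NumPy. The following is a minimal, illustrative (not optimized) implementation, run on hypothetical 2-D data:

# A minimal NumPy sketch of the K-means steps described above.
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: the new centroid is the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4./5. Iteration and termination: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical two-cluster data for illustration.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)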
Q.5 What is Ensemble learning? What are the different types? In which category does
Random Forest algorithm fall and why? (10 Marks)
Ensemble learning is a technique in which several base models are trained and their
predictions are combined to produce a single model that is more robust than any of
the individual models alone. The main types are Bagging (bootstrap aggregating),
Boosting, and Stacking.
Random Forest falls into the Bagging category of ensemble learning. The Random
Forest algorithm builds a set of decision trees, each from a randomly selected subset
of the training set, and then aggregates the votes from the different decision trees to
decide the final class of the test object. Combining multiple models in this way helps
to reduce overfitting. It falls into the Bagging category because it involves creating
multiple bootstrap subsets of the original dataset, fitting a separate model to each
subset, and then combining their predictions.
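A short sketch (assuming scikit-learn is available, with a made-up dataset) of how Random Forest applies bagging in practice:

# Bagging via Random Forest: many trees trained on random subsets of the data,
# with their predictions aggregated by majority vote (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical dataset for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the n_estimators trees is fit on a bootstrap sample of the training
# set and considers a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# The final class is the majority vote over the individual trees.
print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))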
Q.6 What problem occurs when you use accuracy as a metric to evaluate a model built
on a highly imbalanced dataset? Explain. How can we handle this? (10 Marks)
When a dataset is highly imbalanced, accuracy is a misleading metric because a
model can achieve a high score simply by always predicting the majority class. For
example, in a dataset where 95% of the instances are of the 'negative' class, a
model that always predicts 'negative' will be 95% accurate, despite failing to correctly
identify any 'positive' instances.
However, in many cases, correctly identifying the minority class is more important. In
a fraud detection scenario, for example, the number of non-fraud cases (negative
class) greatly outnumbers the fraud cases (positive class). A model that always
predicts 'non-fraud' could be highly accurate, but useless for the purpose of
detecting fraud.
To handle this, we can use other evaluation metrics that give a better picture of the
model's performance on the minority class, such as precision, recall, the F1-score,
and the area under the ROC curve (ROC-AUC). Resampling techniques, such as
oversampling the minority class or undersampling the majority class, and class
weighting can also be used to address the imbalance itself.
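A small sketch (assuming scikit-learn is available) of the accuracy pitfall described above, using hypothetical 95%/5% labels and a model that always predicts the majority class:

# Accuracy looks good on imbalanced data even when the model is useless;
# precision, recall, and F1 expose the problem (illustrative sketch).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 95% negative, 5% positive; the "model" always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.95, looks good
print("Recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, no positives found
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("F1-score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0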