MACHINE LEARNING

Paper Code: ETCS-402


Paper: Machine Learning

UNIT-I
Introduction:
Basic concepts: Definition of learning systems, Goals and applications of machine learning. Aspects of developing a
learning system: training data, concept representation, function approximation.
Types of Learning: Supervised learning and unsupervised learning. Overview of classification: setup, training, test,
validation dataset, overfitting.
Classification Families: linear discriminative, non-linear discriminative, decision trees, probabilistic (conditional and
generative), nearest neighbor.

UNIT-II
Logistic regression, Perceptron, Exponential family, Generative learning algorithms, Gaussian discriminant analysis,
Naive Bayes, Support vector machines: Optimal hyperplane, Kernels. Model selection and feature selection. Combining
classifiers: Bagging, boosting (the AdaBoost algorithm), Evaluating and debugging learning algorithms, Classification
errors.
FIRST TERM EXAMINATION
2017

Q.1. (a) What do you understand by Reinforcement Learning?


Ans. Reinforcement Learning (RL) is a type of machine learning paradigm where an agent learns to make decisions by
interacting with an environment. The agent learns through trial and error, aiming to maximize some notion of
cumulative reward. Unlike supervised learning, where the model is trained on labeled data, and unsupervised learning,
where the model learns patterns from unlabeled data, RL learns from feedback received from the environment. This
feedback is typically in the form of rewards or penalties, which the agent seeks to maximize or minimize over time by
taking actions in the environment. RL is commonly used in scenarios where an agent needs to learn to make a sequence
of decisions in order to achieve a long-term goal, such as game playing, robotics, and autonomous driving.

Q.1. (b) What is overfitting?


Ans. Overfitting is a common problem in machine learning where a model learns the training data too well, capturing
noise or random fluctuations in the data rather than the underlying pattern. This results in a model that performs well
on the training data but poorly on unseen or test data. In other words, the model has memorized the training data
rather than learning the generalizable pattern. Overfitting typically occurs when a model is too complex relative to the
amount of training data available. To detect and prevent overfitting, techniques such as cross-validation, regularization,
and early stopping are commonly used. Regularization methods penalize overly complex models, while early stopping
halts training when the model starts to overfit the training data.
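The effect can be demonstrated in a few lines. The sketch below uses made-up data (noisy samples of the line y = 2x): a degree-7 polynomial drives the training error toward zero by fitting the noise, yet generalizes worse than a simple line on held-out points.

```python
import numpy as np

# Toy demonstration of overfitting: the true relation is y = 2x, observed
# with noise. A degree-7 polynomial can pass (almost) exactly through the
# 8 noisy training points, but generalizes worse than a simple line.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.3, 8)
x_test = np.linspace(0.05, 0.95, 8)      # held-out points on the true line
y_test = 2 * x_test

def errors(degree):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

simple = errors(1)    # low complexity, close to the true relation
complex_ = errors(7)  # interpolates the noise: low train error, higher test error
print(simple, complex_)
```

Comparing the two pairs of errors shows the characteristic overfitting signature: the complex model wins on training error but loses on test error.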

Q.1. (c) What do you understand by POMDP?


Ans. POMDP stands for Partially Observable Markov Decision Process. It is an extension of the Markov Decision Process
(MDP) framework that accounts for uncertainty in decision making by incorporating partial observability. In a POMDP,
an agent makes decisions in an environment where it does not directly observe the underlying state of the environment
but instead receives observations that are probabilistically related to the true state. The agent's goal is to maximize
some notion of cumulative reward over time by choosing a sequence of actions based on the observations received and
the belief about the current state of the environment. POMDPs are commonly used in sequential decision-making
problems where there is uncertainty and partial observability, such as robot navigation, dialogue systems, and
autonomous vehicle control.

Q.1. (d) Explain the concept of the Hidden Markov Model.


Ans. A Hidden Markov Model (HMM) is a statistical model used to describe sequences of observable events that are
generated by an underlying stochastic process with hidden states. In an HMM, there are two main components: the
hidden states and the observable symbols. The hidden states form a Markov chain where each state depends only on
the previous state, and the transitions between states are governed by transition probabilities. However, the hidden
states are not directly observable; instead, each hidden state emits observable symbols with certain probabilities. These
emission probabilities are associated with each state and determine the likelihood of emitting each observable symbol.
Given a sequence of observable symbols, the goal of an HMM is to infer the most likely sequence of hidden states that
generated the observed data. HMMs are widely used in various applications such as speech recognition, bioinformatics,
natural language processing, and financial modeling.
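Inferring the most likely hidden-state sequence is usually done with the Viterbi algorithm. The compact sketch below uses a made-up two-state model (the states, probabilities, and observations are illustrative, not taken from the text).

```python
# Viterbi decoding for a small, made-up HMM: two hidden weather states,
# three observable activities. All probabilities are illustrative.
states = ['Rainy', 'Sunny']
start = {'Rainy': 0.6, 'Sunny': 0.4}
trans = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
         'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def viterbi(obs):
    # V[t][s]: probability of the best path that ends in state s at time t
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans[p][s])
            V[t][s] = V[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
            back[t][s] = prev
    # trace back from the most probable final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(['walk', 'shop', 'clean']))
```

The dynamic-programming table V is exactly the "most likely sequence of hidden states" computation described above, done in time linear in the sequence length.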
Q.2. Explain:
Q.2. (a) Markov Decision Process (MDP)
Ans. A Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems in
situations where outcomes are partially random and partially under the control of a decision-maker. It is defined by a
tuple (S,A,P,R,γ):
- S represents the set of states in the environment.
- A represents the set of actions that the decision-maker can take.
- P is the state transition probability function, which specifies the probability of transitioning from one state to another given a particular action.
- R is the reward function, which specifies the immediate reward received by the decision-maker for taking a particular action in a particular state.
- γ is the discount factor, which determines the importance of future rewards relative to immediate rewards.
The objective in an MDP is to find a policy, denoted by π, that maps each state to the best action to take in that state in
order to maximize the cumulative expected reward over time. This is typically achieved through algorithms such as value
iteration, policy iteration, or reinforcement learning methods.

Q.2.(b) Bellman's Equation


Ans. Bellman's Equation is a fundamental concept in dynamic programming and reinforcement learning, used to express
the value of a state or state-action pair in terms of the expected immediate reward and the expected value of the
successor states. There are two main formulations of Bellman's Equation:
- Bellman Expectation Equation (for state values):
  V(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
  where:
  - V(s) is the value of state s.
  - π(a|s) is the policy's probability of taking action a in state s.
  - R(s,a) is the immediate reward obtained after taking action a in state s.
  - P(s'|s,a) is the transition probability from state s to state s' after taking action a.
  - γ is the discount factor.
- Bellman Expectation Equation (for action values):
  Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')
  where:
  - Q(s,a) is the value of taking action a in state s.
These equations are used iteratively to estimate the values of states or state-action pairs in MDPs.
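Replacing the expectation over π(a|s) with a max over actions gives the Bellman optimality equation, which can be iterated directly; that is value iteration. A minimal sketch on a made-up two-state, two-action MDP (all numbers illustrative):

```python
import numpy as np

# Value iteration on a toy MDP. P[s, a, s'] is the transition probability
# of reaching s' from s under action a; R[s, a] is the immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy with respect to the converged values
print(V.round(3), policy)
```

Because the backup is a γ-contraction, the loop converges to a fixed point of the optimality equation; the greedy policy read off from Q is then optimal for this MDP.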

Q.2. (c) Value function approximation algorithm


Ans. Value function approximation algorithms are techniques used to estimate the value function in large or continuous
state spaces where exact methods such as dynamic programming become computationally infeasible. Instead of storing
the value of each state explicitly, these algorithms approximate the value function using a parameterized function, often
a neural network or a linear function.
One common approach is to use gradient-based optimization methods to adjust the parameters of the approximation
function in order to minimize the difference between the predicted values and the observed rewards. Examples of value
function approximation algorithms include:
- Deep Q-Networks (DQN): Uses deep neural networks to approximate the Q-values in reinforcement learning problems.
- Approximate Dynamic Programming (ADP): Utilizes function approximation techniques such as linear programming or neural networks to solve MDPs approximately.
- TD-learning with Function Approximation: Temporal Difference (TD) learning algorithms, such as TD(λ) or SARSA(λ), combined with function approximation, allow for efficient learning in large state spaces.
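As a concrete sketch of the last item, here is semi-gradient TD(0) with a linear value function on a toy 5-state random walk (terminal at both ends, reward 1 only for exiting on the right). The one-hot features make the linear form reduce to a table, but the update is written in the general linear form; the environment is made up for illustration.

```python
import random

# Semi-gradient TD(0) with a linear value function V(s) = w . phi(s).
N = 5
def features(s):
    phi = [0.0] * N
    phi[s] = 1.0              # one-hot feature vector for state s
    return phi

def value(w, s):
    return sum(wi * fi for wi, fi in zip(w, features(s)))

w = [0.0] * N
alpha, gamma = 0.05, 1.0
rng = random.Random(0)
for _ in range(5000):
    s = 2                     # every episode starts in the middle state
    while 0 < s < N - 1:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == N - 1 else 0.0
        target = r if s_next in (0, N - 1) else r + gamma * value(w, s_next)
        # semi-gradient TD(0): w += alpha * (target - V(s)) * grad_w V(s)
        delta = target - value(w, s)
        w = [wi + alpha * delta * fi for wi, fi in zip(w, features(s))]
        s = s_next

print([round(v, 2) for v in w[1:4]])   # true values are 0.25, 0.5, 0.75
```

With a richer feature vector (and the same update) the identical code generalizes across states instead of storing one value per state, which is the point of function approximation.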

Q.3 (a) Explain in detail Q-Learning


Ans. Q-Learning is a model-free reinforcement learning algorithm used to find an optimal action-selection policy for a
given finite Markov Decision Process (MDP). It is based on the principle of iteratively estimating the values of state-
action pairs (Q-values) while interacting with the environment. Q-learning is particularly effective in situations where the
dynamics of the environment are unknown, and the agent learns from trial and error.

Here's a step-by-step explanation of Q-Learning:

1. **Initialization**: Initialize a Q-table with random values or zeros, where each row corresponds to a state and each
column corresponds to an action.

2. **Exploration-Exploitation Tradeoff**: During each time step, the agent selects an action to take in the current state
based on an exploration-exploitation strategy. This strategy balances between exploring new actions and exploiting the
current best-known actions.

3. **Action Selection**: The agent selects an action based on the current state and the Q-values stored in the Q-table.
Common exploration-exploitation strategies include ε-greedy, softmax exploration, and Upper Confidence Bound (UCB).

4. **Observation and Reward**: After taking an action, the agent observes the next state and the immediate reward
received from the environment.

5. **Update Q-values**: Using the observed reward and the next state, the agent updates the Q-value of the current
state-action pair using the Q-learning update rule:
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]
where:
- \( Q(s, a) \) is the Q-value of state-action pair \( (s, a) \).
- \( r \) is the immediate reward received after taking action \( a \) in state \( s \).
- \( s' \) is the next state.
- \( \alpha \) is the learning rate (step size), controlling the rate of updates.
- \( \gamma \) is the discount factor, determining the importance of future rewards.

6. **Repeat**: Continue interacting with the environment, selecting actions, observing rewards and states, and
updating Q-values until convergence or a predefined stopping criterion is met.

7. **Policy Extraction**: Once the Q-values have converged or after a certain number of iterations, the agent extracts
the optimal policy by selecting the action with the highest Q-value for each state.

Q-Learning is a powerful and widely used algorithm in reinforcement learning, particularly in scenarios where the agent
interacts with an unknown environment and must learn from experience.
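The steps above can be sketched in a few lines. The toy "chain" environment below (5 states, action 1 moves right, action 0 moves left, reward 1 for reaching the right end) is made up for illustration; the update rule is the one from step 5.

```python
import random

# Tabular Q-learning on a toy 5-state chain; state 4 is terminal.
N_STATES = 5
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # step 1: initialize the Q-table

random.seed(0)
for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # steps 2-3: epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice([0, 1])
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        # step 4: observe the next state and the reward
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # step 5: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# step 7: extract the greedy policy (here it learns to always move right)
policy = [0 if q[0] > q[1] else 1 for q in Q[:-1]]
print(policy)
```

Note that the update uses max over the next state's Q-values regardless of which action is actually taken next; this is what makes Q-learning an off-policy method.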

Q.3. (b) What do you mean by Linear Quadratic Regulation (LQR)?


Ans. Linear Quadratic Regulation (LQR) is a classical control technique used to design optimal control policies for linear
dynamical systems subject to quadratic cost functions. It is applicable to both continuous and discrete-time systems.

In LQR, the dynamics of the system are described by a linear state-space model:
\[ x_{t+1} = A x_t + B u_t \]
where:
- \( x_t \) is the state of the system at time \( t \).
- \( u_t \) is the control input at time \( t \).
- \( A \) is the state transition matrix.
- \( B \) is the control input matrix.

The goal of LQR is to find a control policy that minimizes a quadratic cost function of the form:
\[ J = \sum_{t=0}^{T} (x_t^T Q x_t + u_t^T R u_t) \]
where:
- \( Q \) is the state cost matrix, representing the penalty for deviation from the desired state.
- \( R \) is the control cost matrix, representing the penalty for control effort.
- \( T \) is the time horizon.

The optimal control policy for LQR can be obtained by solving the associated algebraic Riccati equation, which gives the
optimal state feedback gain matrix \( K \). The optimal control input is then given by:
\[ u_t = -K x_t \]

LQR is widely used in various fields, including aerospace, robotics, and process control, for designing optimal control
strategies that minimize a specified cost function while satisfying system dynamics and constraints.
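For the discrete-time infinite-horizon case, the gain \( K \) can be found by iterating the Riccati recursion to a fixed point. The sketch below uses illustrative matrices (a double-integrator-like system, not any specific plant).

```python
import numpy as np

# Discrete-time infinite-horizon LQR via fixed-point Riccati iteration.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Qc = np.eye(2)                 # state cost Q
Rc = np.array([[1.0]])         # control cost R

P = Qc.copy()
for _ in range(1000):
    # K = (R + B'PB)^(-1) B'PA ;  Riccati update: P <- Q + A'P(A - BK)
    K = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)
    P_new = Qc + A.T @ P @ (A - B @ K)
    if np.max(np.abs(P_new - P)) < 1e-12:
        P = P_new
        break
    P = P_new

# The optimal feedback is u_t = -K x_t; simulate the stable closed loop.
x = np.array([5.0, 0.0])
for _ in range(100):
    x = (A - B @ K) @ x
print(K.round(3), np.linalg.norm(x))
```

The simulated state norm shrinking toward zero reflects the guarantee that, for a controllable system with positive-definite costs, the LQR closed loop \( A - BK \) is stable.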

Q.4. Explain in detail Spectral Clustering?


Ans. Spectral Clustering is a popular technique used for clustering data points based on the similarity between them.
Unlike traditional clustering methods like K-means, which operate in the original feature space, spectral clustering
operates on a transformed representation of the data, typically obtained from the eigenvectors of a similarity or affinity
matrix.

Here's how spectral clustering works:

1. **Construct Similarity Graph**: Given a dataset with \( N \) data points, a similarity graph is constructed where each
node represents a data point, and the edges represent the pairwise similarity between data points. Common methods
for measuring similarity include Gaussian kernel, nearest neighbors, or epsilon-neighborhood.

2. **Affinity Matrix**: From the similarity graph, an affinity matrix \( W \) is constructed, where \( W_{ij} \) represents
the similarity between data points \( i \) and \( j \). This matrix captures the pairwise relationships between data points.

3. **Graph Laplacian**: The graph Laplacian matrix \( L \) is computed from the affinity matrix. There are different
formulations of the graph Laplacian, such as the unnormalized Laplacian, normalized Laplacian, and random walk
Laplacian.

4. **Eigendecomposition**: Eigenvectors and eigenvalues of the graph Laplacian matrix are computed. The
eigenvectors corresponding to the smallest eigenvalues capture the underlying structure of the data and are used for
clustering.

5. **Dimensionality Reduction**: The eigenvectors are used to embed the data points into a lower-dimensional space.
Typically, the eigenvectors corresponding to the \( k \) smallest eigenvalues (where \( k \) is the number of clusters) are
selected.
6. **Clustering**: Finally, traditional clustering algorithms such as K-means or Normalized Cuts are applied to the
embedded data points in the reduced-dimensional space to obtain the final clusters.

Spectral clustering is effective for identifying clusters in complex datasets, including those with non-linear decision
boundaries or irregular shapes. It also has connections to graph theory and spectral graph theory, making it a powerful
tool for analyzing data with underlying graph structures.
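The six steps can be sketched with plain NumPy. The two synthetic point blobs below stand in for a real dataset, and a simple sign split on the second eigenvector stands in for K-means in step 6.

```python
import numpy as np

# Spectral clustering sketch on two well-separated synthetic 2-D blobs.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
                 rng.normal(5.0, 0.1, (10, 2))])

# Steps 1-2: Gaussian-kernel affinity matrix W
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 2.0)

# Step 3: unnormalized graph Laplacian L = D - W
D = np.diag(W.sum(axis=1))
L = D - W

# Steps 4-5: eigenvectors of the 2 smallest eigenvalues embed the points
eigvals, eigvecs = np.linalg.eigh(L)
embed = eigvecs[:, :2]

# Step 6: a sign split on the second eigenvector (the "Fiedler vector")
# stands in for K-means on the embedding.
labels = (embed[:, 1] > np.median(embed[:, 1])).astype(int)
print(labels)
```

Because the blobs are nearly disconnected in the affinity graph, the Fiedler vector is approximately piecewise constant on the two components, so even this crude split recovers the clusters.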

FIRST TERM EXAMINATION


2018

Q. 1. (a) What are goals of Machine Learning?


Ans. The goals of machine learning can vary depending on the specific task and application, but generally, they include:
1. Prediction: One of the primary goals of machine learning is to make accurate predictions or classifications based
on input data. This involves learning patterns and relationships from historical data to generalize to unseen
instances.
2. Classification: Another goal is to classify data points into different categories or classes based on their features.
Classification tasks include spam detection, image recognition, and sentiment analysis.
3. Regression: Regression aims to predict continuous numerical values based on input features. It is used in various
applications such as predicting house prices, stock market trends, or customer churn rates.
4. Clustering: Clustering involves grouping similar data points together based on their inherent patterns or
similarities. Clustering algorithms help discover hidden structures within the data and are used in customer
segmentation, anomaly detection, and recommendation systems.
5. Anomaly Detection: Detecting anomalies or outliers in data is another goal of machine learning. Anomaly
detection techniques help identify unusual or unexpected patterns that deviate from normal behavior, such as
fraudulent transactions or equipment failures.
6. Reinforcement Learning: In reinforcement learning, the goal is to develop agents that learn optimal decision-
making strategies by interacting with an environment. The agent aims to maximize cumulative rewards over
time by taking actions and learning from feedback.

Q.1. (b) What is the difference between Supervised and unsupervised learning?
Ans. The main difference between supervised and unsupervised learning lies in the type of training data used and the
nature of the learning process:
1. Supervised Learning:
 - Supervised learning involves learning a mapping from input data to output labels based on labeled training examples.
 - In supervised learning, the algorithm is provided with a dataset consisting of input-output pairs, where the outputs are known or labeled.
 - The goal is to learn a model that can generalize from the training data to make accurate predictions or classifications on unseen data.
 - Examples of supervised learning algorithms include linear regression, logistic regression, decision trees, support vector machines (SVM), and neural networks.
2. Unsupervised Learning:
 - Unsupervised learning involves discovering patterns or structures in input data without explicit supervision or labeled outputs.
 - In unsupervised learning, the algorithm is provided with a dataset consisting only of input data, and it must learn to find hidden patterns or groupings on its own.
 - The goal is to explore the underlying structure of the data, such as clustering similar data points together or reducing the dimensionality of the data.
 - Examples of unsupervised learning algorithms include K-means clustering, hierarchical clustering, principal component analysis (PCA), and autoencoders.

Q.1. (c) What is Logistic Regression?


Ans. Logistic Regression is a statistical method used for binary classification tasks, where the output variable (target) is
categorical and has only two possible outcomes, typically labeled as 0 and 1 (e.g., yes/no, true/false, spam/not spam).
In logistic regression, the relationship between the input features and the probability of belonging to a particular class is
modeled using the logistic function (also known as the sigmoid function). The logistic function transforms the output of a
linear combination of input features into a value between 0 and 1, representing the probability of the positive class.
The logistic regression model can be represented by the following equation:
P(y=1|x) = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + ... + βnxn)))
Where:
 - P(y=1|x) is the probability of the positive class given input features x.
 - e is the base of the natural logarithm.
 - β0, β1, ..., βn are the coefficients (parameters) of the model.
 - x1, x2, ..., xn are the input features.
During training, logistic regression aims to learn the optimal values of the coefficients by minimizing a loss function,
typically the cross-entropy loss, using optimization techniques such as gradient descent.
Logistic regression is widely used in various fields, including medicine (e.g., disease diagnosis), finance (e.g., credit risk
assessment), and marketing (e.g., customer churn prediction), due to its simplicity, interpretability, and effectiveness for
binary classification tasks.
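A bare-bones version of this training loop, on a made-up 1-D dataset (points with x > 2.5 labeled 1) and with an illustrative learning rate, might look like:

```python
import math

# Logistic regression with batch gradient descent on a tiny toy dataset.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.0, 0.0          # coefficients beta_0 (intercept) and beta_1
lr = 0.5
for _ in range(5000):
    # gradient of the average cross-entropy loss w.r.t. b0 and b1
    g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys))
    g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys))
    b0 -= lr * g0 / len(xs)
    b1 -= lr * g1 / len(xs)

# predict class 1 when the modeled probability exceeds 0.5
preds = [1 if sigmoid(b0 + b1 * x) > 0.5 else 0 for x in xs]
print(preds)
```

The decision boundary (where the sigmoid crosses 0.5) settles between x = 2 and x = 3, so the fitted model classifies every training point correctly.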

Q.2. What is learning? Discuss any four Learning Techniques.


Ans. Learning refers to the process of acquiring knowledge or skills through experience, study, or instruction. In the
context of machine learning, learning involves building models or algorithms that can automatically improve their
performance on a task based on data. Here are four common learning techniques:
1. Supervised Learning: Supervised learning involves training a model on a labeled dataset, where each data point
is associated with an input feature vector and a corresponding target label. The goal is to learn a mapping from
input features to output labels, such that the model can make accurate predictions on unseen data. Examples of
supervised learning algorithms include linear regression, logistic regression, decision trees, support vector
machines (SVM), and neural networks.
2. Unsupervised Learning: Unsupervised learning involves training a model on an unlabeled dataset, where the
algorithm must discover patterns, structures, or groupings in the data without explicit supervision. The goal is to
explore the underlying structure of the data, such as clustering similar data points together or reducing the
dimensionality of the data. Examples of unsupervised learning algorithms include K-means clustering,
hierarchical clustering, principal component analysis (PCA), and autoencoders.
3. Reinforcement Learning: Reinforcement learning involves training an agent to make sequential decisions in an
environment in order to maximize cumulative rewards. The agent learns through trial and error, receiving
feedback in the form of rewards or penalties based on its actions. The goal is to learn an optimal policy that
maps states to actions, such that the agent can achieve its long-term objectives. Examples of reinforcement
learning algorithms include Q-learning, Deep Q-Networks (DQN), and policy gradient methods.
4. Semi-Supervised Learning: Semi-supervised learning is a combination of supervised and unsupervised learning
techniques, where the model is trained on a dataset that contains both labeled and unlabeled data. The goal is
to leverage the unlabeled data to improve the performance of the model, particularly when labeled data is
scarce or expensive to obtain. Semi-supervised learning algorithms include self-training, co-training, and
generative models such as generative adversarial networks (GANs).
Each learning technique has its own strengths and weaknesses, and the choice of technique depends on the nature of
the data, the task at hand, and the available resources.

Q.3. (a) Discuss Bagging and Boosting.


Ans. Bagging and Boosting are ensemble learning techniques used to improve the performance of machine learning
models by combining multiple base models.
- Bagging (Bootstrap Aggregating): Bagging involves training multiple base models (e.g., decision trees) on different random subsets of the training data, drawn with replacement. Each base model is trained independently, and their predictions are combined through averaging (for regression tasks) or voting (for classification tasks) to make the final prediction. Bagging helps reduce variance and overfitting by introducing diversity among the base models. Random Forest is a popular bagging algorithm that uses decision trees as base models.
- Boosting: Boosting is an iterative ensemble learning technique that trains a sequence of weak learners (models that perform slightly better than random guessing) sequentially, with each subsequent model focusing on the mistakes made by the previous ones. Boosting algorithms assign higher weights to misclassified data points in each iteration, allowing subsequent models to pay more attention to these difficult-to-classify instances. Gradient Boosting Machines (GBM), AdaBoost, and XGBoost are popular boosting algorithms.
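A schematic of the bagging procedure: a hypothetical weak_fit() base learner (here a 1-nearest-neighbor rule, purely as a stand-in) is trained on bootstrap resamples and the copies are combined by majority vote. The dataset is made up for illustration.

```python
import random
from collections import Counter

def weak_fit(sample):
    # stand-in base learner: 1-nearest-neighbor over the given resample
    def predict(x):
        nearest = min(sample, key=lambda item: abs(item[0] - x))
        return nearest[1]
    return predict

def bagging_fit(data, n_models=15, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap resample: draw len(data) points with replacement
        resample = [rng.choice(data) for _ in data]
        models.append(weak_fit(resample))
    def predict(x):
        # majority vote across the base models (classification case)
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

data = [(0.0, 'A'), (1.0, 'A'), (2.0, 'A'), (8.0, 'B'), (9.0, 'B'), (10.0, 'B')]
model = bagging_fit(data)
print(model(1.0), model(9.0))
```

Each resample omits some points and repeats others, which is what gives the base models the diversity that makes averaging reduce variance.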

Q. 3. (b) Write the AdaBoost algorithm.


Ans. AdaBoost (Adaptive Boosting) is a popular boosting algorithm that sequentially trains a series of weak learners and
combines their predictions to create a strong learner. Here's the AdaBoost algorithm:
1. Initialize the sample weights: assign each training sample an equal weight wi = 1/N, where N is the total number of samples.
2. For t = 1 to T, where T is the number of boosting iterations:
 a. Train a weak learner (e.g., a decision tree) on the training data with the current sample weights.
 b. Calculate the weighted error of the weak learner: εt = Σ(i=1..N) wi(t) × Indicator(ht(xi) ≠ yi), where ht(xi) is the prediction of the weak learner for sample xi, yi is the true label, and wi(t) is the weight of sample i at iteration t.
 c. Compute the weak learner's contribution to the ensemble: αt = (1/2) ln((1 − εt) / εt).
 d. Update the sample weights: wi(t+1) = wi(t) × exp(−αt × yi × ht(xi)).
 e. Normalize the sample weights: wi(t+1) = wi(t+1) / Σ(j=1..N) wj(t+1).
3. Output the final ensemble model: H(x) = sign(Σ(t=1..T) αt × ht(x)).
In the final model, H(x) represents the combined prediction of all the weak learners, and αt represents the contribution weight of each weak learner. AdaBoost combines the weak learners' predictions based on their individual performances, with higher weights assigned to the more accurate weak learners.
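The algorithm can be sketched with one-feature threshold "stumps" as the weak learners. The ±1 labels match the sign form of H(x); the dataset and the choice of T = 3 rounds are purely illustrative.

```python
import math

# AdaBoost with threshold stumps on a toy 1-D dataset.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1, 1, -1, -1, 1, 1]

def stump(threshold, polarity):
    return lambda x: polarity if x > threshold else -polarity

candidates = [stump(t + 0.5, p) for t in range(6) for p in (1, -1)]

N = len(X)
w = [1.0 / N] * N                            # step 1: equal initial weights
ensemble = []                                # list of (alpha_t, h_t) pairs
for t in range(3):                           # step 2: T boosting rounds
    # steps a-b: pick the stump with the smallest weighted error
    err, h = min(((sum(wi for wi, x, y in zip(w, X, Y) if hc(x) != y), hc)
                  for hc in candidates), key=lambda pair: pair[0])
    # step c: alpha_t = (1/2) ln((1 - eps_t) / eps_t)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
    ensemble.append((alpha, h))
    # steps d-e: reweight the samples and normalize
    w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, X, Y)]
    total = sum(w)
    w = [wi / total for wi in w]

def H(x):                                    # step 3: sign-combined ensemble
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

print([H(x) for x in X])
```

No single stump can fit this label pattern, but the weighted combination of three stumps classifies every training point correctly, which is exactly the boosting effect.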

Q.4 Explain decision tree algorithm in detail?


Ans. Decision tree is a popular supervised learning algorithm used for both classification and regression tasks. It works
by recursively partitioning the input space into regions and assigning a label or value to each region based on the
majority class or average of the training instances within that region. Here's how the decision tree algorithm works:
1. Selecting the Best Split:
 - The decision tree algorithm starts with the entire dataset at the root node.
 - It evaluates different splitting criteria (e.g., Gini impurity, entropy, or information gain for classification tasks; variance reduction for regression tasks) to determine the best feature and split point that maximizes the homogeneity (or purity) of the resulting child nodes.
2. Splitting the Dataset:
 - Once the best split is found, the dataset is divided into two or more subsets based on the selected feature and split point.
3. Recursive Partitioning:
 - The process is repeated recursively for each child node until one of the stopping criteria is met, such as reaching the maximum depth of the tree, reaching the minimum number of samples in a node, or no further improvement in purity being achievable.
4. Creating Leaf Nodes:
 - Once the stopping criteria are met, the leaf nodes are created, and each leaf node is assigned a class label (for classification) or a predicted value (for regression), typically based on the majority class or average of the training instances within that node.
5. Predictions:
 - During the prediction phase, new instances are passed down the tree, and their class label or predicted value is determined based on the decision path followed from the root node to the corresponding leaf node.
Decision trees are interpretable, easy to visualize, and capable of capturing complex decision boundaries. However, they
are prone to overfitting, especially when the trees are deep and have high variance. Techniques such as pruning, limiting
the tree depth, and using ensemble methods like Random Forests can help mitigate overfitting and improve the
performance of decision trees.
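Step 1 (choosing the best split) can be illustrated for a single numeric feature with the Gini criterion; the values and labels below are made up.

```python
# Finding the best threshold split by weighted Gini impurity.
data = [(1.0, 'A'), (2.0, 'A'), (3.0, 'A'), (7.0, 'B'), (8.0, 'B'), (9.0, 'B')]

def gini(rows):
    # Gini impurity: 1 - sum of squared class proportions
    if not rows:
        return 0.0
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

best = None
for i in range(len(data) - 1):
    # candidate thresholds: midpoints between consecutive feature values
    threshold = (data[i][0] + data[i + 1][0]) / 2
    left = [r for r in data if r[0] <= threshold]
    right = [r for r in data if r[0] > threshold]
    # weighted Gini impurity of the two child nodes
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
    if best is None or score < best[0]:
        best = (score, threshold)

print(best)
```

On this dataset the split at 5.0 separates the two classes perfectly (weighted impurity 0); a full tree builder simply applies this search recursively to each resulting subset.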

FIRST TERM EXAMINATION


2019

Q.1. (a) What are the goals of machine learning?


Ans. The goals of machine learning encompass various objectives and applications, but some common goals include:
1. Prediction: One of the primary goals of machine learning is to make accurate predictions or classifications based
on input data. This involves learning patterns and relationships from historical data to generalize to unseen
instances.
2. Classification: Another goal is to classify data points into different categories or classes based on their features.
Classification tasks include spam detection, image recognition, and sentiment analysis.
3. Regression: Regression aims to predict continuous numerical values based on input features. It is used in various
applications such as predicting house prices, stock market trends, or customer churn rates.
4. Clustering: Clustering involves grouping similar data points together based on their inherent patterns or
similarities. Clustering algorithms help discover hidden structures within the data and are used in customer
segmentation, anomaly detection, and recommendation systems.
5. Anomaly Detection: Detecting anomalies or outliers in data is another goal of machine learning. Anomaly
detection techniques help identify unusual or unexpected patterns that deviate from normal behavior, such as
fraudulent transactions or equipment failures.
6. Reinforcement Learning: In reinforcement learning, the goal is to develop agents that learn optimal decision-
making strategies by interacting with an environment. The agent aims to maximize cumulative rewards over
time by taking actions and learning from feedback.
Q. 1. (b) Explain Overfitting?
Ans. Overfitting is a common problem in machine learning where a model learns the training data too well, capturing
noise or random fluctuations in the data rather than the underlying pattern. This results in a model that performs well
on the training data but poorly on unseen or test data. Overfitting typically occurs when a model is too complex relative
to the amount of training data available.
The main causes of overfitting include:
- Model Complexity: Models with a large number of parameters or high flexibility can capture noise in the training data, leading to overfitting.
- Insufficient Data: If the training dataset is small, the model may memorize the training examples rather than learning the underlying patterns.
- Lack of Regularization: Without regularization techniques such as L1 or L2 regularization, dropout, or early stopping, the model may overfit the training data by fitting too closely to the noise.
To detect and prevent overfitting, various techniques can be employed:
- Cross-Validation: Splitting the dataset into training and validation sets and using cross-validation to evaluate model performance can help detect overfitting.
- Regularization: Adding penalty terms to the loss function, such as L1 or L2 regularization, helps prevent overfitting by penalizing large parameter values.
- Dropout: Dropout randomly deactivates neurons during training, forcing the model to learn redundant representations and reducing overfitting.
- Early Stopping: Monitoring the validation loss during training and stopping the training process when the validation loss starts to increase can prevent overfitting.
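The early-stopping idea can be written as a small generic loop. Here train_step() and validation_loss() are hypothetical stand-ins for whatever model is being trained; only the stopping logic itself is concrete.

```python
def early_stopping_fit(train_step, validation_loss, max_epochs=100, patience=5):
    """Train until the validation loss stops improving for `patience` epochs."""
    best_loss, best_epoch = float('inf'), 0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()                       # one pass over the training data
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # validation loss stopped improving
    return best_epoch, best_loss
```

In practice one would also snapshot the model parameters at best_epoch, so the returned epoch identifies which weights to restore.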

Q.1. (c) What is nearest neighbor?


Ans. Nearest Neighbor (NN) is a simple yet effective algorithm used for classification and regression tasks. It belongs to
the family of instance-based learning algorithms and works by comparing a new data point to the existing data points in
the training dataset to make predictions.
In the case of classification, the nearest neighbor algorithm identifies the training data point(s) that are closest (most
similar) to the new data point based on a distance metric, such as Euclidean distance or cosine similarity. The class label
of the majority of the nearest neighbors is assigned to the new data point.
In the case of regression, the nearest neighbor algorithm predicts the target value for the new data point by averaging
the target values of its nearest neighbors.
Nearest Neighbor is a non-parametric algorithm, meaning it does not make any assumptions about the underlying
distribution of the data. It is also known as a lazy learner because it does not explicitly learn a model from the training
data; instead, it memorizes the training instances and uses them during prediction.
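A minimal sketch of the classification case (k = 3, Euclidean distance, majority vote) on a made-up 2-D dataset:

```python
from collections import Counter

# Tiny labeled training set: two groups of 2-D points.
train = [((1.0, 1.0), 'red'), ((1.5, 2.0), 'red'), ((2.0, 1.0), 'red'),
         ((6.0, 6.0), 'blue'), ((7.0, 5.5), 'blue'), ((6.5, 7.0), 'blue')]

def knn_predict(query, k=3):
    # sort training points by squared Euclidean distance to the query
    by_distance = sorted(train, key=lambda item:
                         (item[0][0] - query[0]) ** 2 +
                         (item[0][1] - query[1]) ** 2)
    # majority vote among the k nearest neighbors
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.3)), knn_predict((6.4, 6.2)))
```

Note that all the work happens at prediction time (the "lazy learner" property): there is no training step beyond storing the data.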

Q. 1. (d) Describe the limitations of the perceptron model.


Ans. The perceptron model, while pioneering in neural network research, has several limitations:
1. Linear Separability Requirement: The original perceptron algorithm can only learn linearly separable patterns. It
fails to converge if the data is not linearly separable, leading to limited applicability in real-world scenarios
where the data is often complex and non-linearly separable.
2. Binary Classification: The perceptron model is restricted to binary classification tasks, where it predicts whether
a data point belongs to one of the two classes. Extending perceptrons to handle multi-class classification
requires additional techniques such as one-vs-all or one-vs-one strategies.
3. Single-Layer Architecture: The perceptron model consists of a single layer of neurons, limiting its ability to learn
complex hierarchical patterns and relationships in the data. It cannot capture non-linear decision boundaries or
represent complex functions effectively.
4. Sensitivity to Initialization: The performance of the perceptron model is sensitive to the initialization of weights.
Different initializations can lead to different solutions, and the model may converge to suboptimal solutions or
fail to converge altogether.
5. Perceptron Convergence Theorem: While the perceptron convergence theorem guarantees convergence for
linearly separable data, it does not provide any guarantees for non-linearly separable data. In practice, the
convergence rate and stability of the perceptron algorithm can vary depending on the data distribution and the
choice of hyperparameters.
6. No Probabilistic Interpretation: The perceptron model does not provide probabilistic outputs or uncertainty
estimates, making it unsuitable for probabilistic classification tasks or applications where confidence levels are
important.
Despite these limitations, the perceptron model laid the foundation for more advanced neural network architectures
and learning algorithms, leading to the development of modern deep learning models capable of handling complex tasks
such as image recognition, natural language processing, and reinforcement learning.
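Limitation 1 can be demonstrated with a tiny experiment: the same perceptron update rule converges on the linearly separable AND function but can never reach perfect accuracy on XOR. A minimal sketch with illustrative names, not tied to any library:

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """Classic perceptron rule: w <- w + lr * error * x (with a bias term).

    Returns the weights and the accuracy on the training data.
    """
    w = [0.0, 0.0, 0.0]  # [bias, w1, w2]
    for _ in range(epochs):
        for (x1, x2), target in data:
            pred = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
            err = target - pred
            w[0] += lr * err
            w[1] += lr * err * x1
            w[2] += lr * err * x2
    correct = sum(
        (1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0) == t
        for (x1, x2), t in data
    )
    return w, correct / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not separable

_, acc_and = train_perceptron(AND)
_, acc_xor = train_perceptron(XOR)
print(acc_and, acc_xor)
```

On AND the weights settle and accuracy reaches 1.0; on XOR the updates cycle indefinitely and no number of epochs yields a perfect linear separator.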

Q. 2. (a) What is learning? Discuss any two Learning Techniques. (2018)
Q. 2. (b) What is the difference between supervised and unsupervised learning? (2018)

Q. 3. (a) What is Naive Bayes theorem? How is it useful in machine learning?


Ans. The Naive Bayes theorem is a fundamental probabilistic result used for classification tasks in machine learning. It is based on Bayes' theorem, which describes the probability of a hypothesis given observed evidence.
Mathematically, the Naive Bayes classifier can be expressed as:
P(y | x1, x2, ..., xn) = P(y) × P(x1, x2, ..., xn | y) / P(x1, x2, ..., xn)
Where:
• P(y | x1, x2, ..., xn) is the posterior probability of class y given input features x1, x2, ..., xn.
• P(y) is the prior probability of class y.
• P(x1, x2, ..., xn | y) is the likelihood of observing features x1, x2, ..., xn given class y.
• P(x1, x2, ..., xn) is the marginal probability of observing features x1, x2, ..., xn.
The "naive" assumption in Naive Bayes is that the features are conditionally independent given the class label y, meaning that the presence of one feature does not affect the presence of another. This assumption simplifies the computation of the likelihood term.
Naive Bayes classifiers are widely used in text classification tasks, such as spam filtering and document categorization, where the features represent word occurrences or frequencies. Despite its simplicity and the naive assumption, Naive Bayes often performs well in practice and is computationally efficient, making it suitable for large-scale classification tasks.

Q. 3. (b) Explain the generative probabilistic classification.
Ans. Generative probabilistic classification is a classification approach that models the joint probability distribution of the
input features and the class labels. Unlike discriminative models, which directly learn the decision boundary between
classes, generative models learn the probability distribution of each class and use Bayes' theorem to compute the
posterior probability of each class given the input features.
In generative probabilistic classification, the model assumes a specific distribution for the input features conditioned on
each class. Commonly used distributions include Gaussian (for continuous features), multinomial (for discrete features),
and Bernoulli (for binary features). The parameters of these distributions are estimated from the training data.
Once the probability distributions for each class are learned, the model can compute the likelihood P(x | y) of observing the input features x given each class y. Then, Bayes' theorem is used to compute the posterior probability P(y | x) of each class given the input features.
Generative probabilistic classifiers include algorithms such as Naive Bayes, Gaussian Naive Bayes, and multinomial Naive
Bayes. These classifiers are particularly useful when dealing with small training datasets or when the assumption of
feature independence is reasonable.
One advantage of generative models is their ability to generate synthetic samples from each class's probability
distribution, allowing for data augmentation and generating new samples for imbalanced datasets. However, they may
suffer from model misspecification if the assumed distribution does not match the true distribution of the data.
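The generative recipe described above (estimate per-class priors and likelihoods, then apply Bayes' theorem) can be sketched with a tiny Gaussian Naive Bayes classifier. This is a plain-Python illustration, assuming continuous features modelled as independent Gaussians per class; the function names and toy data are my own:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian mean/variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for cls, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids /0
                 for col, m in zip(zip(*rows), means)]
        model[cls] = (n / len(X), means, varis)
    return model

def predict(model, x):
    """Pick the class maximising log P(y) + sum_j log P(x_j | y)."""
    def log_post(cls):
        prior, means, varis = model[cls]
        ll = math.log(prior)
        for xj, m, v in zip(x, means, varis):
            ll += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        return ll
    return max(model, key=log_post)

X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (6.0, 9.0), (6.2, 8.8), (5.8, 9.2)]
y = ["low", "low", "low", "high", "high", "high"]
model = fit_gaussian_nb(X, y)
print(predict(model, (1.1, 2.1)))   # a point near the "low" cluster
```

Working in log space avoids numerical underflow when many features are multiplied together.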

Q.4. (a) Discuss Bagging and Boosting. (2018)


Q. 4.(b) Write ADA Boost Algorithm. (2018)

FIRST TERM EXAMINATION


2023

Q.1. (a) What do you understand by noise in data? What could be the implications on the result, if noise is not treated
properly?
Ans. Noise in data refers to random variations or errors in the data that do not represent the underlying patterns or
relationships. Noise can arise from various sources, such as measurement errors, data collection artifacts, or natural
variability in the phenomenon being observed.
The implications of not treating noise properly in data can be significant:
1. Reduced Accuracy: Noise can distort the true relationships between variables, leading to inaccurate predictions
or classifications. Models trained on noisy data may produce unreliable results and have poor performance on
unseen data.
2. Overfitting: Noise in the training data can be mistaken by the model as meaningful patterns, resulting in
overfitting. Overfitting occurs when the model captures noise in the training data rather than the underlying
structure, leading to poor generalization to new data.
3. Misinterpretation of Results: Noise can obscure meaningful insights or relationships in the data, leading to
incorrect conclusions or misinterpretations of the results. Decision-makers may make flawed decisions based on
unreliable or misleading findings.
4. Decreased Robustness: Models trained on noisy data may lack robustness and fail to generalize well to new or
unseen data. They may be sensitive to small variations in the input data and produce inconsistent or unstable
predictions.
5. Wasted Resources: Analyzing or modeling noisy data can waste time, computational resources, and effort. It
may require additional preprocessing steps or model tuning to mitigate the effects of noise, increasing the
complexity and cost of the analysis.
To mitigate the effects of noise, it is essential to preprocess the data carefully, including cleaning, filtering, and
normalization steps. Additionally, using robust modeling techniques, such as regularization or ensemble methods, can
help improve the model's resilience to noise and enhance its generalization performance.
Q.1. (b) What do you understand by overfitting of data? Give any two methods to avoid overfitting?
Ans. Overfitting occurs when a machine learning model captures noise or random fluctuations in the training data rather
than the underlying patterns or relationships. As a result, the model performs well on the training data but fails to
generalize to new, unseen data.
Two methods to avoid overfitting include:
1. Cross-Validation: Cross-validation is a technique used to assess a model's performance and generalization
ability. Instead of evaluating the model on the training data alone, cross-validation involves splitting the dataset
into multiple subsets (e.g., training and validation sets) and iteratively training and evaluating the model on
different combinations of these subsets. This allows for more robust estimation of the model's performance and
helps identify potential overfitting by assessing its performance on unseen data.
2. Regularization: Regularization is a technique used to prevent overfitting by adding a penalty term to the model's
objective function. Common regularization techniques include L1 regularization (Lasso), L2 regularization
(Ridge), and elastic net regularization. These techniques impose constraints on the model's parameters,
encouraging simpler models and reducing the tendency to fit noise in the training data. Regularization helps
balance the trade-off between model complexity and generalization performance, leading to more robust and
interpretable models.
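The cross-validation idea in method 1 can be sketched without any library. The code below is an illustrative k-fold splitter and scoring loop; the majority-class "model" is just a stand-in for a real learner, and all names are my own:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (no shuffling here)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, k, train_and_score):
    """For each fold, train on the rest and score on the held-out fold."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for held_out in folds:
        train_idx = [i for i in range(len(X)) if i not in held_out]
        scores.append(train_and_score(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k

# Example scorer: a trivial majority-class "model" (stands in for a real learner).
def majority_scorer(Xtr, ytr, Xte, yte):
    majority = max(set(ytr), key=ytr.count)
    return sum(label == majority for label in yte) / len(yte)

X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(cross_validate(X, y, 5, majority_scorer))
```

In real use the data would usually be shuffled (or stratified by class) before splitting, so each fold reflects the overall class distribution.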

Q.1. (c) When should we use classification over regression? Explain using example.
Ans. Classification and regression are two types of supervised learning tasks used to solve different types of problems:
 Classification: Classification is used when the target variable (or output) is categorical or discrete, meaning it
belongs to a specific class or category. The goal of classification is to predict the class label of new data points
based on their features. Classification problems include binary classification (two classes) and multiclass
classification (more than two classes).
 Regression: Regression is used when the target variable is continuous or numerical, meaning it can take any real
value within a specific range. The goal of regression is to predict the numerical value of the target variable based
on the input features. Regression problems include predicting house prices, stock prices, or temperature.
For example, suppose we have a dataset containing information about customers' attributes (such as age, income, and
education level) and whether they purchased a product (yes or no). In this scenario, we want to predict whether a new
customer will purchase the product or not based on their attributes. Since the target variable (purchase decision) is
categorical (yes or no), this problem is a classification problem. We would use a classification algorithm such as logistic
regression, decision trees, or support vector machines to build a model that can classify new customers into the
"purchased" or "not purchased" categories.
On the other hand, if we want to predict the amount of money a customer is likely to spend on the product based on
their attributes, we would use regression. In this case, the target variable (amount spent) is continuous, and we would
use regression techniques such as linear regression, polynomial regression, or neural networks to predict the numerical
value of the amount spent based on the input features.
In summary, we use classification when the target variable is categorical, and the goal is to predict class labels, while we
use regression when the target variable is continuous, and the goal is to predict numerical values.

Q.1. (d) Define the terms: Precision, Recall, F1-score and Accuracy?


Ans.
• Precision: Precision measures the proportion of true positive predictions among all positive predictions made by the model. It quantifies the model's ability to avoid false positives and is calculated as the ratio of true positives to the sum of true positives and false positives.
Precision = True Positives / (True Positives + False Positives)
• Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances in the dataset. It quantifies the model's ability to capture all positive instances and is calculated as the ratio of true positives to the sum of true positives and false negatives.
Recall = True Positives / (True Positives + False Negatives)
• F1-score: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. It ranges between 0 and 1, with higher values indicating better model performance:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
• Accuracy: Accuracy measures the proportion of correctly classified instances (both true positives and true negatives) among all instances in the dataset. It quantifies the overall correctness of the model's predictions:
Accuracy = (True Positives + True Negatives) / Total Instances
These metrics are commonly used to evaluate the performance of classification models, providing insights into the
model's ability to make correct predictions and avoid errors. Depending on the specific requirements of the problem,
different metrics may be prioritized. For example, in a medical diagnosis task, recall may be more important than
precision to ensure that all positive instances (e.g., patients with a disease) are correctly identified, even at the cost of
some false positives.
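The four definitions translate directly into code. A small sketch computing all four metrics from confusion-matrix counts (the example counts are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Example: 80 true positives, 20 false positives, 10 false negatives, 90 true negatives.
p, r, f1, acc = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
```

A production implementation would also guard against zero denominators (for example, when a model predicts no positives at all).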

Q.1. (e) Define LDA and any of its two limitations.


Ans. LDA (Linear Discriminant Analysis): Linear Discriminant Analysis is a dimensionality reduction technique and a supervised classification algorithm. It seeks to find a linear combination of features that best separates multiple classes in the dataset. LDA projects the data onto a lower-dimensional space while maximizing the between-class distance and minimizing the within-class variance.
Two limitations of LDA are:
1. Assumption of Linear Separability: LDA assumes that the classes in the dataset are linearly separable, meaning
they can be separated by a linear decision boundary. In practice, if the classes are not linearly separable, LDA
may produce suboptimal results or fail to find an effective discriminant subspace.
2. Sensitivity to Outliers: LDA is sensitive to outliers in the data, as it seeks to minimize the within-class variance.
Outliers can significantly impact the estimation of class centroids and scatter matrices, leading to biased
projections and potentially poor classification performance. Preprocessing techniques such as outlier removal or
robust estimators may be necessary to mitigate the effects of outliers when using LDA.
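For the two-class case, the projection direction described above has the closed form w ∝ Sw⁻¹(μ₁ − μ₂), where Sw is the within-class scatter matrix. A plain-Python sketch for 2-D data (the function names and toy clusters are illustrative, not from any library):

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter_within(rows, mu):
    """2x2 within-class scatter: sum of outer products of centred points."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S

def fisher_direction(class_a, class_b):
    """w proportional to Sw^{-1} (mu_a - mu_b) for two classes in 2-D."""
    mu_a, mu_b = mean_vec(class_a), mean_vec(class_b)
    Sa, Sb = scatter_within(class_a, mu_a), scatter_within(class_b, mu_b)
    Sw = [[Sa[i][j] + Sb[i][j] + (1e-6 if i == j else 0.0)  # tiny ridge for stability
          for j in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

A = [(1.0, 2.0), (1.5, 2.2), (1.2, 1.9)]
B = [(4.0, 5.0), (4.5, 5.3), (4.2, 4.8)]
w = fisher_direction(A, B)
# Projections of the two classes onto w should be well separated.
proj_a = [w[0] * x + w[1] * y for x, y in A]
proj_b = [w[0] * x + w[1] * y for x, y in B]
print(max(proj_a) < min(proj_b) or min(proj_a) > max(proj_b))
```

The tiny ridge added to Sw also hints at how LDA's outlier sensitivity is handled in practice: a single extreme point can badly distort Sw, which regularized or robust scatter estimates mitigate.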

Q.2. (a) Differentiate between Supervised Learning and Unsupervised learning. (2018, 2019)
Q.2. (b) Explain the generative probabilistic classification. (2019)
Q.2. (c) Discuss bagging and boosting. (2018, 2019)
Q.3. (a) Explain Bayesian estimation and maximum likelihood estimation in generative learning.
Ans. In generative learning, both Bayesian estimation and maximum likelihood estimation (MLE) are techniques used to
estimate the parameters of a probability distribution that best fit the observed data. These techniques are fundamental
in probabilistic modeling and are commonly employed in tasks such as classification and density estimation.
1. Bayesian Estimation:
Bayesian estimation incorporates prior knowledge about the parameters of the probability distribution into the
estimation process. It updates the prior knowledge based on observed data to obtain a posterior distribution over the
parameters.
Mathematically, Bayesian estimation involves calculating the posterior distribution P(θ | D) of the parameters θ given the observed data D using Bayes' theorem:
P(θ | D) = P(D | θ) × P(θ) / P(D)
Where:
• P(θ | D) is the posterior distribution of the parameters given the data.
• P(D | θ) is the likelihood function, representing the probability of observing the data given the parameters.
• P(θ) is the prior distribution of the parameters, representing any prior knowledge or beliefs about their values.
• P(D) is the marginal likelihood or evidence, representing the probability of observing the data.
The posterior distribution provides a complete representation of the uncertainty in the parameters after observing the
data. In practice, the posterior distribution is often approximated using techniques such as Markov chain Monte Carlo
(MCMC) or variational inference.
2. Maximum Likelihood Estimation (MLE):
Maximum Likelihood Estimation (MLE) is a frequentist approach that seeks to find the set of parameters that maximizes
the likelihood of observing the given data. It assumes that the observed data are generated from a specific probability
distribution with unknown parameters, and the goal is to find the parameter values that make the observed data most
probable.
Mathematically, MLE involves maximizing the likelihood function L(θ | D) with respect to the parameters θ:
θ̂_MLE = arg max_θ L(θ | D)
Where:
• θ̂_MLE is the maximum likelihood estimate of the parameters.
• L(θ | D) is the likelihood function, representing the probability of observing the data given the parameters.
In practice, it is often more convenient to maximize the log-likelihood function ℓ(θ | D), the natural logarithm of the likelihood. Maximizing the log-likelihood is equivalent to maximizing the likelihood and simplifies the optimization process.
In summary, Bayesian estimation incorporates prior knowledge into the estimation process, resulting in a posterior
distribution over the parameters, while maximum likelihood estimation seeks to find the parameter values that
maximize the likelihood of observing the given data. Both techniques are widely used in generative learning and provide
different perspectives on parameter estimation.
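The contrast between the two estimators is easiest to see for a coin-flip (Bernoulli) model, where both have closed forms. A sketch, assuming a Beta(α, β) prior for the Bayesian estimate (the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is again a Beta distribution):

```python
def coin_estimates(heads, flips, alpha=2.0, beta=2.0):
    """Compare MLE with a Bayesian posterior-mean estimate of a coin's bias.

    MLE:      theta_hat = heads / flips
    Bayesian: with a Beta(alpha, beta) prior, the posterior is
              Beta(alpha + heads, beta + flips - heads), whose mean is
              (alpha + heads) / (alpha + beta + flips).
    """
    mle = heads / flips
    posterior_mean = (alpha + heads) / (alpha + beta + flips)
    return mle, posterior_mean

# With little data the prior pulls the Bayesian estimate toward 0.5 ...
print(coin_estimates(heads=3, flips=3))
# ... and with lots of data the two estimates converge.
print(coin_estimates(heads=900, flips=1000))
```

After 3 heads in 3 flips the MLE declares the coin always lands heads (θ̂ = 1.0), while the posterior mean 5/7 ≈ 0.71 stays more conservative; with 900/1000 the two estimates nearly agree, illustrating that the prior's influence fades as data accumulates.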

Q.3. (b) Explain the decision tree algorithm with example. (2018)
Q.4. Write short note on:
Q.4. (a) Support Vector Machine (SVM).
Ans. Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and
regression tasks. The primary objective of SVM is to find the optimal hyperplane that best separates the data points into
different classes while maximizing the margin between the classes.
Key features of SVM include:
 Maximum Margin: SVM seeks to find the hyperplane that maximizes the margin between the closest data
points of different classes, known as support vectors. By maximizing the margin, SVM aims to achieve better
generalization performance and robustness to noise.
 Kernel Trick: SVM can efficiently handle nonlinear decision boundaries by mapping the input features into a
higher-dimensional space using kernel functions such as polynomial, radial basis function (RBF), or sigmoid
kernels. This allows SVM to capture complex patterns and achieve better classification performance.
 Regularization: SVM incorporates a regularization parameter (C) to control the trade-off between maximizing
the margin and minimizing the classification error on the training data. A smaller C value leads to a larger margin
but may result in more misclassifications, while a larger C value prioritizes correct classification but may lead to
overfitting.
 Sparsity of Solution: SVM solutions are typically sparse, meaning they depend only on a subset of the training
data points (support vectors) that lie on or near the decision boundary. This property makes SVM memory-
efficient and well-suited for high-dimensional datasets.
 Binary Classification: SVM is originally designed for binary classification tasks, but it can be extended to handle
multiclass classification using strategies such as one-vs-one or one-vs-rest.
SVM has applications in various domains, including image classification, text classification, bioinformatics, and finance.
Despite its effectiveness, SVM may suffer from scalability issues with large datasets and requires careful selection of
hyperparameters, such as the choice of kernel and regularization parameter.
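The soft-margin objective sketched above (margin maximisation traded off against hinge-loss violations via C) can be minimised with plain sub-gradient descent. This is an illustrative toy implementation for a linear kernel only, not a production SVM solver (which would typically use SMO or a dedicated QP solver); all names and the toy data are my own:

```python
def train_linear_svm(data, C=1.0, lr=0.01, epochs=500):
    """Minimise 0.5*||w||^2 + C * sum(hinge losses) by sub-gradient descent.

    `data` is a list of ((x1, x2), label) pairs with labels in {-1, +1}.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw, gb = [w[0], w[1]], 0.0   # gradient of the regulariser 0.5*||w||^2
        for (x1, x2), yv in data:
            margin = yv * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:           # point violates the margin: hinge is active
                gw[0] -= C * yv * x1
                gw[1] -= C * yv * x2
                gb -= C * yv
        w[0] -= lr * gw[0]
        w[1] -= lr * gw[1]
        b -= lr * gb
    return w, b

data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((-2.0, -2.0), -1), ((-3.0, -3.0), -1)]
w, b = train_linear_svm(data)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for (x1, x2), _ in data]
print(preds)
```

Only the points closest to the boundary ever activate the hinge term, which is the sparsity property noted above: the learned hyperplane depends on the support vectors alone.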

Q.4. (b) Given a data set and set of machine algorithms, how to choose an appropriate algorithm.
Ans. Choosing an appropriate machine learning algorithm for a given dataset depends on several factors, including the
nature of the data, the problem domain, the size of the dataset, and the computational resources available. Here are
some guidelines to help choose the right algorithm:
1. Understand the Problem: Gain a clear understanding of the problem you are trying to solve, including the type
of task (classification, regression, clustering, etc.), the nature of the input features and target variable, and any
specific constraints or requirements.
2. Explore the Data: Analyze the characteristics of the dataset, such as the number of features, the distribution of
the target variable, the presence of missing values or outliers, and the level of noise in the data. Visualization
techniques can help gain insights into the data's structure and relationships.
3. Consider Algorithm Capabilities: Different machine learning algorithms have different strengths and
weaknesses. Consider the algorithm's capabilities, such as its ability to handle nonlinear relationships, high-
dimensional data, imbalanced classes, or large datasets. For example, decision trees are interpretable but may
struggle with capturing complex patterns, while deep learning models can learn intricate representations but
require large amounts of data and computational resources.
4. Evaluate Performance: Assess the performance of multiple algorithms on the dataset using appropriate
evaluation metrics and cross-validation techniques. Compare the algorithms based on metrics such as accuracy,
precision, recall, F1-score, or area under the ROC curve (AUC). It may be necessary to tune hyperparameters or
preprocessing steps for each algorithm to optimize performance.
5. Consider Interpretability and Complexity: Depending on the application requirements, consider the
interpretability of the model and its complexity. Simple models like logistic regression or decision trees may
offer interpretability but may sacrifice predictive performance compared to more complex models like ensemble
methods or deep learning models.
6. Iterate and Refine: Experiment with different algorithms, feature engineering techniques, and hyperparameter
settings iteratively. Continuously refine the model based on feedback from performance evaluation and domain
knowledge.
7. Domain Expertise: Leverage domain knowledge and expertise to guide the selection of algorithms and
interpretation of results. Domain-specific constraints or insights may influence the choice of algorithms and
preprocessing steps.
By following these guidelines and iteratively experimenting with different algorithms, you can identify the most
appropriate machine learning algorithm for your dataset and problem domain.

Q.4. (c) Logistic Regression. (2018)
